66 PP16 Abstracts

IP1 [email protected] Scalability of Sparse Direct Codes IP4 As part of the H2020 FET-HPC Project NLAFET, we are studying the scalability of algorithms and software for us- Next-Generation AMR ing direct methods for solving large sparse equations. We Block-structured adaptive mesh refinement (AMR) is a examine how techniques that work well for dense system powerful tool for improving the computational efficiency solution, for example communication avoiding strategies, and reducing the memory footprint of structured-grid nu- can be adapted for the sparse case. We also study the merical simulations. AMR techniques have been used for benefits of using standard run time systems to assist us in over 25 years to solve increasingly complex problems. I developing codes for extreme scale computers. We will look will give an overview of what we are doing in Berkeley at algorithms for solving both sparse symmetric indefinite Labs AMR framework, BoxLib, to address the challenges systems and unsymmetric systems. of next-generation multicore architectures and the com- plexity of multiscale, multiphysics problems, including new Iain Duff ways of thinking about multilevel algorithms and new ap- Rutherford Appleton Laboratory, Oxfordshire, UK OX11 proaches to data layout and load balancing, in situ and 0QX in transit visualization and analytics, and run-time perfor- and CERFACS, Toulouse, France mance modeling and control. iain.duff@stfc.ac.uk Ann S. Almgren Lawrence Berkeley National Laboratory IP2 [email protected] Towards Solving Coupled Flow Problems at the Ex- treme Scale IP5 Scalability is only a necessary prerequisite for extreme scale Beating Heart Simulation Based on a Stochastic computing. Beyond scalability, we propose here an archi- Biomolecular Model tecture aware co-design of the models, algorithms, and data structures. The benefit will be demonstrated for the Lat- In my talk, I will introduce a new parallel computational tice Boltzmann method as example of an explicit time step- approach indispensable to simulate a beating human heart ping scheme and for a finite element multigrid method as driven by stochastic biomolecular dynamics. In this ap- an implicit solver. In both cases, systems with more than proach, we start from a molecular level modeling, and we ten trillion (1013) degrees of freedom can be solved already directly simulate individual molecules in sarcomere by the on the current peta-scale machines. Monte Carlo (MC) method, thereby naturally handle the cooperative and stochastic molecular behaviors. Then we Ulrich J. Ruede imbed these sarcomere MC samples into contractile ele- University of Erlangen-Nuremberg ments in continuum material model discretized by finite Department of Computer Science (Simulation) element method. The clinical applications of our heart [email protected] simulator are also introduced. Takumi Washio IP3 University of Tokyo, [email protected] Why Future Systems should be Data Centric

We are immersed in data. Classical high performance com- IP6 puting often generates enormous structured data through Conquering Big Data with Spark simulation; but, a vast amount of unstructured data is also being generated and is growing at exponential rates. To Today, big and small organizations alike collect huge meet the data challenge, we expect computing to evolve in amounts of data, and they do so with one goal in mind: two fundamental ways: first through a focus on workflows, extract ”value” through sophisticated exploratory analy- and second to explicitly accommodate the impact of big sis, and use it as the basis to make decisions as varied data and complex analytics. As an example, HPC systems as personalized treatment and ad targeting. To address should be optimized to perform well on modeling and simu- this challenge, we have developed Berkeley Data Analytics lation; but, should also focus on other important elements Stack (BDAS), an open source data analytics stack for big of the overall workflow which include data management data processing. and manipulation coupled with associated cognitive ana- lytics. At a macro level, workflows will take advantage of Ion Stoica different elements of the systems hierarchy, at dynamically University of California, Berkeley, USA varying times, in different ways and with different data [email protected] distributions throughout the hierarchy. This leads us to a data centric design point that has the flexibility to handle these data-driven demands. I will explain the attributes IP7 of a data centric system and discuss challenges we see for A Systems View of High Performance Computing future computing systems. It is highly desirable that the data centric architecture and implementation lead to com- HPC systems comprise increasingly many fast components, mercially viable Exascale-class systems on premise and in all working together on the same problem concurrently. the cloud. Yet, HPC efficient-programming dogma dictates that com- municating is too expensive and should be avoided. Pro- Paul Coteus gram ever more components to work together with ever less IBM Research, USA frequent communication?!? Seriously? I object! In other PP16 Abstracts 67

disciplines, deriving efficiency from large Systems requires nology and reaching the capabilities of the human brain. an approach qualitatively different from that used to derive In spite of good progress in high performance computing efficiency from small Systems. Should we not expect the (HPC) and techniques such as machine learning, this gap same in HPC? Using Grappa, a latency-tolerant runtime is still very large. I will then explore two related topics for HPC systems, I will illustrate through examples that through a discussion of recent examples: what can we learn such a qualitatively different approach can indeed yield ef- from the brain and apply to HPC, e.g., through recent ef- ficiency from many components without stifling communi- forts in neuromorphic computing? And how much progress cation. have we made in modeling brain function? The talk will be concluded with my perspective on the true dangers of su- Simon Kahan perintelligence, and on our ability to ever build self-aware Institute for Systems Biology and University of or sentient computers. Washington, USA Horst D. Simon [email protected] Lawrence Berkeley Laboratory [email protected] SP0 SIAG/Supercomputing Best Paper Prize: CP1 Communication-Optimal Parallel and Sequen- Scale-Bridging Modeling of Material Dynamics: tial QR and LU Factorizations Petascale Assessments on the Road to Exascale

We present parallel and sequential dense QR factoriza- Within the multi-institutional, multi-disciplinary Exascale tion algorithms that are both optimal (up to polylogarith- Co-design Center for Materials in Extreme Environments mic factors) in the amount of communication they per- (ExMatEx), we are developing adaptive physics refinement form and just as stable as Householder QR. We prove methods that exploit hierarchical, heterogeneous architec- optimality by deriving new lower bounds for the num- tures to achieve more realistic large-scale simulations. To ber of multiplications done by non-Strassen-like QR, and evaluate and exercise the task-based programming mod- using these in known communication lower bounds that els, databases, and runtime systems required to perform are proportional to the number of multiplications. We many-task computation workflows that accumulate a dis- not only show that our QR algorithms attain these lower tributed response database from fine-scale calculations, we bounds (up to polylogarithmic factors), but that existing are preforming petascale demonstrations on the Trinity su- LAPACK and ScaLAPACK algorithms perform asymptot- percomputer, combining Haswell, Knights Landing, and ically more communication. We derive analogous commu- burst buffer nodes. nication lower bounds for LU factorization and point out recent LU algorithms in the literature that attain at least Timothy C. Germann some of these lower bounds. The sequential and parallel Los Alamos National Laboratory QR algorithms for tall and skinny matrices lead to signif- [email protected] icant speedups in practice over some of the existing algo- rithms, including LAPACK and ScaLAPACK, for example, up to 6.7 times over ScaLAPACK. A performance model CP1 for the parallel algorithm for general rectangular matrices Parallel Recursive Density Matrix Expansion in predicts significant speedups over ScaLAPACK. Electronic Structure Calculations

James W. Demmel We propose a parallel recursive polynomial expansion al- University of California gorithm for construction of the density matrix in linear Division of Computer Science scaling electronic structure theory. The expansion con- [email protected] tains operations on matrices with irregular sparsity pat- terns, in particular matrix-matrix multiplication. By using Laura Grigori the Chunks and Tasks programming model[Parallel Com- INRIA put. 40, 328 (2014)] beforehand unknown sparsity pattern France is exploited while preventing load imbalance. We evaluate [email protected] the performance in electronic structure calculations, inves- tigating linear scaling with system size and parallel strong Mark Hoemmen and weak scaling. Sandia National Laboratories Anastasia Kruchinina [email protected] Uppsala universitet [email protected] Julien Langou University of Colorado Denver EmanuelH.Rubensson [email protected] Uppsala University [email protected] SP1 SIAG/Supercomputing Career Prize: Supercom- Elias Rudberg puters and Superintelligence Division of Scientific Computing, Department of I.T. Uppsala University In recent years the idea of emerging superintelligence has [email protected] been discussed widely by popular media, and many experts voiced grave warnings about its possible consequences. This talk will use an analysis of progress in supercomputer CP1 performance to examine the gap between current tech- Comparison of Optimization Techniques in Parallel 68 PP16 Abstracts

Ab Initio Nuclear Structure Calculations [email protected]

Many tasks in ab initio nuclear structure calculations may rely on search techniques, such as tuning nucleon-nucleon CP2 and three-nucleon interactions to fit light nuclei. This field Increasing the Arithmetic Intensity of a Sparse is still under-served, however, by the existing optimiza- Eigensolver for Better Performance on Heteroge- tion software, mainly due to the complexity of the under- nous Clusters lying function and derivative evaluations. Previously, we have developed a flexible framework to interface the MFDn The performance of sparse eigensolvers is typically limited package for large-scale ab initio nuclear structure calcula- by the network and/or main memory bandwidth. In this tions with derivative-free optimization packages. Recently, ‘bandwidth bounded’ regime it is possible to perform ad- we have obtained results with the derivative-based opti- ditional ‘free flops’ to improve the robustness and/or con- mization techniques. In this talk we compare these two vergence rate of a method. We present block orthogonal- broad optimization approaches as applied to parallel ab ization techniques in simulated quadruple precision that initio nuclear structure calculations and note on efficient exploit specific hardware features. As an example of use, function evaluations when several nuclei are fitted simulta- we investigate the performance of a block Jacobi-Davidson neously. method on compute clusters comprising multi/manycore CPUs and GPGPUs. Masha Sosonkina Old Dominion University Jonas Thies Ames Laboratory/DOE German Aerospace Center (DLR) [email protected] [email protected]

CP2 CP2 Feast Eigenvalue Solver Using Three Levels of MPI Highly Scalable Selective Inversion Using Sparse Parallelism and Dense Matrix Algebra

We present an additional level of MPI parallelism for the The computation of a particular subset of entries of in- FEAST Eigenvalue solver that can augment its numeri- verse matrices is required in several different disciplines, cal scalability performances and data processing capabili- such as genomic prediction, reservoir characterization and ties for challenging eigenvalue problems at extreme scale. finance. Methods based on the matrix factorization can be Any MPI solver including domain-decomposition methods, applied to obtain a significant reduction of the complex- ity in a parallel and distributed-memory environment. We canbeinterfacedwiththesoftwarekernelthroughanRCI introduce an algorithm that implements an efficient, par- mechanism. The algorithm contains three levels of paral- allel, and scalable selective inversion method based on LU lelism and is ideally suited to run on parallel architectures. and Schur-complement factorization of sparse and dense Performance results from a real-world engineering applica- matrices. tion are discussed. Fabio Verbosio James Kestyn USI Universit`a della Svizzera Italiana University of Massachusetts, Amherst, MA [email protected] [email protected] Drosos Kourounis Eric Polizzi USI - Universita della Svizzera italiana University of Massachusetts, Amherst, USA Institute of Computational Science [email protected] [email protected]

Peter Tang Olaf Schenk Intel Corporation USI - Universit`a della Svizzera italiana 2200 Mission College Blvd, Santa Clara, CA Institute of Computational Science [email protected] [email protected]

CP2 CP2 Parallel Computation of Recursive Bounds for Exploring Openmp Task Priorities on the MR3 Eigenvalues of Matrix Polynomials Eigensolver

We derive bounds for eigenvalues of polynomial matrices As part of the OpenMP 4.1 draft, the runtime incorporates by generalizing to matrix polynomials a recent result for task priorities. We use the Method of Multiple Relatively scalar polynomials that can be recursively refined, increas- Robust Representations (MR3), for which a pthreads- ing the computational cost and decreasing sparsity, making based task parallel version already exists (MR3SMP), to it too costly to carry out sequentially for large matrices. analyze and compare the performance of MR3SMP with Instead, we use a parallel architecture (GPU) and achieve three different OpenMP runtimes, with and without the better bounds in the same amount of time necessary to ob- support of priorities. From a dataset consisting of applica- tain a single unrefined bound on a sequential machine. We tion matrices, it appears that OpenMP is always on par or illustrate with engineering examples. better than the pthreads implementation.

Aaron Melman Jan Winkelmann Santa Clara University AICES Department of Applied Mathematics RWTH Aachen PP16 Abstracts 69

[email protected] requirements. Both numerical and parallel scalabilities of the algorithm are presented for a diffusion equation having Paolo Bientinesi spatially varying diffusion coefficient modeled by a non- AICES, RWTH Aachen Gaussian stochastic process. [email protected] Abhijit Sarkar Associate Professor CP3 carleton University, Ottawa, Canada Fast Mesh Operations Using Spatial Hashing: abhijit [email protected] Remaps and Neighbor Finding Ajit Desai Mesh operations based on hashing techniques are fast Carleton University, Canada and scale linearly with problem-size. Hashing is also [email protected] comparison-free which is better suited for the fine grained parallelism of Exascale hardware. We present how to lever- Mohammad Khalil age hashing for fast remaps and neighbor finding and com- Carleton University, Canada pare performance with standard tree-based approaches on (Currently at Sandia National Laboratories, USA) GPUs, MICs and multicore systems. [email protected] Robert Robey Los Alamos National Lab Chris Pettit XCP-2 Eulerian Applications Naval Academy, USA [email protected] [email protected]

David Nicholaeff Dominique Poirel XCP-2 Eulerian Applications Royal Military College of Canada, Canada Los Alamos National Lab [email protected] [email protected] CP4 Gerald Collom University of Washington ”Mystic”: Highly-Constrained Non-Convex Opti- Los Alamos National Lab mization and Uncertainty Quantification [email protected] Highly-constrained, large-dimensional, and nonlinear op- timizations are at the root of many difficult problems in Redman Colin uncertainty quantification (UQ), risk, operations research, Arizona State University and other predictive sciences. The prevalence of parallel Los Alamos National Laboratory computing has stimulated a shift from reduced-dimensional [email protected] models toward more brute-force methods for solving high- dimensional nonlinear optimization problems. The ‘mys- CP3 tic’ software enables the solving of difficult UQ problems as embarrassingly parallel non-convex optimizations; and Traversing Time-Dependent Networks in Parallel with the OUQ algorithm, can provide validation and cer- Large-scale time-dependent networks arise in a range of tification of standard UQ approaches. digital technologies, such as citation networks, schedule Michael McKerns networks and networks of social users interacting through California Institute of Technology messaging. In this talk, we consider ways of mining such [email protected] complex networks, including computing in parallel the shortest temporal path and the connected components. We generalize breadth first search and depth first search algo- CP4 rithms for time-dependent networks. All algorithms are implemented in Julia, a new dynamic programming lan- A Resilient Domain Decomposition Approach for guage for scientific computing. Extreme Scale UQ Computations Weijian Zhang One challenging aspect of extreme scale computing con- School of Mathematics cerns combining uncertainty quantification (UQ) methods The University of Manchester with resilient PDE solvers. We present a resilient poly- [email protected] nomial chaos solver for uncertain elliptic PDEs. In a do- main decomposition framework, our solver constructs the Dirichlet-to-Dirichlet operator by means of sampling and CP4 robust regression techniques. This leads to a resilient UQ Scalability Studies of Two-Level Domain Decompo- solver that requires only a limited amount of global com- sition Algorithms for Stochastic Pdes Having Large munication, thus benefiting the scalability of the computa- Random Dimensions tions. The scalability of two-level domain decomposition algo- Paul Mycek, Andres Contreras rithms developed by Subber (PhD thesis, Carleton Uni- Duke University versity, 2012) for uncertainty quantification of large-scale [email protected], [email protected] stochastic PDEs is enhanced to handle high-dimensional stochastic problems. Parallel sparse matrix-vector opera- Olivier P. Le Matre tions are used to cut floating-point operations and memory LIMSI-CNRS 70 PP16 Abstracts

[email protected] zeyao [email protected]

Khachik Sargsyan, Francesco Rizzi, Karla Morris, Cosmin Safta CP5 Sandia National Laboratories The Elaps Framework: Experimental Linear Alge- [email protected], [email protected], bra Performance Studies [email protected], [email protected] The multi-platform open source framework ELAPS facili- tates easy and fast, yet powerful performance experimen- Bert J. Debusschere tation and prototyping of dense linear algebra algorithms. Energy Transportation Center In contrast to most existing performance analysis tools, it Sandia National Laboratories, Livermore CA targets the initial stages of the development process and as- [email protected] sists developers in both algorithmic and optimization de- cisions. Users construct experiments to investigate how Omar M. Knio performance and efficiency vary from one algorithm to an- Duke University other, depending on factors such as caching, algorithmic [email protected] parameters, problem size, and parallelism.

Elmar Peise, Paolo Bientinesi CP5 AICES, RWTH Aachen [email protected], [email protected] Reliable Bag-of-Tasks Dispatch System

We introduce a reliable bag-of-tasks dispatch system that CP6 will allow those bag-of-tasks applications run efficiently on Sparse Matrix Factorization on GPUs non-reliable hardware at excascale. The system is built on top of MPI to guarantee general applicability across We present a sparse multifrontal QR Factorization method different architectures. It utilizes a mathematical model on a eterogeneous platform multiple GPUs. Our method that, based on the execution time of some of the tasks is extremely efficient for different sparse matrices. Our and their sizes, determines, at runtime, if a task failure is method benefits from both the highly parallel general- recovered via task re-execution or whether the task requires purpose computing cores available on a graphics processing a replica. unit (GPU) and from multiple GPUs provided on a single platform. It exploits two types of parallelism: the first Tariq L. Alturkestani over the available GPUs on the platform and the second King Abdullah University of Science and Technology over the Streaming Multiprocessor (SMPs)within a GPU. University of Pittsburgh We are also developing a technique for factorizing a dense [email protected] frontal matrix on multiple GPUs using our GPU engine. Mohamed Gadou Esteban Meneses University of Florida Costa Rica Institute of Technology [email protected]fl.edu [email protected] Tim Davis Rami Melhem Texas A&M University University of Pittsburgh, USA [email protected] [email protected] Sanjay Ranka CISE, University of Florida, USA CP5 [email protected]fl.edu HiPro: An Interactive Programming Tool for Par- allel Programming CP6 The scientific computing community is suffering from a A New Approximate Inverse Technique Based on lack of good development tool that can handle well the the GLT Theory unique problems of coding for high performance comput- ing.HiPro is an interactive graphical tool designed for rapid Sparse approximate inverse is a well-known precondition- development of large-scale numerical simulations. The pro- ing technique. Given a matrix A, the approximation M of the A−1, in general, is based on Frobenius norm minimiza- gramming tool is based on the domain specific framework | − | JASMIN and is designed to ease parallel programming for tion I MA . The sparsity pattern of the inverse matrix domain experts. Real applications demonstrate that it is is prescribed a priori or adjusted dynamically during its helpful on enabling increased software productivity. construction. In this presentation, a new method for com- puting an approximate inverse of a matrix is presented. We utilize the available knowledge of the matrix eigenval- Li Liao, Aiqing Zhang ues attained by applying the Generalized Locally Toeplitz Institute of Applied Physics and Computational (GLT) theory. The sparsity pattern of the resulting ap- Mathematics proximation can be controlled dynamically. Moreover, the [email protected], zhang [email protected] construction of the approximate inverse is parallel and gen- erally cheap since the function describing the matrix A is Zeyao Mo known a priori. Laboratory of Computational Physics, IAPCM P.O. Box 8009, Beijing 100088, P.R. China Ali Dorostkar PP16 Abstracts 71

Uppsala University Bilkent University Department of Information Technology [email protected] [email protected] Murat Manguoglu Middle East Technical University CP6 [email protected] Designing An Efficient and Scalable Block Low- Rank Direct Solver for Large Scale Clusters Cevdet Aykanat Bilkent University Electromagnetic problems with an integral equation for- [email protected] mulation lead to solve large dense linear systems. To ad- dress large problems, we enhanced our Full-MPI direct solver with a block low-rank compression method. We CP7 show how we derived a block low-rank version from our Towards Exascalable Algebraic Multigrid existing solver, and detail the optimizations we used to in- crease its scalability on thousands of cores of the TERA Algebraic multigrid is the solver of choice in many and supercomputer at CEA/DAM, for example by introducing growing applications in today’s petascale environments. hybrid programming models, or with algorithmic improve- Exascale architectures exacerbate the challenges of in- ments. creased SIMT concurrency, reduced memory capacity and bandwidth per core, and vulnerability to global synchro- Xavier Lacoste nization. The recent Vassilevski-Yang (2014) mult-additive CEA - CESTA AMG takes important steps in reducing communication [email protected] and synchronization frequency while controlling memory growth, but remains bulk-synchronous. We extend the al- C´edric Augonnet, David Goudin gorithm further towards exascale environments with a task- CEA/DAM based directed acyclic graph implementation. [email protected], [email protected] Amani Alonazi,DavidE.Keyes KAUST CP6 [email protected], [email protected] Locality-Aware Parallel Sparse Matrix-Matrix Multiplication CP7 We propose a parallel sparse matrix-matrix multiplication A Matrix-Free Preconditioner for Elliptic Solvers for distributed memory clusters. Our method exploits lo- Based on the Fast Multipole Method cality of the nonzero matrix elements without beforehand knowledge of the matrix sparsity pattern. This is achieved We combine the Fast Multipole Method, originally devel- by using a distributed hierarchical matrix representation, oped as a free-standing solver, with Krylov iteration as a straightforward to implement using the Chunks and Tasks scalable and performant preconditioner for traditional low- programming model [Parallel Comput. 40, 328 (2014)]. order finite discretizations of elliptic boundary value prob- We demonstrate favorable weak and strong scaling of the lems. FMM possesses advantages for emergent architec- communication cost with the number of processes, both tures including: low synchrony, high arithmetic intensity, theoretically and in numerical experiments. and exploitable SIMT concurrency. We compare FMM- preconditioned Krylov solvers with multilevel and other EmanuelH.Rubensson solvers on a variety of elliptic problems and illustrate their Uppsala University effectiveness in a computational fluid dynamics applica- [email protected] tion.

Elias Rudberg Huda Ibeid Division of Scientific Computing, Department of I.T. King Abdullah University of Science and Technology Uppsala University [email protected] [email protected] Rio Yokota Tokyo Institute of Technology CP6 [email protected] Solving Sparse Underdetermined Linear Least Squares Problems on Parallel Computing Plat- David E. Keyes forms KAUST [email protected] Computing the minimum 2-norm solution is essential for many areas such as geophysics, signal processing and biomedical engineering. In this work, we present a new CP7 parallel algorithm for solving sparse underdetermined lin- Performance of Smoothers for Algebraic Multi- ear systems where the coefficient matrix is block diagonal grid Preconditioners for Finite Element Variational with overlapping columns. The proposed parallel approach Multiscale Incompressible Magnetohydrodynamics handles the diagonal blocks independently and the reduced system involving the shared unknowns. Experimental re- We examine the performance of various smoothers for a sults show the effectiveness of the proposed scheme on var- fully-coupled algebraic multigrid preconditioned Newton- ious parallel computing platforms. Krylov solution approach for a finite element variational multiscale turbulence model for incompressible magneto- F. Sukru Torun hydrodynamics (MHD). Our focus will be on large-scale, 72 PP16 Abstracts

high fidelity transient MHD simulations on unstructured [email protected] meshes. We present scaling results for resistive MHD test cases, including results for over 500,000 cores on an IBM Blue Gene/Q platform. CP8 Parallel Linear Programming Using Zonotopes Paul Lin Sandia National Laboratories Zonotope is an affine projection of n-dimensional cube. [email protected] The solution of a linear programming problem on a zono- tope is given by a closed-form expression. We consider lin- T ear programming problem min c xs.t.Ax= b, ||x||∞ ≤ 1. John Shadid Following work of Fujishige et al., the problem with the Sandia National Laboratories help of lifting method can be reduced to finding an inter- Albuquerque, NM section between a zonotope and a line. We consider par- [email protected] allelization of this approach using multiple projections of line points on the zonotope via first-order methods. Paul Tsuji Lawrence Livermore National Laboratory Max Demenkov [email protected] Institute of Control Sciences [email protected] Jonathan J. Hu Sandia National Laboratories Livermore, CA 94551 CP8 [email protected] Vertex Separators with Mixed-Integer Linear Op- timization

We demonstrate a method of computing vertex separators CP7 in arbitrary graphs using a multi-level coarsening and re- Efficiency Through Reuse in Algebraic Multigrid finement strategy coupled with a mixed-integer linear pro- gramming solver. This approach is easily parallelizable due to the use of branch-and-bound trees, leading to significant Algebraic multigrid methods (AMG) are utilized by many speedups on many-core systems. Our method contrasts large-scale scientific applications. Even well developed with other parallelized graph partitioning schemes in that AMG libraries may have few scalability areas of concern, the addition of processors does not decrease the separator be it an incomplete factorization smoother or coarse solver, quality. or a graph traversal algorithms required in smoothed ag- gregation. We will discuss strategies of mitigating the ef- Scott P. Kolodziej,TimDavis fect of such kernels through reuse of multigrid hierarchy Texas A&M University information in multiple setups during the simulation, and [email protected], [email protected] demonstrate the effectiveness of such strategies in parallel applications. CP8 Andrey Prokopenko Large-Scale Parallel Combinatorial Optimization Sandia National Laboratories – the Role of Estimation and Ordering the Sub- [email protected] Problems One major problem in parallel algorithms for a combinato- Jonathan J. Hu rial optimization problems is that the search space for the Sandia National Laboratories sub-problems differ in several magnitudes. This makes the Livermore, CA 94551 scheduling hard, and the resulting algorithm often ineffi- [email protected] cient. Measuring the search space of these sub-problems helps to solve this, and have other usage as well. We will show actual running results using several hundred proces- CP7 sors and show that our algorithm can solve previously un- solved maximum clique search problems. A Dynamical Preconditioning Strategy and Its Ap- plications in Large-Scale Multi-Physics Simulations Bogdan Zavalnij, Sandor Szabo Institute of Mathematics and Informatics University of Pecs In the implicit time-stepping based simulations, a fixed [email protected], [email protected] setup-based preconditioner, e.g. AMG, ILU etc, is usually used due to its robustness, which we called the static pre- conditioning strategy. Setup-based preconditioners, how- CP9 ever, lead to poor parallel scalability due to its setup phase. In this work, a dynamical preconditioning strategy is intro- Investigating IBM Power8 System Configuration duced in order to reduce the setup overhead while maintain To Minimize Energy Consumption During the Ex- the robustness. Results for a practical multi-physics simu- ecution of CoreNeuron Scientific Application lation on O(104) cores show its efficiency and improvement. Leveraging from AMESTER library [P. Klavik et al. Changing Computing Paradigms Towards Power Effi- Xiaowen Xu ciency, Klavk P, Malossi ACI, Bekas C, Curioni A. Institute of Applied Physics and Computational 2014 Changing computing paradigms towards power ef- Mathematics ficiency. Phil. Trans. R. Soc. A 372: 20130278. PP16 Abstracts 73

http://dx.doi.org/10.1098/rsta.2013.0278], the power con- [email protected] sumption of the POWER8 system will be investigated when executing CoreNeuron scientific application. Ex- tending the study presented in [T. Ewart et al., Perfor- CP9 mance Evaluation of the IBM POWER8 Architecture to Support Computational Neuroscientific Application Using Quantifying Performance of a Resilient Elliptic Morphologically Detailed Neurons, International Bench- PDE Solver on Uncertain Architectures Using marking and Optimization Workshop, 2015 Supercomput- SST/Macro ing Conference, Austin, TX, USA] both time to solution and energy consumption of several configurations of the Architectural uncertainty and resiliency are two key chal- POWER8 system will be analyzed. The optimal design lenges of exascale computing. In this talk, we present a per- point which allows maximizing application performance formance study of a resilient PDE solver on uncertain archi- while lowering energy consumption will be suggested, of- tectures using the architecture simulator software package fering useful insights for future system designs. SST/macro. We adopt a task-based server-client model to enable resiliency to hard faults. Using this framework, we Fabien Delalondre conduct an uncertainty quantification study allowing us to Ecole Polytechnique Federale de Lausanne perform a sensitivity analysis to identify the main param- fabien.delalondre@epfl.ch eters affecting performance and their correlations.

Timothee Ewart Karla Morris, Francesco Rizzi, Khachik Sargsyan EPFL Sandia National Laboratories timothee.ewart@epfl.ch [email protected], [email protected], [email protected] Pramod Shivaj EPFL - Blue Brain Project Kathryn Dahlgren pramod.kumbhar@epfl.ch Sandia National Labs [email protected] Felix Schuermann EPFL Paul Mycek felix.schuermann@epfl.ch Duke University [email protected]

Cosmin Safta CP9 Sandia National Laboratories Scalable Parallel I/O Using HDF5 for HPC Fluid [email protected] Flow Simulations Olivier P. Le Matre LIMSI-CNRS According to the latest US Presidents Council of Advisors [email protected] on Science and Technology ”high-performance computing must now assume the ability to efficiently manipulate vast Omar M. Knio and rapidly increasing quantities of data”. Therefore, we Duke University present a framework for scientific I/O using Hierarchical [email protected] Data Format 5 (HDF5), focusing on highly scalable paral- lel read and write routines. Furthermore, this framework allows to visualise subsets of the computed data already Bert J. Debusschere during runtime of a CFD simulation on massive parallel Energy Transportation Center systems. Sandia National Laboratories, Livermore CA [email protected] Christoph M. Ertl Chair of Computation in Engineering, Prof. Rank CP9 Technische Universit¨at M¨unchen [email protected] Implicitly Parallel Interfaces to Operational Quan- tum Computation Ralf-Peter Mundani TUM, Faculty of Civil, Geo and Environmental Many high-value computational problems involve search Engineering spaces that grow exponentially with problem size. Ap- Chair for Computation in Engineering plying traditional parallelism to such problems has had [email protected] limited success. An alternative strategy exploits quantum parallelism occurring within operational quantum comput- J´erˆome Frisch ers. Building on recently demonstrated benefits of quan- Institute of Energy Efficiency and Sustainable Building tum effects in such devices, we show how subject matter E3D experts using familiar interfaces can seamlessly tap into- RWTH Aachen University quantum superposition, entanglement, and tunneling to [email protected] perform their computations in entirely new ways,without learning quantum mechanics. Ernst Rank Computation in Engineering Steve P. Reinhardt Technische Universit¨at M¨unchen Cray, Inc. 74 PP16 Abstracts

[email protected] overlapping DDM is similar to the popular DDM for Pois- son problem. The complexity of the overlapping DDM is O(Nniter), where niter is the number of iteration of the CP9 method, and it is shown numerically that niter is propor- On the Path to Trinity - Experiences Bringing tional to the number of subdomains in one direction. 2D Code to the Next Generation ASC Capability Plat- Helmholtz problem with more than a billion unknowns and form 3D Helmholtz problem are solved efficiently with the pre- conditioner on massively parallel machines. Our next system, Trinity, a Cray XC40, will be a mixture of Intel Haswell and Intel Xeon Phi Knights Landing proces- Wei Leng sors. We are currently engaged in porting applications to Laboratory of Scientific and Engineering Computing these new processor designs where vectorization, threading Chinese Academy of Sciences, Beijing, China and use multi-level memories are required for high perfor- [email protected] mance. We will present initial results from these applica- tion ports and comment on the level of work required to Lili Ju get codes running on each partition of the machine. University of South Carolina Department of Mathematics Courtenay T. Vaughan [email protected] Sandia National Laboratories [email protected] CP10 Simon D. Hammond GPU Performance Analysis of Discontinuous Scalable Computer Architectures Galerkin Implementations for Time-Domain Wave Sandia National Laboratories Simulations [email protected] Finite element schemes based on discontinuous Galerkin methods exhibit interesting features for massively parallel CP10 computing and GPGPU. Nowadays, they represent a credi- Wave Propagation in Highly Heterogeneous Me- ble alternative for large-scale industrial applications. Com- dia: Scalability of the Mesh and Random Proper- putational performance of such schemes however strongly ties Generator depends on their implementations, and several strategies have been proposed or discarded in the past. In this talk, We present in this talk elastic wave propagation simula- we present and compare up-to-date performance results of tions in highly randomly heterogeneous geophysical media different implementations tested in a unified framework. using a spectral element solver. When using thousand of cores, meshing and generation of properties become issues. Axel Modave,JesseChan,TimWarburton We propose in this talk: a scalable meshing algorithm for Virginia Tech hexahedra that considers topography; and a scalable ran- [email protected], [email protected], [email protected] dom field generator that constructs elastic parameter fields with chosen statistics. Large scale simulations and scala- bility analyses illustrate the effectiveness of the approach. CP10 Efficient Solver for High-Resolution Numerical Regis Cottereau Weather Prediction Over the Alps CNRS, CentraleSupelec, Universit´e Paris-Saclay [email protected] Prototype implementation of new ”EULAG” dynamical core, based on nonoscillatory forward in time methods, Jos´e Camata into the operational COSMO framework for numerical NACAD - High Performance Computing Center weather prediction is discussed. The core aims at the next- COPPE/Federal University of Rio de Janeiro generation very high regional weather prediction over the [email protected] Alps. Currently, its numerical formulation based on MP- DATA advection and iterative Krylov solver offers compet- Lucio de Abreu Corra, Luciano de Carvalho Paludo itive performance to the operational dynamical core based CNRS, CentraleSupelec, Universit´e Paris-Saclay on Runge-Kutta methods. Moreover, it offers an unprece- [email protected], [email protected] dented choice of the soundproof/compressible equation sys- tems, offering similar quality of the solution but different ratio of global and local communication cost. Alvaro Coutinho The Federal University of Rio de Janeiro Zbigniew P. Piotrowski Brazil Institute of Meteorology and Water Management [email protected] [email protected]

CP10 Damian W´ojcik Poznan Supercomputing and Networking Center An Overlapping Domain Decomposition Precondi- [email protected] tioner for the Helmholtz Equation An overlapping domain decomposition preconditioner is Andrzej Wyszogrodzki proposed to solve the high frequency Helmholtz equation. Institute of Meteorology and Water Management, Poland The computation domain is decomposed in d directions for [email protected] problem in Rd, and the local solution on each subdomain is updated simultaneously in one iteration, thus there is Milosz Ciznicki no sweeping along certain directions. In these ways, the Institute of Meteorology and Water Management PP16 Abstracts 75

[email protected] AICES RWTH Aachen University [email protected] CP10 An Adaptive Multi-Resolution Data Model for Paolo Bientinesi Feature-Rich Environmental Simulations on Mul- AICES, RWTH Aachen tiple Scales [email protected] The accuracy and predictive value of environmental sim- ulations depend largely on the underlying geometric and CP11 semantic data. Therefore, a data-model suitable for driv- Fresh: A Framework for Real-World Structured ing such simulations with special emphasis on multi-scale Grid Stencil Computations on Heterogeneous Plat- urban flooding scenarios is presented. This includes fu- forms sion and orchestration of heterogeneous and ubiquitous data ranging from GIS to BIM sources, efficient hybrid To tackle the heterogeneous programming challenge, for parallelization of storage, access, and processing of large an important computation pattern structured-grid stencil datasets, and the full integration and interaction with the computations, we propose FRESH (Framework for REal- pipeline for parallel numerical simulations. world structured-grid Stencil computations on Heteroge- neous platforms), which consists of a Fortran-like domain- Vasco Varduhn specific language and a source-to-source compiler. The University of Minnesota DSL contains domain specific features that ease program- [email protected] ming. The compiler generates code for multiple platforms. A real-world computational electromagnetic application CP11 was ported , and speedups of 1.47, 16.68 and 15.32 were achieved on CPU, GPU and MIC. Maximizing CoreNeuron Application Performance on All CPU Architectures Using Cyme Library Yang Yang, Aiqing Zhang Institute of Applied Physics and Computational The Cyme library first presented in [T. Ewart et al., A Mathematics Library Maximizing SIMD Computation on User-Defined yang [email protected], zhang [email protected] Containers. In J.M. Kunkel, T. Ludwig, and H.W. Meuer (Eds.): ISC 2014, LNCS 8488, pp. 440–449. Springer International Publishing Switzerland] supports the auto- Zhang Yang generation of optimized code for specific targeted system Institute of Applied Physics and Computational architectures. Its last extension allows supporting both Mathematics IBM POWER8 and ARM processors in addition to Intel China Academy of Engineering Physics x86 and MIC as well as IBM Blue Gene/Q. In this pre- yang [email protected] sentation we present the integration of the Cyme library into CoreNeuron scientific application which proved to offer close optimal performance on IBM Blue Gene/Q systems CP12 except in the case of data hazard. Thanks to the integra- Load Balancing Issues in Particle In Cell Simula- tion, we demonstrate that better performance could ben tions obtained, overcoming the shortcomings of the XLC com- piler. Recent simulation campaigns of laser wakefield electrons acceleration, showed that, performance-wise, the most ur- Timothee Ewart, Kai Langen, Pramod Kumbhar, Felix gent concern is to find a way to face the strong load imbal- Schuermann ance that arises on very large full 3D simulations. The hy- EPFL brid MPI-openMP typical implementation performs quite timothee.ewart@epfl.ch, [email protected], well on systems with a couple hundreds of cores. But pramod.kumbhar@epfl.ch, felix.schuermann@epfl.ch the accessible number of openMP threads is limited and, as the number of MPI processes increases, this relatively Fabien Delalondre small number of threads is not able to balance the load Ecole Polytechnique Federale de Lausanne any more. Imbalance issues come back, wasting our pre- fabien.delalondre@epfl.ch cious computation time again. Hence the need for an effi- cient dynamic load balancing algorithm. The algorithm we present is based on the division of each MPI domain into CP11 many smaller, so called, patches. These patches are used Ttc: A Compiler for Tensor Transpositions as data sorting structures and can be exchanged between MPI processes in order to balance the computational load. We present TTC, an open-source compiler for multidimen- It has been implemented in the SMILEI code, a new open sional tensor transpositions. Thanks to a range of opti- source Particle In Cell (PIC) code, developed jointly by mizations (software prefetching, blocking, loop-reordering, physicists and HPC experts in order to make sure that it explicit vectorization), TCC generates high-performance performs well on the newest supercomputers architectures. parallel C/C++ code. We use generic heuristics and auto- tuning to manage the huge search space. Performance re- sults show that TTC achieves close to peak memory band- Arnaud Beck width on both the Intel Haswell and the AMD Steamroller K.U.Leuven architectures, yielding performance gains of up to 15× over [email protected] modern compilers. Julien Derouillat Paul Springer CEA - Maison De La Simulation 76 PP16 Abstracts

[email protected] John C. Bowman Dept. of Mathematical and Statistical Sciences University of Alberta CP12 [email protected] Particle-in-Cell Simulations for Nonlinear Vlasov- Like Models CP12 We report on the progress in the development of a particle- A Massively Parallel Simulation of the Coupled Ar- in-cell code for high-performance computing. We focus on terial Cells in Branching Network nonlinear Vlasov-like models as applications and on ab- solute performance per core on the implementation side. Arterial wall is composed of a network of multiple types of Results on classical physics test cases are presented. We cells connected via intercellular gap junctions. Under the report on the code’s organization and the approach to mul- environmental stimuli, e.g. the blood borne agonists and tiprocess and multithread issues. flow, a network of these cells elicits vasomotion. We present a massively parallel computational model simulating the Sever Hirstoaga coupled arterial cells in branching network, to study the Inria mechanistic basis of the coordinated vascular response up- [email protected] stream and downstream of a localized stimulus, mediated by intracellular calcium signaling. The goal is to emulate Edwin Chacon-Golcher and validate wet lab experiments. Institute of Physics of the ASCR, ELI-Beamlines,18221 Prague Mohsin A. Shaikh [email protected] Pawsey Supercomputing Center [email protected]

CP12 Constantine Zakkaroff Parallel in Time Solvers for DAEs UC HPC University of Canterbury Time integration of dynamic systems is a sequential process constantine.zakkaroff@canterbury.ac.nz by nature. With the increase of the number of cores and processors in modern computers, making this process par- Stephen Moore allel becomes an important issue. In this context, Braid IBM Research Collaboratory for Life Sciences-Melbourne software has been developed at LLNL as a non-intrusive The University of Melbourne parallel-in-time black box overlay. In this talk, we will fo- [email protected] cus on Differential Algebraic Equations and how parallel- in-time algorithms perform for such problems. These DAEs Tim David arise in many fields including the modelling of power grid University of Canterbury systems. [email protected] Matthieu Lecouvez Lawrence Livermore National Laboratory CP13 [email protected] Adaptive Transpose Algorithms for Distributed Multicore Processors Robert D. Falgout Center for Applied Scientific Computing An adaptive parallel block transpose algorithm optimized Lawrence Livermore National Laboratory for distributed multicore hybrid OpenMP/MPI architec- [email protected] tures is presented. A hybrid configuration can decrease communication costs by reducing the number of MPI Carol S. Woodward nodes, thereby increasing message sizes. This allows for Lawrence Livermore Nat’l Lab a more slab-like than pencil-like domain decomposition for [email protected] multidimensional transforms (such as the FFT), reducing the cost of, or even eliminating the need for, a second dis- tributed transpose. Nonblocking all-to-all transfers enable CP12 user computation and communication to be overlapped. Implicitly Dealiased Convolutions for Distributed Memory Architectures John C. Bowman Dept. of Mathematical and Statistical Sciences Convolutions are a fundamental tool of scientific comput- University of Alberta ing. For multi-dimensional data, implicitly dealiased con- [email protected] volutions [Bowman and Roberts, SIAM J. Sci. Comput. 2011] are faster and require less memory than conventional Malcolm Roberts FFT-based methods. We present a hybrid OpenMP/MPI im- Universit´edeStrasbourg plementation in the open-source software library FFTW++. [email protected] The reduced memory footprint translates to a reduced communication cost, and the separation of input and work arrays allows one to overlap computation and communica- CP13 tion. Efficient Two-Level Parallel Algorithm Based on Domain Decomposition Method for Antenna Sim- Malcolm Roberts ulation Universit´edeStrasbourg [email protected] This paper presents a scalable method for solving complex PP16 Abstracts 77

antenna problems. The method is based on FETI-2LM [email protected] domain decomposition method plus a block ORTHODIR approach for multiple right hand sides. The block ap- Nicole Spillane proach allows efficient multi-threading parallelization on Centre de Math´ematiques Appliqu´ees multi-core nodes for simultaneous local backward substi- Ecole Polytechnique tutions within the subdomains and decreases the overhead [email protected] due to message passing between subdomains for the or- thogonalization of search directions. Performance results for large antenna arrays on systems with up to tens of CP14 thousands of cores are analyzed. A Parallel Contact-Impact Simulation Framework Francois-Xavier Roux Parallel contact-impact simulation poses great challenges ONERA on both software design and performance. We present our [email protected] framework approach targeting at these challenges. Parallel details such as global contact search, dynamic load balanc- ing among processes and threads, data exchange among CP13 sub-domains, etc., are encapsulated into our framework. GPU Acceleration of Polynomial Homotopy Con- Application developers are then able to focus only on the tinuation physics and numerical schemes within each sub-domain. Algorithms and numerical experiments over 10K cores are Homotopy continuation methods apply predictor-corrector presented. algorithms to solve polynomial systems. Our implementa- tion on a Graphics Processing Unit (GPU) combines algo- Jie Cheng, Qingkai Liu rithmic differentiation with double double and quad double Institute of Applied Physics and Computational arithmetic. Our experiments with GPU acceleration show Mathematics that we can offset the cost overhead of double double pre- cheng [email protected], [email protected] cision arithmetic. We report on our implementation on the NVIDIA K20C and the integration into the software PHCpack and phcpy. CP14 Turbomachinery CFD on Modern Multicore and Jan Verschelde Manycore Architectures Department of Mathematics, Statistics and Computer Science The exploitation of parallelism across all granularities (in- University of Illinois at Chicago struction, data, core) is mandatory for achieving a high [email protected] degree of performance on current and future architectures. We present the performance tuning of multiblock and un- Xiangcheng Yu structured CFD solvers used for aero-engine design on lead- Department of Mathematics, Statistics, and Computer ing multicore (SandyBridge, Haswell) and manycore (Xeon Science Phi) platforms. Optimisations include efficient SIMD com- University of Illinois at Chicago putations of stencil operations derived from PDE dis- [email protected] cretizations and domain decomposition for thread paral- lelism for both structured and unstructured grids.

CP14 Ioan C. Hadade,LucaDiMare Robust Domain Decomposition Methods for Struc- Imperial College London tural Mechanics Computations [email protected], [email protected]

Classical domain decomposition strategies (FETI, BDD) are known to slowly converge when applied to real engi- CP14 neering problems. We present more robust block-based do- Optimization of the Photon’s Simulation on GPU main decomposition methods which can be viewed as mul- tiprecondioned CG algorithms, tailored for medium-sized Fast simulation of photon’s propagation has applications computer cluster. The talk focuses on numerical imple- in medical imaging and dosimetry. Our aim is to imple- mentation of this solver. Some numerical results demon- ment a CUDA-based GPU method for the simulation of strating its robustness regarding classical hard points (het- photon’s propagation in materials. Using profiling tools, erogeneity, incompressibility) will be shown. Slight mod- we present how the algorithm has been modified in order ifications allowing to deal with large displacement finite to improve thread level parallelism and memory access pat- element problem will be introduced. tern design and so the execution time. A large-scale test will be implemented to evaluate the total execution time. Christophe Bovet LMT, ENS Cachan, CNRS, Universit´e Paris-Saclay, Clement P. Rey, Benjamin Auer, Emmanuel Medernach, 94235 Cachan Ziad El Bitar [email protected] CNRS,UMR7178,Universit´edeStrasbourg [email protected], [email protected], em- [email protected], [email protected] Augustin Parret-Fr´eaud Safran Tech, Rue des Jeunes Bois, Chteaufort CS 80112 78772 Magny-Les-Hameaux MS1 [email protected] Evaluation of Energy Profile for Krylov Iterative Methods on Cluster of Accelerators Pierre Gosselet LMT, ENS Cachan Krylov iterative methods are used to solve large sparse lin- 78 PP16 Abstracts

ear problems, which require optimization work to address Using Scientific Libraries on Emerging Architec- the communication bottleneck. With respect to the energy tures profile, we shall investigate whether optimizations in com- munication are also energy efficient and scalable. In this Abstract not available presentation, we will reveal the relation between time and energy profile of our optimized kernels for GMRES through Michael Heroux an extensive test on GPU clusters, which demands a smart- Sandia National Laboratories tuning scheme for optimal performances. [email protected]

Langshi Chen MS2 CNRS/Maison de la Simulation [email protected] Correlating Facilities, System and Application Data Via Scalable Analytics and Visualization

Serge G. Petiton Understanding application, system and facilities perfor- University Lille 1, Science and Technologies - CNRS mance and throughput over time necessitates monitoring serge.petiton@lifl.fr and recording massive amounts of data. The next step is storing and performing scalable analysis over this data. We will present efforts at LLNL to create an ecosystem for MS1 managing historical performance data and using data ana- Linear Algebra for Data Science Through Exam- lytics and visualization to find root causes and insights into ples performance issues. We will also present some preliminary results on correlating data from various sources. This talk presents some linear algebra problems behind the techniques for analyzing data. Through examples from Abhinav Bhatele, Todd Gamblin disciplines as astronomy and health, which generate data Lawrence Livermore National Laboratory masses that must to be properly analyzed, we show the role [email protected], [email protected] of certain methods of linear algebra. Then we highlight the importance of high performance computing in the analysis Alfredo Gimenez, Yin Yee Ng of these data. University of California, Davis [email protected], [email protected] Nahid Emad PRiSM laboratory and Maison de la Simulation, University of MS2 [email protected] A Monitoring Tool for HPC Operations Support

TACC Stats is a performance and resource usage monitor- MS1 ing tool that has been running on multiple HPC systems Coarse Grid Aggregation for Sa-Amg Method with for the past five years. It collects data at regular intervals Multiple Kernel Vectors for every job from hardware counters, sysfs, and procfs. TACC uses this data in a variety of system management The Smoothed Agregation Algebraic Multigrid (SA-AMG) and user support efforts. Analyses characterizing common method is a multi-level linear solver that can be used in performance issues on TACC systems will be presented us- many kinds of applications. Its coarse grid, however, does ing components of the tool that facilitate data exploration not have enough number of unknowns for parallelism the and visualization. computing environment may offer, so special care has to be taken at that stage. In this contribution we discuss the R. Todd Evans coarse grid aggregation approach. Convergence of the SA- University of Texas AMG solver can be much improved by specifying multiple [email protected] near-kernel vectors, and our study investigates the effec- tiveness of the coarse grid aggregation for SA-AMG solver James C. Browne with multiple kernel vectors. Dept. of Computer Sciences University of Texas-Austin Akihiro Fujii,NaoyaNomura,TeruoTanaka [email protected] Kogakuin University [email protected], [email protected], John West [email protected] University of Texas at Austin Texas Advanced Computing Center Kengo Nakajima tba The University of Tokyo Information Technology Center Bill Barth [email protected] Texas Advanced Computing Center University of Texas at Austin Osni A. Marques [email protected] Lawrence Berkeley National Laboratory Berkeley, CA [email protected] MS2 Developing a Holistic Understanding of I/O Work- loads on Future Architectures MS1 Opportunities and Challenges in Developing and The I/O subsystems of extreme-scale computing systems PP16 Abstracts 79

are becoming more complex as they stratify into tiers op- have beaten Lapack as well as specialized code in scientific timized for different balances of performance and capacity. applications. Here, we review recent RLA results in im- Consequentially, obtaining a coherent picture of the work- plementing in parallel and distributed environments, e.g., load that reflects everything from application-level I/O in high performance computing applications and in dis- to back-end file system performance will be significantly tributed data centers. harder. To this end, we are developing a framework for col- lecting and normalizing I/O workload data from disparate Michael Mahoney, Alex Gittens sources throughout the I/O subsystem to characterize the UC Berkeley holistic utilization of future storage platforms. [email protected], [email protected]

Glenn Lockwood Farbod Roosta-Khorasani NERSC Computer Science Lawrence Berkeley National Laboratory UBC [email protected] [email protected]

Nicholas Wright Lawrence Berkeley National Laboratory MS3 [email protected] Parallel Combinatorial Algorithms in Sparse Ma- trix Computation?

MS2 Combinatorial techniques are used in several phases of MuMMI: A Modeling Infrastructure for Exploring sparse matrix computation. For large-scale problems, while Power and Execution Time Tradeoffs in Parallel numerical phases are often executed in parallel, most of Applications these combinatorial techniques are serial and can become bottlenecks. We are investigating the extent to which some MuMMI (Multiple Metrics Modeling Infrastructure) facil- of the combinatorial techniques can be performed in par- itates the measurement, modeling and prediction of per- allel. formance and power consumptions for parallel applica- tions. The MuMMI framework develops models of run- Mathias Jacquelin time and power based upon performance counters; these Lawrence Berkeley National Lab models are used to explore tradeoffs and identify methods [email protected] for improving energy efficiency. In this talk we will dis- cuss the MuMMI framework and its recent development, Esmond G. Ng and present examples of the use of MuMMI to improve the Lawrence Berkeley National Laboratory energy efficiency of parallel applications. [email protected]

Valerie Taylor,XingfuWu Barry Peyton Texas A&M University Dalton State College [email protected], [email protected] [email protected]

MS3 Yili Zhrng, Kathy Yelick Lawrence Berkeley National Laboratory Parallel Graph Matching Algorithms Using Matrix [email protected], [email protected] Algebra

We present distributed-memory parallel algorithms for MS3 computing matchings in bipartite graphs. We consider both exact and approximate algorithms for cardinality and Parallel Algorithms for Automatic Differentiation weighted matching problems. We substitute the asyn- chronous data access patterns of traditional matching al- Despite three decades of effort, parallel algorithms for Au- gorithms by a small subset of more structured, bulk- tomatic Differentiation have been challenging to design. synchronous functions based on matrix algebra. Rely- We describe a new approach to parallelization for second ing on communication-avoiding algorithms for the underly- and higher order derivatives using the Reverse mode of AD ing matrix-algebric modules, different matching algorithms implemented with the MPI library. By interpreting a send achieve good speedups on tens of thousands of cores on cur- message as creating a dependent variable, and a receive rent supercomputers. message as creating an independent variable, we perform all communications in the Forward mode, thus avoiding the Ariful Azad, Aydin Buluc reversal of communication operations. Lawrence Berkeley National Laboratory [email protected], [email protected] Alex Pothen Purdue University Department of Computer Science MS3 [email protected] Randomized Linear Algebra in Large-Scale Com- putational Environments Mu Wang Purdue University Randomized linear algebra (RLA) is an area that exploits [email protected] randomness for the development of improved algorithms for fundamental matrix problems. It has led to the best worst- cast algorithms for problems such as over-determined least MS4 squares approximation, and high-quality implementations Enlarged GMRES for Reducing Communication 80 PP16 Abstracts

When Solving Reservoir Simulation Problems [email protected], [email protected]

In this work we discuss enlarged Krylov subspaces, [Grigori et al., Enlarged Krylov subspace conjugate gradient meth- MS4 ods for reducing communication], to solve linear systems CA-SVM: Communication-Avoiding Support Vec- with one right hand side, as arising from reservoir simula- tor Machines on Distributed Systems tions. Our work focuses on GMRES since the considered linear systems are neither positive definite nor symmet- We consider the design and implementation of ric. We discuss the usage of deflation, inexact breakdowns, communication-efficient parallel support vector ma- [Robb´e and Sadkane, Exact and inexact breakdowns in the chines, widely used in statistical machine learning, for block GMRES method], in each iteration and deflation of distributed memory clusters and supercomputers. Prior approximated eigenvalues after each cycle when restarting to our study, the parallel isoefficiency of a state-of-the-art 3 GMRES. implementation scaled as W =Ω(P ), where W is the problem size and P the number of processors. Our Hussam Al Daas Communication-Avoiding SVM method improves the Universit´e Paris-Sud 11 isoefficiency to nearly W =Ω(P ). We evaluate these [email protected] methods on 96 to 1536 processors, and show speedups of up to 16× over Dis-SMO, and a 95% weak-scaling Laura Grigori efficiency on six real-world datasets with only modest INRIA losses in overall classification accuracy. France [email protected] Yang You University of California, Berkeley [email protected] Pascal Henon TOTAL E&P [email protected] MS5 Time Parallel and Space Time Solution Methods - Philippe Ricoux From Heat Equation to Fluid Flow TOTAL SA [email protected] In this talk, we discuss the parallel solution in space and time of the Navier Stokes equation using space-time multi- grid methods. The choice of the smoother is discussed MS4 as well as the proper interwtwining of the Picard itera- Communication-Avoiding Krylov Subspace Meth- tion, which is used for resolving the nonlinear terms in the ods in Theory and Practice Navier Stokes equations, with the linear multigrid method used. Solvers for sparse linear algebra problems, ubiquitous throughout scientific codes, often limit application per- Rolf Krause formance due to a low computation/communication ra- Institute of Computational Science tio. In this talk, we discuss the development of parallel University of Lugano communication-avoiding Krylov subspace methods, which [email protected] can achieve performance improvements in a wide variety of scientific applications. We present results of a paral- Pietro Benedusi lel scaling study and discuss complex tradeoffs that arise Institute of Computational Science due to both machine parameters and the numerical prop- USI Lugano erties of the problem. We conclude with recommendations pietro,[email protected] on when we expect communication-avoiding variants to be beneficial in practice. Peter Arbenz ETH Zurich Erin C. Carson Computer Science Department New York University [email protected] [email protected] Daniel Hupp ETH Z¨urich MS4 [email protected] Outline of a New Roadmap to Permissive Commu- nication and Applications That Can Benefit MS5 FFT has been a classic computation engine for numerous A Parallel-in-Time Solver for Time-Periodic applications but is bandwidth-intensive, limiting its perfor- Navier-Stokes Problems mance on off-the-shelf parallel machines. Using the FFT and other applications as examples, we examine the impact In science and engineering, many problems are driven by that adoption of some enabling technologies would have on time-periodic forcing. In fluid dynamics, this occurs for the performance of a manycore architecture. These tech- example in turbines or in human blood flows. In the ab- nologies include 3D VLSI and silicon photonics, with a spe- sence of turbulence, if a fluid is excited long enough, a cial midwifing role for microfluidic cooling, in a roadmap time-periodic steady state establishes. In numerical mod- towards permissive communication. els of such flows, the steady state is computed by simulating this transient phase with a time-stepping method until the James A. Edwards,UziVishkin periodic state is reached. In this work, we present an al- University of Maryland ternative approach by solving directly for the steady-state PP16 Abstracts 81

solution, introducing periodic boundary conditions in time. [email protected] This allows us to use a multi-harmonic ansatz in time for the solution of the Navier-Stokes equations. The solution of the different Fourier modes is distributed to different MS5 groups of processing units, which then have to solve a spa- Micro-Macro Parareal Method for Stochastic Slow- tial problem. Therefore, the time axis is parallelized such Fast Systems with Applications in Molecular Dy- that the number of parallel threads scales with the number namics of resolved Fourier modes. We introduce a micro-macro parareal algorithm for the Daniel Hupp time-parallel integration of multiscale-in-time systems. ETH Z¨urich The algorithm first computes a cheap, but inaccurate, solu- [email protected] tion using a coarse propagator (simulating an approximate slow macroscopic model), which is iteratively corrected us- Peter Arbenz ing a fine-scale propagator (accurately simulating the full ETH Zurich microscopic dynamics). This correction is done in paral- Computer Science Department lel over many subintervals, thereby reducing the wall-clock [email protected] time needed to obtain the solution, compared to the inte- gration of the full microscopic model. We provide a numeri- Dominik Obrist cal analysis of the algorithm for a prototypical example of a University of Bern micro-macro model, namely singularly perturbed ordinary [email protected] differential equations. We show that the computed solu- tion converges to the full microscopic solution (when the parareal iterations proceed) only if special care is taken during the coupling of the microscopic and macroscopic MS5 levels of description. The convergence rate depends on the Time Parallelizing Edge Plasma Simulations Using modeling error of the approximate macroscopic model. We the Parareal Algorithm illustrate these results with numerical experiments, includ- ing a non-trivial model inspired by molecular dynamics. This work explores the exploitation of the parareal algo- rithm to achieve temporal parallelization of simulations of Giovanni Samaey magnetically confined, fusion plasma at the edge of the de- Department of Computer Science, K. U. Leuven vice. The physics is extremely complex due to the presence [email protected] of neutrals as well as the interaction of the plasma with the wall. The SOLPS code package, widely adapted by the fu- sion community, is used here to scale the performance of Keith Myerscough the parareal algorithm in various ITER relevant plasmas. Department of Computer Science KU Leuven [email protected] Debasmita Samaddar CCFE Frederic Legoll UK Atomic Energy Authority LAMI, ENPC [email protected] [email protected] David Coster Tony Lelievre Max-Planck-Institut f¨ur Plasmaphysik, Germany Ecole des Ponts ParisTech [email protected] [email protected] Xavier Bonnin ITER Organization, France MS6 [email protected] Parallel GMRES Using A Variable S-Step Krylov Basis Eva Havlickova UK Atomic Energy Authority [email protected] Sparse linear systems arise in computational science and engineering. The goal is to reduce the memory require- ments and the computational cost, by means of high per- Wael R. Elwasif formance computing algorithms. We introduce a new vari- ORNL, USA ation on s-step GMRES in order to improve its stability, [email protected] reduce the number of iterations necessary to ensure conver- gence, and thereby improve parallel performance. In doing Donald B. Batchelor so, we develop a new block variant that allows us to express Oak Ridge National Laboratory the stability difficulties in s-step GMRES more fully. [email protected] David Imberti Lee A. Berry Inria-Rennes ORNL, USA [email protected] [email protected] Jocelyne Erhel Christoph Bergmeister INRIA-Rennes, France University of Innsbruck, Austria Campus de Beaulieu 82 PP16 Abstracts

[email protected] [email protected]

MS6 MS7 A Tiny Step Toward a Unified Task Scheduler for Optimizing Numerical Simulations of Elastody- Large Scale Heterogeneous Architectures namic Wave Propagation Thanks to Task-Based Parallel Programming Abstract not available Elastodynamics equations can be discretized into an intrin- sic parallelizable algorithm. The programming paradigm Clement Fontenaille choice is critical as its capabilities to take advantage of UVSQ the exposed degree of parallelism will vary. MPI may [email protected] probe to be too static, especially concerning load balanc- ing issues and ease of granularity adaptation. Instead of MPI, we used a task-based programming over runtime sys- MS6 tems, PaRSEC. We analyze this novel implementation and Extraction the Compute and Communications Pat- compare the performances on both shared and distributed terns from An Industrial Application memory machines.

Abstract not available Lionel Boillot Team-Project Magique-3D Thomas Guillet INRIA Bordeaux Sud-Ouest Intel [email protected] [email protected] Corentin Rossignon TOTAL MS6 [email protected] Toward Large-Eddy Simulation of Complex Burn- ers with Exascale Super-Computers: A Few Chal- George Bosilca lenges and Solutions University of Tennessee - Knoxville [email protected] The modeling of turbulent combustion in realistic devices is a multi-scale problem, which can be tackled with future Emmanuel Agullo exascale super-computers. However, the multi-scale nature INRIA of the flow often requires h- or p-adaptivity and the use of [email protected] splitting approaches to decouple the different time-scales. These methods introduce a large load unbalance that needs Henri Calandra to be addressed when thousands of cores are used. Total CSTJF, Geophysical Operations & Technology, R&D Vincent R. Moureau Team CORIA Laboratory [email protected] Universit´eetINSAdeRouen [email protected] MS7 Pierre B´enard Parsec: A Distributed Runtime System for Task- CORIA - CNRS UMR6614, Universit´eetINSAdeRouen Based Heterogeneous Computing [email protected] Abstract not available

Ghislain Lartigue Aurelien Bouteiller CORIA - CNRS UMR6614 University of Tennessee [email protected] [email protected]

George Bosilca MS7 University of Tennessee - Knoxville Programming Linear Algebra with Python: A [email protected] Task-Based Approach

Task-based programming has been extensively used to pro- MS7 gram linear algebra applications. While most of times the Controlling the Memory Subscription of Applica- chosen programming language falls in the category of tra- tions with a Task-Based Runtime System ditional ones, there is a trend in the HPC community in developing applications in scripting languages. COMPSs This talk will discuss the cooperation between a task-based programming model has been recently extended to support compressed linear algebra code and a runtime to control the a syntax in Python (PyCOMPSs). The talk will present memory subscription levels throughout the execution. The the features of PyCOMPSs and how linear algebra Python task paradigm enables controlling the memory footprint codes can be run in parallel in clusters. of the application by throttling the task submission rate, striking a compromise between the performance benefits Rosa M. Badia of anticipative task submission and the resulting memory Barcelona Supercomputing Center / CSIC consumption. We illustrate our contribution on an ACA- PP16 Abstracts 83

based compressed dense linear algebra solver. Institute of Computational Science University of Lugano Marc Sergent [email protected] INRIA, France [email protected] Gabriel Wittum G-CSC David Goudin University of Frankfurt, Germany CEA/CESTA [email protected] [email protected]

Samuel Thibault MS8 University of Bordeaux 1, Inria Scalable Shape Optimization Methods for Struc- [email protected] tured Inverse Modeling Using Large Scale Hpc

Olivier Aumage In this talk, we present a method combining HPC tech- INRIA, France niques and large scale applications with an optimized in- [email protected] verse shape identification procedure. It is demonstrated, how existing multigrid infrastructure can be incorporated into a shape optimization framework with a focus on scal- MS8 ability of the algorithm. These techniques are utilized in order to fit a model of the human skin to data measure- Uncertainty Quantification Using Tensor Methods ments and not only estimate the permeability coefficients and Large Scale Hpc but also the shape of the cells.

We consider the problem of uncertainty quantification for Volker H. Schulz extreme scale parameter dependent problems where an un- University of Trier derlying low rank property of the parameter dependency Department of Mathematics is assumed. For this type of dependency the hierarchical [email protected] Tucker format offers a suitable framework to approximate a given output function of the solutions of the parameter dependent problem from a number of samples that is lin- Martin Siebenborn ear in the number of parameters. In particular we can a University of Trier posteriori compute the mean, variance or other interest- [email protected] ing statistical quantities of interest. In the extreme scale setting it is already assumed that the underlying fixed- MS8 parameter problem is distributed and solved for in parallel. We provide in addition a parallel evaluation scheme for the Parallel Adaptive and Robust Multigrid sampling phase that allows us on the one hand to combine Abstract not available several solves and on the other hand parallelise the sam- pling. Finally, we address the problem of estimating the Gabriel Wittum accuracy of the computations in low rank tensor format. G-CSC University of Frankfurt, Germany Lars Grasedyck,ChristianL¨obbert [email protected] RWTH Aachen [email protected], [email protected] MS9 MS8 Iterative ILU Preconditioning Parallel Methods in Space and Time and Their Ap- Iterative and asynchronous algorithms, with their ability to plication in Computational Medicine tolerate memory latency, form an important class of algo- rithms for modern computer architectures. In this talk we In this talk, we consider parallel discretization methods present computational and numerical aspects of replacing in space and time for the efficient solution of problems traditional ILU preconditioners with iterative and poten- arising from medicine. We start from permeation through tially asynchronous variants. the human skin, were we combine parallelization in time through the parareal method with multigrid methods in Hartwig Anzt space. Then, we discuss the full space time discretization Innovate Computing Lab of reaction diffusion problems as occuring in electrophysi- University of Tennessee ological simulations of cardiac activity. [email protected] Rolf Krause Edmond Chow Institute of Computational Science School of Computational Science and Engineering University of Lugano Georgia Institute of Technology [email protected] [email protected] Andreas Kreienb¨uhl Institute of Computational Science MS9 USI Lugano Kokkos: Manycore Programmability and Perfor- [email protected] mance Portability

Dorian Krause Kokkos enables development of HPC applications that are 84 PP16 Abstracts

performance portable across disparate manycore devices [email protected] such as NVIDIA Kepler, Intel Xeon Phi, and AMD Fusion. Kokkos abstractions leverage C++11 features to enable de- veloper productivity in implementing thread parallel al- MS10 gorithms; especially hierarchical parallel algorithms and Data Management and Analysis for An Energy Ef- architecture-polymorphic data structures required for high ficient HPC Center degree of parallelism and performant memory access pat- terns. In contrast, language extensions such as OpenMP, The HPC center at LLNL is concerned about the qual- OpenACC, or OpenCL fail to address memory access pat- ity, cost, and environmental impact; in addition, we share terns and are unwieldy to implement non-trivial parallel with our providers the interest to reduce energy costs and algorithms. Kokkos abstractions, highlights of its standard improve electrical grid reliability. We implemented an ex- C++11 library interface, and sampling of performance re- tensive monitoring and data collection system to collect sults will be presented. and analyze data from power meters, PMU’s, and com- puter environmental and power sensors. We will describe H. Carter Edwards, Christian Trott our work on managing (storage and integration), analysis Sandia National Laboratories and visualization of the collected data. [email protected], [email protected] Ghaleb Abdulla, Anna Maria Bailey, John Weaver Lawrence Livermore National Laboratory MS9 [email protected], [email protected], [email protected] ICB: IC Preconditioning with a Fill-in Strategy Considering Simd Instructions MS10 Smart HPC Centers: Data, Analysis, Feedback, Most of current processors are equipped with SIMD in- and Response structions that are used to increase the performance of ap- plication programs. We analyze the effective use of SIMD More efficient HPC operation can be enabled by analysis instructions in the Incomplete Cholesky Conjugate Gradi- of full system data, feedback of analysis results to appli- ent (ICCG) solver. A new fill-in strategy in the IC fac- cations and subsystems, and dynamic adaptive response. torization is proposed for the SIMD vectorization of the However, significant gaps for support of such operations preconditioning step and to increase the convergence rate. still exist. We present progress on an integrated full sys- Our numerical results confirm that the proposed method tem approach to HPC operations addressing our work on has better performance than the conventional IC(0)-CG a) application and global platform analysis b) architecture method. for data integration, analysis, and feedback and c) system (e.g., resource manager, facilities, partitioners) responses. Takeshi Iwashita Hokkaido University James Brandt, Ann Gentile [email protected] Sandia National Laboratories [email protected], [email protected] Naokazu Takemura Kyoto University Cindy Martin [email protected] Los Alamos National Laboratory c [email protected] Akihiro Ida Academic Center for Computing and Media Studies Benjamin Allan Kyoto University Sandia National Laboratories [email protected] Livermore, CA 94550 [email protected] Hiroshi Nakashima ACCMS Karen D. Devine Kyoto University Sandia National Laboratories [email protected] [email protected]

MS9 MS10 Low-Overhead Monitoring of Parallel Applications Thread-Scalable FEM Next-generation parallel systems will have hundreds of Finite Element Method (FEM) is widely used for solv- hardware threads per node and in some cases attached ing Partial Differential Equations (PDE) in various types accelerators. Growth in node-level parallelism is a chal- of applications of computational science and engineering. lenge for monitoring. This talk will lay out a vision for Matrix assembly and sparse matrix solver are the most integrated, low-overhead, sampling based measurement of expensive processes in FEM procedures. In the present computation, communication, and I/O of hybrid programs work, these processes are optimized on Intel Xeon Phi and on next-generation supercomputers. Using sampling, one NVIDIA Tesla K40 based on features of each architecture. can pinpoint inefficiencies and scaling bottlenecks at all This talk describes details of optimization and results of levels within and across nodes of parallel systems. performance evaluation. John Mellor-Crummey Kengo Nakajima Computer Science Dept. The University of Tokyo Rice University. Information Technology Center [email protected] PP16 Abstracts 85

Mark Krentel [email protected] Rice University [email protected] MS11 Community Detection on GPU MS10 There has been considerable interest in community detec- Discovering Opportunities for Data Analytics tion for finding the modularity structure in real world data. Through Visualization of Large System Monitor- Such data sets can arise from social networks as well as vari- ing Datasets ous scientific domains. The Louvain method is one popular method for this problem as it is simple and fast. It can also Blue Waters is a Cray supercomputer comprised of over be used to detect hierarchical structures in the data. How- 27,000 nodes and has been instrumented with detailed ever, its inherently sequential nature and cache unfriendly monitoring across all of its subsystems. NCSA has been fo- workloads makes it difficult to parallelize. This is partic- cused on integrating the data and producing insights into ularly true for co-processor architectures. In this work we application and systems behavior. While realtime event show how these obstacles can be overcome and present re- processing and visualization of the data has proved suc- sults from implementing the algorithm on a GPU. cessful in demonstrating some clear data patterns, we look forward to the opportunities in numerical analysis and data Md Naim analytics on this long term dataset. University of Bergen Norway Michael Showerman [email protected] NCSA University of Illinois at Urbana-Champaign Fredrik Manne [email protected] University of Bergen, Norway [email protected] MS11 A Partitioning Problem for Load Balancing and MS12 Reducing Communication from the Field of Quan- Lower Bound Techniques for Communication in the tum Chemistry Memory Hierarchy

We present a combinatorial problem and potential solu- Communication requirements between different levels of tions arising in parallel computational chemistry. The the memory hierarchy of a DAG computation are strictly Hartree-Fock (HF) method has a very complex data access connected to the space requirements of the subset in suit- pattern. Much research has been devoted over 20 years able partitions of the computation DAG. We discuss a ver- for parallelizing this important method, based primarily satile technique based on a generalized notion of visit of a on intuition and experience. A formal approach for paral- DAG to obtain space lower bounds. We illustrate the tech- lelizing HF while reducing communication may come from nique by deriving space lower bounds for specific DAGs. graph and hypergraph partitioning. Besides providing a We also establish some limitations to the power of the pro- potential solution, this approach may also shed light on posed technique. the optimality of existing approaches. Gianfranco Bilardi Edmond Chow University of Padova School of Computational Science and Engineering [email protected] Georgia Institute of Technology [email protected] MS12 Communication-Optimal Loop Nests MS11 Communication (data movement) often dominates a com- Scalable Parallel Algorithms for De Novo Assembly putation’s runtime and energy costs, motivating organiz- of Complex Genomes ing an algorithm’s operations to minimize communication. We study communication costs of a class of algorithms in- A critical problem for computational genomics is the prob- cluding many-body and matrix/tensor computations and, lem of de novo genome assembly: the development of ro- more generally, loop nests operating on array variables sub- bust scalable methods for transforming short randomly scripted by linear functions of the loop iteration vector. We sampled sequences into the contiguous and accurate re- use this algebraic relationship between variables and op- construction of complex genomes. While advanced meth- erations to derive communication lower bounds for these ods exist for assembling the small and haploid genomes algorithms. We also discuss communication-optimal im- of prokaryotes, the genomes of eukaryotes are more com- plementations that attain these bounds. plex. We address this challenge head on by developing HipMer, an end-to-end high performance de novo assem- Nicholas Knight bler designed to scale to massive concurrencies. HipMer New York University employs an efficient Unified Parallel C (UPC) implementa- [email protected] tion and computes the assembly of the human genome in only 8.4 minutes using 15,360 cores of a Cray XC30 system. MS12 Write-Avoiding Algorithms Evangelos Georganas EECS Communication complexity of algorithms on parallel ma- UC Berkeley chines and memory hierarchies is well understand. Most of 86 PP16 Abstracts

this work analyzes the amount of communication without Method PFASST separately counting induced reads and writes. However, in upcoming memory technologies, such as non-volatile mem- For time-dependent PDEs, parallel-in-time integration ories, there is a significant asymmetry in the cost of reads with e.g. the ”parallel full approximation scheme in space and writes. This motivates us to study lower bounds on the and time” (PFASST) is a promising way to accelerate exist- number of writes, which are relatively expensive. While we ing space-parallel approaches beyond their scaling limits. prove that for many algorithms there can not be asymp- In addition, the iterative, multi-level nature of PFASST totically fewer writes than reads, in the case of direct lin- can be exploited to recover from hardware failures, pre- ear algebra and N-body methods the opposite is true. We venting a breakdown of the time integration process. In present algorithms that match these lower bounds. this talk, we present various recovery strategies, show their impact and discuss challenges for the implementation on Harsha Vardhan Simhadri HPC architectures. Lawrence Berkeley National Laboratory [email protected] Robert Speck, Torbj¨orn Klatt Juelich Supercomputing Centre Forschungszentrum Juelich GmbH MS12 [email protected], [email protected] Communication-Efficient Evaluation of Matrix Daniel Ruprecht Polynomials School of Mechanical Engineering University of Leeds Abstract not available [email protected] Sivan A. Toledo Tel Aviv University MS13 [email protected] Parareal in Time and Fixed-Point Schemes

In this talk, we will discuss how the parallel efficiency of the MS13 paralleal in time algorithm can be improved by coupling it Parallel-in-Time Coupling of Models and Numeri- with fixed-point schemes. cal Methods Olga Mula We present a new numerical procedure for the simula- CEREMADE (Paris Dauphine University) tion of time-dependent problems based on the coupling be- [email protected] tween the finite element method and the lattice Boltzmann method. The procedure is based on the Parareal paradigm and allows to couple two numerical methods having opti- MS13 mal efficiency at different space and time scales. The main Implementating Parareal - OpenMP Or MPI? motivation, among other, is that one technique may be more efficient, or physically more appropriate or less mem- I will introduce an OpenMP implementation of Parareal ory consuming than the other depending on the target of with pipelining and compare it to a standard MPI imple- the simulation and/or on the sub-region of the computa- mentation with respect to runtime, memory and energy. tional domain. We detail the theoretical and numerical The used benchmark code is PararealF90, which provides framework for linear parabolic equations even though its three different implementations of Parareal with and with- potential applicability should be wider. We show various out pipelining using OpenMP or MPI to parallelise in time. numerical examples on the heat equation that will validate Benchmarks from a Linux cluster and a Cray XC40 show the proposed procedure and illustrate its advantages. Fi- that while both versions differ only very little in runtime, nally some recent extensions and/or variants of our original using OpenMP can save memory. algorithm will be discussed. Daniel Ruprecht Franz Chouly School of Mechanical Engineering Besancon University of Leeds [email protected] [email protected]

Matteo Astorino Ecole Pol. Fed. de Lausanne MS14 [email protected] Combining Software Pipelining with Numerical Pipelining in the Conjugate Gradient Algorithm Alexei Lozinski Laboratoire de Math´ematiques de Besan¸con Modern hardware architectures provide an extremely high CNRS et Universit´e de Franche-Comt´e level of concurrency. One the one hand, new program- [email protected] ming models such as task-based methods are being in- vestigated by the HPC community in order to pipeline tasks at a fine-grain level. On the other hand, new nu- Alfio Quarteroni merical methods are being studied in order to relieve nu- Ecole Pol. Fed. de Lausanne merical synchronization points of most inner-most kernels Alfio.Quarteroni@epfl.ch used in simulations. In this talk, we propose a formula- tion of the Conjugate Gradient algorithm that combines software pipelining through task-based programming with MS13 numerical pipelining through the use of a recently intro- Fault-Tolerance of the Parallel-in-Time Integration duce latency-hiding algorithm. We study the potential of PP16 Abstracts 87

themethodonmodernmulti-GPU,multicoreandhetero- we tackle the task granularity problem and we propose ag- geneous arcthitectures. gregating several CPUs in order to execute larger parallel tasks and thus find a better equilibrium between the work- Emmanuel Agullo load assigned to the CPUs and the one assigned to the INRIA GPUs. To this end, we rely on the notion of scheduling [email protected] contexts in order to isolate the parallel tasks and thus del- egate the management of the task parallelism to the inner Luc Giraud scheduling strategy. We demonstrate the relevance of our Inria approach through the dense Cholesky factorization kernel Joint Inria-CERFACS lab on HPC implemented on top of the StarPU task-based runtime sys- [email protected] tem. We allow having parallel elementary tasks and using Intel MKL parallel implementation optimized through the Stojce Nakov use of the OpenMP runtime system. We show how our INRIA approach handles the interaction between the StarPU and [email protected] the OpenMP runtime systems and how it exploits the par- allelism of modern accelerator-based machines. We present experimental results showing that our solution outperforms MS14 state of the art implementations to reach a peak perfor- Rounding Errors in Pipelined Krylov Solvers mance of 4.5 TFlop/s on a platform equipped with 20 CPU cores and 4 GPU devices. Abstract not available Terry Cojean Siegfried Cools INRIA University of Antwerpen [email protected] Belgium [email protected] MS15 Task-Based Multifrontal QR Solver for GPU- MS14 Accelerated Multicore Architectures Espreso: ExaScale PaRallel Feti Solver In this talk we present the design of task-based sparse di- This talk will present a new approach for efficient imple- rect solvers on top of runtime systems. In the context of the mentation of the FETI and Hybrid FETI solver for many- qr mumps solver, we prove the usability and effectiveness core accelerators using Schur complement (SC). By using of our approach with the implementation of a sparse ma- the SC solver can avoid working with sparse Cholesky fac- trix multifrontal factorization based on a Sequential Task tors of the stiffness matrices. Instead a dense structure flow (STF) parallel programming model. Using this pro- is computed and used by the iterative solver. The main gramming model, we developed features such as the inte- bottleneck of this method is the computation of the SC. gration of dense 2D Communication Avoiding algorithms This bottleneck has been significantly reduced by using in the multifrontal method allowing for better scalability new PARDISO-SC sparse direct solver. compared to the original approach used in qr mumps. Fur- thermore, following this approach, we show that we are Lubomir Riha capable of exploiting heterogeneous architectures with the VSB Technical University of Ostrava design of data partitioning and a scheduling algorithms ca- [email protected] pable of handling the heterogeneity of resources.

Tomas Brzobohaty, Alexandros Markopoulos Florent Lopez IT4Innovations National Supercomputing Centre Universit´e Paul Sabatier - IRIT (ENSEEIHT) VSB Technical University of Ostrava fl[email protected] [email protected], alexan- [email protected] MS15 An Effective Methodology for Reproducible Re- MS14 search on Dynamic Task-Based Runtime Systems Notification Driven Collectives in Gaspi Reproducibility of experiments and analysis by others is Abstract not available one of the pillars of science. Yet, the description of experi- mental protocols (particularly in computer science) is often Christian Simmendinger incomplete and rarely allows for reproducing a study. In T-Systems this talk, we introduce the audience to the notions of repli- [email protected] cability, repeatability and reproducibility and we discuss their relevance in the context of HPC research. Then, we present an effective approach tailored to the specificities of MS15 studying HPC systems. Exploiting Two-Level Parallelism by Aggregating Computing Resources in Task-Based Applications Arnaud Legrand, Luka Stanisic Over Accelerator-Based Machines INRIA Grenoble Rhˆone-Alpes [email protected], [email protected] Computing platforms are now extremely complex provid- ing an increasing number of CPUs and accelerators. This trend makes balancing computations between these het- MS15 erogeneous resources performance critical. In this paper High Performance Task-Based QDWH-eig on 88 PP16 Abstracts

Manycore Systems [email protected]

Current dense symmetric eigensolver implementations suf- Torsten Hoefler fer from the lack of concurrency during the reduction ETH Z¨urich step toward a condensed tridiagonal matrix and typically [email protected] achieve a small fraction of theoretical peak performance. Indeed, whether the reduction to tridiagonal form em- ploys a single or multi-stage approach as in the state-of- MS16 the-art numerical libraries, the memory-bound nature of some of the computational kernels is still an inherent prob- Reproducible and Accurate Algorithms for Numer- lem. A novel dense symmetric eigensolver, QDWH-eig, ical Linear Algebra has been presented by Y. Nakatsukasa and N. J. Higham (SIAM SISC, 2012), which addresses these drawbacks with On modern multi-core, many-core, and heterogeneous a highly parallel and compute-bound numerical algorithm. architectures, floating-point computations, especially re- However, the arithmetic complexity is an order of magni- ductions, may become non-deterministic and, therefore, tude higher than the standard algorithm, which makes it non-reproducible mainly due to the non-associativity of unattractive, at first glance, even with todays relatively low floating-point operations and the dynamic scheduling. We cost of flops. By (1) implementing QDWH-eig framework address the problem of reproducibility in the context of by the DAG-based tile algorithms (fine-grained) to allow fundamental linear algebra operations – like the ones in- pipelining the various computational stages and to increase cluded in the Basic Linear Algebra Subprograms (BLAS) computational usage of all processing cores, (2) leveraging library – and propose algorithms that yields both repro- the execution of based tile implementation of QDWH-eig ducible and accurate results. We extend this approach to with GPUs, the QDWH-eig eigensolver becomes a compet- the higher level linear algebra algorithms, e.g. the LU fac- itive alternative. torization, that are built on top of these BLAS kernels. We present these reproducible and accurate algorithms for the Dalal Sukkari BLAS routines and the LU factorization as well as their im- KAUST plementations in parallel environments such as Intel server [email protected] CPUs, Intel Xeon Phi, and both NVIDIA and AMD GPUs. We show that the performance of the proposed implemen- Hatem Ltaief tations is comparable to the standard ones. King Abdullah University of Science & Technology (KAUST) Roman Iakymchuk [email protected] UPMC - LIP6 [email protected] David E. Keyes KAUST David Defour [email protected] DALI/LIRMM University of Perpignan [email protected] MS16 Bit-Reproducible Climate and Weather Simula- Sylvain Collange tions: From Theory to Practice INRIA – Rennes [email protected] The problem of reproducibility in floating-point numerical simulations is an active research topic due to the many ad- Stef Graillat vantages of reproducible computations, ranging from the University Pierre and Marie Curie (UPMC) simplification of testing and debugging to the deployment LIP6 of applications on heterogeneous systems maintaining nu- [email protected] merical consistency. A reproducible simulation always pro- duces the same result for the same settings regardless of the system it is run on and the configuration of the soft- MS16 ware. We show principles for achieving bit reproducibil- ity with a case study of the COSMO climate and weather Parallel Interval Algorithms, Numerical Guaran- code in the context of the recently started crClim project. tees and Numerical Reproducibility We use bit reproducibility to develop a fine-resolution cli- mate model for next-generation supercomputers which ne- This talk focuses on the parallel implementation of inter- cessitates high-performance bit-reproducibility. We discuss val algorithms. A first issue is to preserve the inclusion sources of non-reproducibility and our strategies for solving property, that is, the guarantee that a result computed them. We will also show how these techniques affect the using interval arithmetic contains the exact result. We performance of the application. Overall, we aim at demys- will illustrate the fact that the inclusion property is at risk tifying the belief that determinism is too hard to obtain in with parallel implementations and give some hints to pre- high-performance codes and show how reproducibility can serve the inclusion property even for parallel interval al- be a desirable feature for a numerical model. gorithms. A second issue concerns the ir-reproducibility of numerical computations, i.e. the fact that the same Andrea Arteaga numerical computations deliver different results depending ETH Zurich on the environment. We will detail the impact of numer- [email protected] ical ir-reproducibility on interval computations and point to several developments that remedy this problem. Christophe Charpilloz, Olive Fuhrer MeteoSwiss Nathalie Revol [email protected], INRIA - LIP, Ecole Normale Superieure de Lyon PP16 Abstracts 89

[email protected] [email protected]

MS16 MS17 Summation and Dot Products with Reproducible Hierarchical Distributed and Parallel Program- Results ming for Auto-Tunned Extreme Computing

It can be shown that under very general conditions floating- Exascale hypercomputers will have highly hierarchical ar- point summation cannot be associative. Basically, a fixed chitectures with nodes composed by lot-of-core processors point system is the only exception. Nevertheless summa- and, often, accelerators ; mixing both distributed and par- tion and dot product algorithms with reproducible results allel computing paradigms. Methods have to be redesigned can be derived, for example using so-called error-free trans- and some will be rehabilitated or hybridized. The different formations (EFT). We discuss several possibilities of al- programming levels will generate new difficult algorithm is- gorithms with identical results regardless of the order of sues and numerical methods would have to be auto-tuned summation, blocking or other techniques to speed up sum- at runtime to accelerate convergence, minimize energy con- mation. sumption and scalabilities. In this talk, we present a short survey of auto-tuning of such methods and show that we Siegfried M. Rump may use high level graph of component language, associ- TU Hamburg-Harburg ated with multilevel programming paradigm, to help sci- [email protected] entists to develop efficient software.

Serge G. Petiton MS17 University Lille 1, Science and Technologies - CNRS Parallel Generator of Non-Hermitian Matrices serge.petiton@lifl.fr with Prescribed Spectrum and Shape Control for Extreme Computing MS17 When designing eigensolvers it is necessary to provide Scalable Sparse Solvers on GPUS backtesting sample matrices whose spectrum is known be- forehand even with constraints on the matrices. We derive The sparse direct method we present is a multifrontal QR a method that produces rigorously non-hermitian matrixes factorization intended specifically for GPU accelerators. with an exact control of its eigenvalues, shape, sparcity. Our approach relies on the use of a bucket scheduler that The method is parallel and scalable in very high dimen- exploits an irregular parallelism on both a coarse grain, sion. Moreover we show that it is possible to reach absolute among a set of fronts with different characteristics, and arithmetic precision by avoiding all potential troncatures on a fine grain, through the exploitation of the staircase coming from number representations in the computer. shape of these fronts. The scheduler then relies on dense GPU kernels which design and implementation target re- Herve Galicher cent GPU architectures. KAUST [email protected] Timothy A. Davis, Wissam M. Sid-Lakhdar Texas A&M France Boillod Computer Science and Engineering CNRS/LIFL [email protected], [email protected] [email protected] MS18 Serge G. Petiton Space-Filling Curves and Adaptive Meshes for University Lille 1, Science and Technologies - CNRS Oceanic and Other Applications serge.petiton@lifl.fr sam(oa)2 (space-filling curves and adaptive meshes for Christophe Calvin oceanic and other applications) is a software package for CEA-Saclay/DEN/DANS/DM2S parallel adaptive mesh refinement on structured triangu- [email protected] lar grids created via newest vertex bisection. Exploiting Sierpinski order of the grid cells, it features high memory efficiency and throughput as well as efficient dynamic load MS17 balancing on shared and distributed memory. We will re- Monte Carlo Algorithms and Stochastic Process on port on recent extensions of sam(oa)2, such as implementa- Manycore Architectures tion of 3D problems (with 2D adaptivity) and optimization for latest hardware arcitectures, and show application sce- Monte Carlo algorithms are well-known candidates for narios from tsunami simulation and porous media flow. massively parallel architectures. Nevertheless, from the- ory to implementation on many core architectures, there Michael Bader always some gaps. We will illustrate in this talk two ex- Technische Universit¨at M¨unchen amples of stochastic algorithms and their practical imple- [email protected] mentation onto many core architecture. The 1st one comes from Monte Carlo neutron transport and the 2nd is the in- vestagation on potential benefits of stochastic algorithms MS18 forsolvinglinearsystems. ADER-DG on Spacetrees in the ExaHyPE Project

Christophe Calvin We introduce a first version of ExaHyPE, a spacetree-based CEA-Saclay/DEN/DANS/DM2S AMR engine for hyperbolic equation system solvers. It 90 PP16 Abstracts

combines multilevel, higher-order ADER-DG with patch- ing of some of the problems necessitates to take the di- based finite volumes. The spacetree either acts as compute rectionally into account, which adds additional constraints or as organisational data structure - either to host the that cannot be easily addressed in the current state-of-the- higher order shape functions or to organise the patches. art partitioning methods and tools. In this talk, we will Our presentation introduces a reordering of the classic discuss some example problems, models and potential so- ADER-DG solver phases that allows us to realise the algo- lution approaches for them. rithm with a pipelined single-touch policy, to process cells concurrently and to overlap MPI data exchange with the Umit V. Catalyurek actual computation. The Ohio State University Department of Biomedical Informatics Dominic E. Charrier [email protected] Durham University [email protected] Kamer Kaya Sabanci University, Angelika Schwarz Faculty of Engineering and Natural Sciences Technische Universit¨at M¨unchen [email protected] [email protected] Bora Ucar Tobias Weinzierl LIP, ENS-Lyon, France. Durham University [email protected] [email protected]

MS19 MS18 Parallel Graph Coloring on Manycore Architec- A Tetrahedral Space-Filling Curve Via Bitwise In- tures terleaving In scientific computing, the problem of finding sets of in- We introduce a space-filling curve for tetrahedral red- dependent tasks is usually addressed with graph color- refinement that can be computed using bitwise interleav- ing. We study performance portable graph coloring al- ing operations similar to the well-known Morton curve for gorithms for many-core architectures. We propose a novel cubical meshes. We present algorithms that compute the edge-based algorithm and enhancements of the speculative parent, children and face-neighbors of a given mesh element Gebremedhin-Manne algorithm that exploit architectures. in constant time, as well as the next and previous element We show superior quality and execution time of the pro- in the space-filling curve and whether a given element is posed algorithms on GPUs and Xeon Phi compared to pre- on the domain boundary. We close with numerical results vious work. We present effects of coloring on applications obtained with our implementation. such as Gauss-Seidel preconditioned solvers.

Johannes Holke, Carsten Burstedde Mehmet Deveci Universit¨at Bonn The Ohio State University [email protected], [email protected] [email protected]

MS18 Erik G. Boman Center for Computing Research Nonconforming Meshes and Mesh Adaptivity with Sandia National Labs Petsc [email protected] Efficient mesh adaptivity requires a sensible combination of data structures and numerical algorithms. Unlike AMR Siva Rajamanickam frameworks that try to control both components (structure Sandia National Laboratories on the outside, numerics on the inside), this talk will dis- [email protected] cuss work on using PETSc to drive AMR libraries that pro- vide only data structures (numerics on the outside, struc- ture on the inside). An implementation of this interface MS19 using the p4est library will be presented. Parallel Approximation Algorithms for b-Edge Covers and Data Privacy Toby Isaac ICES, University of Texas at Austin We propose a new 3/2-approximation algorithm, called [email protected] LSE for computing b-Edge Cover and it’s application to a data privacy problem called adaptive k-Anonymity. b-Edge Matthew G. Knepley Cover is a special case of the well-known Set Multicover Rice University problem and also a generalization of Edge Cover problem [email protected] *Preferred in graphs. The objective is to choose a subset of C edges in the graph with weights on the edges, such that at least a specified number b(v)ofedgesinC are incident on each MS19 vertex v and the sum of edge weights is minimized. We im- Directed Graph Partitioning plement the algorithm on serial and shared-memory paral- lel processors and compare it’s performance against a col- In scientific computing directed graphs are commonly used lection of inherently sequential approximation algorithms for modeling dependencies among entities. However, while that have been proposed for the Set Multicover problem. modeling some of the problems as graph partitioning prob- With LSE, i) we propose the first shared-memory paral- lems, directionality is generally ignored. Accurate model- lel algorithm for the adaptive k-Anonymity problem and PP16 Abstracts 91

ii) give new theoretical results regarding privacy guaran- University of Nebraska, Omaha tees which are significantly stronger than the best known [email protected] previous results.

Arif Khan MS20 Purdue University The Structure of Communities in Social Networks [email protected] Within the area of social network analysis, community de- Alex Pothen tection and understanding community structure are criti- Purdue University cal to a variety of fields. However, there is little consen- Department of Computer Science sus as to what a community actually is, and characteriz- [email protected] ing the structure of real communities is an important re- search challenge. I present a machine-learning framework to better understand the structure of real-world communi- MS19 ties. Application of this framework to real networks gives Clustering Sparse Matrices with Information from us surprising insight into the structure of communities. Both Numerical Values and Pattern Sucheta Soundarajan Considering any square fully indecomposable matrix A, we Department of Computer Science can apply a two-sided diagonal scaling to |A| to render it Rutgers University into doubly stochastic form. The Perron-Frobenius theo- [email protected] rem is a key tool to exploit and we aim to use spectral properties of doubly stochastic matrices to reveal hidden block structure in matrices. We also combine this with MS20 classical graph analysis techniques to design partitioning Parallel Algorithms for Analyzing Dynamic Net- algorithms for large sparse matrices based on both numer- works ical values and pattern information. We present, a parallel graph sparsification-based frame- Iain Duff work for updating the topologies of dynamic large-scaled Rutherford Appleton Laboratory, Oxfordshire, UK OX11 networks. The challenge in computing dynamic network 0QX properties is to update them without recomputing the en- and CERFACS, Toulouse, France tire network from scratch. This process becomes more chal- iain.duff@stfc.ac.uk lenging in the parallel domain, because the concurrency in the updates is difficult to determine. To address this issue, Philip Knight we use a technique known as graph sparsification, that di- University of Strathclyde, Glasgow, Scotland vides the graph into subgraphs. The updates can be per- [email protected] formed concurrently on each graph and then aggregated over in a reduction-like format over the entire path. We show that this process can provide a general framework Sandrine Mouysset for developing scalable algorithms for updating different Universit´e deToulouse, UPS-IRIT, France topologies, including MST and graph connectivity. [email protected] Sanjukta Bhowmick Daniel Ruiz Department of Computer Science INPT-IRIT-Universit´e deToulouse, France University of Nebraska, Omaha [email protected] [email protected]

Bora Ucar Sriram Srinivasan LIP, ENS-Lyon, France. University of Nebraska at Omaha [email protected] [email protected]

Sajal Das MS20 Missouri Univeristy of Science and Technology Performance and Power Characterization of Net- [email protected] work Algorithms

We present a comparative analysis of the performance MS20 and power/energy attributes of several community de- Generating Random Hyperbolic Graphs in Sub- tection algorithms. We consider the Louvain, Walk- quadratic Time Trap, and InfoMap algorithms on several synthetic and real-world networks. We introduce a new fine-grained NetworKit is our open-source toolkit for large-scale net- power/performance measurement infrastructure that is work analysis. Besides numerous analytics kernels it con- used to estimate tradeoffs among performance, power, and tains various graph generators. Random hyperbolic graphs accuracy of community detection algorithms. are a promising family of scale-free geometric graphs using hyperbolic geometry. We provide the first generation al- Boyana Norris gorithm for random hyperbolic graphs with subquadratic Argonne National Laboratory running time. Using the code available in NetworKit, net- [email protected] works with billions of edges can be generated in a few min- utes on a 16-core workstation. Sanjukta Bhowmick Department of Computer Science Moritz von Looz 92 PP16 Abstracts

Karlsruhe Institute of Technology (KIT) mension. We discuss advantages and disadvantages of the Institute of Theoretical Informatics different approaches and their benefit compared to tradi- [email protected] tional space-parallel algorithms with sequential time step- ping in a parallel computational environment. Henning Meyerhenke Karlsruhe Institute of Technology Stephanie Friedhoff Institute for Theoretical Informatics, Parallel Computing University of Leuven [email protected] stephanie.friedhoff@uni-koeln.de

Roman Prutkin Robert D. Falgout, Tzanio V. Kolev Karlsruhe Institute of Technology (KIT) Center for Applied Scientific Computing Institute of Theoretical Informatics Lawrence Livermore National Laboratory [email protected] [email protected], [email protected]

Scott Maclachlan MS21 Department of Mathematics and Statistics Improving Applicability and Performance for the Memorial University of Newfoundland Multigrid Reduction in Time (mgrit) Algorithm [email protected]

Since clock speeds are no longer increasing, time integra- Jacob B. Schroder tion is becoming a sequential bottleneck. In this talk, we Lawrence Livermore National Laboratory discuss new advances in the MGRIT algorithm and corre- [email protected] sponding code XBraid. MGRIT can be easily integrated into existing codes, but more intrusive approaches allow for Stefan Vandewalle better performance. We present developments in XBraid to Department of Computer Science broaden its applicability (e.g., support for temporal adap- Katholieke Universiteit Leuven tivity) and improve efficiency by allowing for varying levels [email protected] of intrusiveness, dictated by the user’s code.

Robert D. Falgout MS21 Center for Applied Scientific Computing Convergence Rate of Parareal Method with Modi- Lawrence Livermore National Laboratory fied Newmark-Beta Algorithm for 2nd-Order Ode [email protected] In this study, we introduce a modified Newmark-beta algo- Veselin Dobrev rithm (MNBA) as a time integrator in the parareal method Lawrence Livermore National Laboratory for 2nd-order ODEs. The MNBA allows us to calculate [email protected] a highly accurate amplitude and phase regardless of time step width. Therefore, the MNBA is effective for coarse- Tzanio V. Kolev solver in the parareal algorithm. The accurate calculation Center for Applied Scientific Computing is accomplished by a phase correction coefficient. The com- Lawrence Livermore National Laboratory puted results showed excellent convergence rate for simple [email protected] harmonic motion with the MNBA. Mikio Iizuka Ben O’Neill RIKEN University of Colorado, Boulder [email protected] [email protected] Kenji Ono Jacob B. Schroder VCAD System Research Program, RIKEN Lawrence Livermore National Laboratory Japan [email protected] [email protected] Ben Southworth University of Colorado, Boulder MS21 [email protected] PinT for Climate and Weather Simulation

Ulrike Meier Yang Over the last decade, the key component for increasing the Lawrence Livermore National Laboratory compute performance for HPC systems was an increase in [email protected] data parallelism. However, for simulations with a fixed problem size, this increase in data parallelism clearly leads to circumstances with the communication latency domi- MS21 nating the overall compute time which again results in a Multigrid Methods with Space-Time Concurrency stagnation or even decline of scalability. For weather sim- ulations, further requirements on wallclock time restric- One feasible approach for doing effective parallel-in-time tions are given and exceeding these restrictions would make integration is with multigrid. In this talk, we consider the these simulation results less beneficial. For such circum- comparison of the three basic options of multigrid algo- stances, the so-called parallelization-in-time method pro- rithms on space-time grids that allow space-time concur- vide a promising approach. We present our recent results rency: coarsening in space and time, semicoarsening in the and methods on applying the Parareal approach as one of spatial dimensions, and semicoarsening in the temporal di- the parallelization-in-time method to a prototype of nu- PP16 Abstracts 93

merical weather/climate simulations. Library BEM4I

Martin Schreiber We present a library of parallel boundary element based University of Exeter solvers for solution of engineering problems. The library College of Engineering, Mathematics and Physical supports solution of problems modelled by the Laplace, Sciences Helmholtz, Lame, and time-dependent wave equation. In [email protected] our contribution we focus on the acceleration of the sys- tem matrices assembly using the Intel Xeon Phi coproces- Pedro S. Peixoto sors and parallel solution of linear elasticity problems using University of S˜ao Paulo, Brazil the boundary element tearing and interconnecting method [email protected] (BETI) in cooperation with the ESPRESO library. Michal Merta Terry Haut VSB CZ Los Alamos National Laboratory [email protected] [email protected] Jan Zapletal Beth Wingate IT4Innovations University of Exeter VSB-Technical University of Ostrava [email protected] [email protected]

MS22 MS22 Increasing Arithmetic Intensity Using Stencil HPGMG-FV Benchmark with GASPI/GPI-2: Compilers. First Steps

Stencil calculations and matrix-free Krylov sub-space The communication library GASPI/GPI is thread based solvers represent an important class of kernels in many sci- and offers a full flexibility to follow the task-model paral- entific computing applications. In such types of solvers, lelization approach, being free from the restrictions of the applications of stencil kernels are often the dominant part standard hybridization with MPI and threads/OpenMP. of the computation, and an efficient parallel implementa- Here, after replacing the MPI-communication by a thread- tion of the kernel is therefore crucial in order to reduce the parallel GPI-communication in the prolongation module time to solution. Inspired by polynomial preconditioning, of HPGMG, we compare the timings of the original im- we increase the arithmetic intensity of this Krylov subspace plementation and the new one. Our results show that building block by replacing the matrix with a higher-degree GASPI/GPI might be an appropriate choice for the hi- matrix polynomial. This allows for better use of SIMD erarchically parallel architectures nowadays. vectorization and as a consequence shows better speed-ups on the latest hardware. Portability is made possible with Dimitar M. Stoyanov the use of state- of-the-art stencil compiler programs, and ITWM Fraunhofer an extension to a distributed-memory environment with [email protected] local shared-memory parallelism is demonstrated. Thus, we demonstrate an effective, portable approach which can easily be adapted to extract high performance from newly MS23 developed hardware. FE2TI - Computational Scale Bridging for Dual- Phase Steels Simplice Donfack USI The computational scale bridging approach (FE2TI) com- [email protected] bines the well-known FE2 method with the FETI-DP do- main decomposition method. This approach incorporates Patrick Sanan phenomena on the microscale into a macroscopic prob- USI - Universit`a della Svizzera italiana lem. FE2TI is used in the project ”EXASTEEL - Bridg- ICS ing Scales for Multiphase Steels” (part of the German [email protected] priority program SPPEXA) for simulations of dual-phase steels. Scalability results for three-dimensional nonlinear and micro-heterogeneous hyperelasticity problems are pre- Olaf Schenk sented, filling the complete Mira at Argonne National Lab- USI - Universit`a della Svizzera italiana oratory (786,432 BlueGene/Q cores). Institute of Computational Science [email protected] Martin Lanser Mathematical Institute Reps Bram Universit Universiteit Antwerpen [email protected] [email protected] Martin Lanser Wim Vanroose Mathematical Institute Antwerp University Universit¨at zu K¨oln, Germany [email protected] [email protected]

Axel Klawonn MS22 Universit¨at zu K¨oln Many Core Acceleration of the Boundary Element [email protected] 94 PP16 Abstracts

Oliver Rheinbach archical Low Rank Structure Technische Universit¨at Bergakademie Freiberg [email protected] We describe a parallel direct solver based on a octree- nested condensation strategy and hierarchical compressed matrix representations for the linear algebra operations in- MS23 volved. The algorithm is asymptotically work-optimal and Title Not Available its overall memory requirements grow only linearly with problem size assuming bounded ranks of off-diagonal ma- Abstract not available trix blocks. The algorithm exposes substantial concurrency in the octree traversals. We show its parallel efficiency with Steffen Muething an implementation running on multi-thousand cores on the Universitaet Heidelberg KAUST Shaheen supercomputer. steff[email protected] Gustavo Chavez KAUST MS23 [email protected] Extreme-Scale Solver for Earth’s Mantle – Spec- tral/Geometric/Algebraic Multigrid Methods for Hatem Ltaief Nonlinear, Heterogeneous Stokes Flow King Abdullah University of Science & Technology We present an implicit solver that exhibits optimal al- (KAUST) gorithmic performance and is capable of extreme scal- [email protected] ing for hard PDE problems, such as mantle convec- tion. To minimize runtime the solver advances in- George M Turkiyyah, George M Turkiyyah clude: aggressive mesh adaptivity, mixed continuous- American University of Beirut discontinuous discretization with high-order accuracy, hy- [email protected], brid spectral/geometric/algebraic multigrid, and novel [email protected] Schur-complement preconditioning. We demonstrate such algorithmically optimal implicit solvers that scale out to 1.5 David E. Keyes million cores for severely nonlinear, ill-conditioned, hetero- KAUST geneous, and anisotropic PDEs. [email protected] Johann Rudi The University of Texas at Austin MS24 The University of Texas at Austin [email protected] Performance Optimization of a Massively Paral- lel Phase-Field Method Using the Hpc Framework Walberla A. Cristiano I. Malossi IBM Research - Zurich We present a massively parallel phasefield code, based on [email protected] the HPC framework waLBerla, for simulating solidification processes of ternary eutectic systems. Different patterns Tobin Isaac are forming during directional solidification of ternary eu- The University of Chicago tectics, depending on the physical parameters, with signif- [email protected] icant influence on the mechanical properties of the mate- rial. We show simulations of large structure formations, Georg Stadler which give rise to spiral growth. Our implementation of Courant Institute for Mathematical Sciences this method in the HPC framework waLBerla uses explicit New York University vectorization, asynchronous communication and I/O band- [email protected] with reduction techniques. Optimization and scaling re- sults are presented for simulations run on three German Michael Gurnis supercomputers: SuperMUC, Hornet and JUQUEEN. Caltech [email protected] Martin Bauer University of Erlangen-Nuremberg PeterW.J.Staar,YvesIneichen,CostasBekas Department of Computer Science IBM Research - Zurich [email protected] [email protected], [email protected], [email protected] Harald K¨ostler Universit¨at Erlangen-N¨urnberg Alessandro Curioni Department of Computer Science 10 (System Simulation) IBM Research-Zurich [email protected] [email protected] Ulrich R¨ude Omar Ghattas Friedrich-Alexander-University Erlangen-Nuremberg, The University of Texas at Austin Germany [email protected] [email protected]

MS23 MS24 3D Parallel Direct Elliptic Solvers Exploiting Hier- Large Scale phase-field simulations of directional PP16 Abstracts 95

ternary eutectic solidification entropy production, and the convergence of the dynamic solutions to the equilibrium solutions; and avoids the ap- Materials with defined properties are crucial for the de- pearance of spurious phases on binary interfaces. The new velopment and optimization of existing applications. Dur- theory is tested for multi-component liquid phase separa- ing the directional solidification of ternary eutectic alloys, tion (with/without hydrodynamics) and grain boundary a wide range of patterns evolve. This patterns are re- roughening. lated to the mechanical properties of the material. Large scale phase-field simulation, based on a grand-potential ap- Gyula I. Toth proach, allow to study the underlying physical processes Department of Physics and Technology of microstructure evolution. Based on the massive par- University of Bergen, NORWAY allel waLBerla framework, a highly optimized solver was [email protected] developed. Large phase-field simulations of the ternary system Al-Ag-Cu and the three-dimensional structure are Laszlo Granasy presented and visually and quantitatively compared to ex- Wignes Research Centre for Physics periments. Budapest, Hungary [email protected] Johannes H¨otzer Hochschule Karlsruhe Technik und Wirtschaft, Germany Tamas Pusztai tba Wigner Research Centre for Phyics Budapest, Hungary Philipp Steinmetz, Britta Nestler [email protected] Karlsruhe Institute of Technology (KIT) Institute of Applied Materials (IAM-ZBS) Bjorn Kvamme [email protected], [email protected] Department of Physics and Technology University of Bergen MS24 [email protected] Building High Performance Mesoscale Simulations using a Modular Multiphysics Framework MS25 The phasefield module included with MOOSE (the open- PMI-Exascale: Enabling Applications at Exascale source Multiphysics Object Oriented Simulation Environ- The PMI-Exascale (PMIx) community is chartered with ment) allows for the rapid creation and evolution of a wide providing an open source, standalone library that supports variety of multiphysics mesoscale simulations. Researchers a wide range of programming models at exascale and be- utilizing this tool have access to powerful finite element yond. Critical to the success of these large systems is the capabilities while enjoying highly scalable runs using the ability to rapidly start applications, and the emerging part- built-in hybrid parallelism through the underlying frame- nership between resource managers and applications for work. An overview of the advanced capabilities of the managing workflows. This talk will provide an overview of phasefield module will be provided including our parsed PMIx objectives, and a status report on the community’s kernel and material system, fully-coupled linkage to other progress. physics modules, and advances in algorithmic development with the grain tracking capability. Ralph Castain Intel Cody Permann [email protected] Center for Advanced Modeling and Simulation National Laboratory [email protected] MS25 Middleware for Efficient Performance in HPC

MS24 Two middleware components are attracting significant at- Equilibrium and Dynamics in Multiphase-Field tention today given their notable contributions to the effi- Theories: A Comparative Study and a Consistent ciency of HPC systems in terms of energy and throughput, Formulation while also improving performance. The first is the use of ”local” storage hierarchy, sometimes referred to as ”Burst We present a new multiphase-field theory for describing Buffers”, though the functionality goes beyond the task of interface driven phase transitions in multi-domain and/or buffering checkpoint files. The second is the deployment of multi-component systems. We first analyize the most advance power management functionality, from the system widespread multiphase theories presently in use, to iden- level, down through jobs, and into applications themselves. tify their advantages/disadvantages. Then, based on the results of the analysis, a new way of constructing the free energy surface is introduced, and a generalized multiphase Larry Kaplan description is derived for arbitrary number of phases (or Cray domains). The construction of the free energy functional [email protected] and the dynamic equations is based on criteria that en- sure mathematical and physical consistency. The presented approach retains the variational formalism; reduces (or ex- MS25 tends) naturally to lower (or higher) number of fields on the Power Aware Resource Management level of both the free energy functional and the dynamic equations; enables the use of arbitrary pairwise equilib- Maximizing application performance under a power con- rium interfacial properties; penalizes multiple junctions in- straint is a key research area for exascale supercomput- creasingly with the number of phases; ensures non-negative ing. Current HPC systems are designed to be worst-case 96 PP16 Abstracts

power provisioned, leading to under-utilized power and lim- [email protected] ited application performance. In this talk I will present RMAP, a low-overhead, power-aware resource manager im- plemented in SLURM that addresses the aforementioned MS26 issues. RMAP uses hardware overprovisioning and power- Simulation of Incompressible Flows on Parallel Oc- aware backfilling to improve job turnaround times by up tree Grids to 31% when compared to traditional resource managers. We present a second-order accurate solver for the incom- Tapasya Patki pressible Navier-Stokes equations on parallel Octree grids. Lawrence Livemore National Laboratory The viscosity is treated implicitly and we use a stable pro- [email protected] jection method. We rely on the p4est library to handle the distributed Octree structure and capture the presence of irregular interfaces with the level-set method. This leads MS25 to a scalable, stable and accurate solver for incompressible Co-Scheduling: Increasing Efficency for HPC flows past complex geometries.

An approach to improve efficiency on HPC systems with- Arthur Guittet out changing applications is co-scheduling, i.e., more than UCSB one job is executed simultaneously on the same nodes of a [email protected] system. If co-scheduled applications use different resource types, improved efficiency can be obtained. However, ap- Frederic G. Gibou plications may also slow down each other when sharing UC Santa Barbara resources. This talk presents our research activities and [email protected] results within FAST, a three year project funded by the German ministry of research and education. MS26 Josef Weidendorfer Basilisk: Simple Abstractions for Octree-Adaptive LRR-TUM Schemes TU M¨unchen [email protected] I will present the basic design and algorithms of a new octree-adaptive free software library: Basilisk (http://basilisk.fr). I will show how using simple stencil- MS26 based abstractions, one can very simply construct a wide Fast Solvers for Complex Fluids range of octree-adaptive numerical schemes including: multigrid solvers for elliptic problems, volume-of-fluid ad- Complex fluids are challenging to simulate as they are char- vection schemes for interfacial flows and high-order flux- acterized by stationary or evolving microstructure. We will based schemes for systems of conservation laws. The auto- discuss the discretization of integral equation formulations matic and transparent generalisation of these abstractions for complex fluids. In particular, we will discuss the formu- to large-scale distributed parallel systems will also be dis- lation, numerical challenges and scalability of algorithms cussed. for volume integral equations and we will present a new open-source library for such problems. We will compare Stephane Popinet their performance to other state-of-the art codes. National Institute of Water and Atmospheric Research NIWA George Biros [email protected] University of Texas at Austin [email protected] MS27 Dhairya Malhotra Tradeoffs in Domain Decomposition: Halos Vs Institute of Computational Engineering and Sciences Shared Contiguous Arrays The University of Texas at Austin Abstract not available [email protected] Jed Brown Argonne National Laboratory; University of Colorado MS26 Boulder Recent Parallel Results Using Dynamic Quad-Tree [email protected] Refinement for Solving Hyperbolic Conservation Laws on 2d Cartesian Meshes MS27 We discuss recent parallel results using ForestClaw, an Firedrake: Burning the Thread at Both Ends adaptive code based on the Berger-Oliger-Colella patch- based scheme for solving hyperbolic conservation laws cou- Abstract not available pled with the quad/octree mesh refinement available in the p4est library. We show how parallel communication costs Gerard J Gorman vary across different refinement strategies and patch sizes Department of Earth Science and Engineering and detail the strategies we use to minimize these costs. Imperial College London Results will be shown for several hyperbolic problems, in- [email protected] cluding shallow water wave equations on the cubed sphere.

Donna Calhoun MS27 Boise State University Threads: the Wrong Abstraction and the Wrong PP16 Abstracts 97

Semantic [email protected]

MPI+OpenMP is frequently proposed as the right evolu- tionary programming model for exascale. Unfortunately, MS28 the evolutionary introduction of OpenMP into existing Preconditioning Using Hierarchically Semi- MPI-only codes is fraught with difficulty. We will describe Separable Matrices and Randomized Sampling ”The Right Way” to do MPI+OpenMP and ultimately conclude that MPI+MPI is a more effective alternative for We present a software package, called STRUMPACK legacy codes. (Structured Matrices PACKage), to deal with dense or sparse rank structured matrices. Dense rank structured Jeff R. Hammond matrices occur for instance in the BEM, quantumchem- Intel Corporation istry, covariance and machine-learning problems. Our fully Parallel Computing Lab algebraic sparse solver or preconditioner uses rank struc- jeff [email protected] tured kernels on small dense submatrices. The computa- tional complexity and actual runtimes are much lower than Timothy Mattson for traditional direct methods. We show results for several Intel Corporation large problems on modern HPC architectures. [email protected] Pieter Ghysels Lawrence Berkeley National Laboratory Computational Research Division MS28 [email protected] Recent Directions in Extreme-Scale Solvers Francois-Henry Rouet Extreme-scale supercomputers have made extremely large Lawrence Berkeley National Laboratory scientific simulations possible, but extreme-scale linear [email protected] solvers is often a bottleneck. New ideas and algorithms are desperately needed to move forward. In this talk we Xiaoye Sherry Li review several topics of current interest that will be cov- Computational Research Division ered in this MS, such as, hierarchical low-rank solvers, fine- Lawrence Berkeley National Laboratory grained iterative incomplete factorizations (precondition- [email protected] ers), asynchronous iterations and communication-avoiding (or reducing) iterative solvers. MS28 Erik G. Boman Preconditioning Communication-Avoiding Krylov Center for Computing Research Methods Sandia National Labs [email protected] We have previously presented a domain decomposition framework to precondition communication-avoiding (CA) Krylov methods. This framework allowed us to develop a MS28 set of preconditioners that do not require any additional A New Parallel Hierarchical Preconditioner for communication beyond what the CA Krylov methods have Sparse Matrices already needed while improving their convergence rates. In this talk, we present our recent extensions to the framework We will present a new parallel preconditioner for sparse that can improve the performance of the preconditioners. matrices based on hierarchical matrices. The algorithm shares some elements with the incomplete LU factoriza- Ichitaro Yamazaki tion, but uses low-rank approximations to remove fill-ins. University of Tennessee, Knoxville The algorithm also uses multiple levels, in a way similar to [email protected] multigrid, to achieve fast global convergence. Moreover, lo- cal communication is needed only for eliminating boundary Siva Rajamanickam, Andrey Prokopenko nodes on every process. We will demonstrate performance Sandia National Laboratories and scalability results using various benchmarks. [email protected], [email protected]

Chao Chen Erik G. Boman Stanford University Center for Computing Research [email protected] Sandia National Labs [email protected] Eric F. Darve Stanford University Mark Hoemmen, Michael Heroux Mechanical Engineering Department Sandia National Laboratories [email protected] [email protected], [email protected]

Siva Rajamanickam Stan Tomov Sandia National Laboratories Computer Science Department [email protected] University of Tennessee [email protected] Erik G. Boman Center for Computing Research Jack J. Dongarra Sandia National Labs Department of Computer Science 98 PP16 Abstracts

The University of Tennessee have proven extremely successful in the past decade [email protected] through MapReduce, Pregel, etc. Can we construct a BSP platform that behaves predictably, is fully portable and re- silient, remains easy to program in, while not sacrificing MS29 performance? Deductive Verification of BSP Algorithms with Subgroups Albert-Jan N. Yzelman KU Leuven We present how to perform deductive verification of BSP Intel ExaScience Lab (Intel Labs Europe) algorithms with subgroup synchronisation. From BSP pro- [email protected] grams, we generate sequential codes for the condition gen- erator WHY. By enabling subgroups, the user can prove the correctness of programs that run on hierarchical ma- MS30 chines (e.g. clusters of multi-cores). We present how High Performance Computing in Micromechanics to generate proof obligations of MPI programs that only use collective operations. Our case studies are distributed Micromechanics can be characterized as a computation of state-space construction algorithms, the basis of model- the global (macrolevel) response of a heterogeneous ma- checking. terial possessing a complicated microstructure. It can be implemented via finite element method exploiting informa- Fr´ed´eric Gava tion on the microstructure provided by CT. The solution of Laboratoire d’Algorithmique, Complexit´e et Logique the finite element systems then requires high performance Universit´edeParisEst parallel solution methods. We shall discuss adaptation of [email protected] an algebraic two level solver exploiting additive Schwarz type approximate inverse for the solution of extreme large finite element systems. MS29 A BSP Scalability Analysis of 3D LU Factorization Radim Blaheta with Row Partial Pivoting Institute of Geonics, Academy of Sciences of the Czech Republic, Ostrava, Czech Republic Three dimensional matrix multiplication is intuitively no [email protected] more complex in terms of communication volume than the LU factorization without pivoting. It is known that it is O. Jakl in fact the same. In this presentation, we show that this Institute of Geonics, Academy of Sciences of the Czech communication volume lower bound can also be achieved Repub for the LU factorization with ”usual” row partial pivot- [email protected] ing. The bulk synchronous parallel (BSP) model is used to analyze the scalability of this algorithm and to discuss its J. Stary tradeoffs. Institute of Geonics, Academy of Sciences of the Antoine Petitet Czech Republic Huawei Technologies France [email protected] [email protected] I. Georgiev Institute of Information and Communication MS29 Technologies, BAS Computing Almost Asynchronously: What Prob- Sofia, Bulgaria lems Can We Solve in O(logP ) Supersteps? [email protected]

The elegance and utility of BSP stem from treating a par- allel computation as a sequential one, consisting of super- MS30 steps. In many non-trivial computations this number can Balanced Dense-Sparse Parallel Solvers for Nonlin- be independent of the input, depending only on the num- ear Vibration Analysis of Elastic Structures ber of processors p.Wegiveanoverviewofproblemsthat admit work-optimal BSP solutions using only O(log p)su- The dynamical analysis of elastic structures requires com- persteps: FFT, sorting, suffix array, selection and Longest putation of periodic solutions with reliable accuracy, natu- Common Subsequence, and present general techniques for rally leading to nonlinear models and large-scale systems. the design of such almost-asynchronous algorithms. A parallel algorithm for implementation of the shooting method is proposed and studied. It involves simultane- Alexander Tiskin ously basic matrix operations and solvers for sparse and Computer Science dense matrices. The computational complexity and par- University of Warwick allel efficiency are estimated. Numerical experiments on [email protected] hybrid HPC systems with accelerators are presented and the achieved scalability is analyzed.

MS29 Svetozar Margenov Towards Next-Generation High-Performance BSP Institute of Information and Communication Technologies Bulgarian Academy of Sciences, Sofia, Bulgari Bulk Synchronous Parallel allows structured programming, [email protected] ideally making algorithms run predictably, portably, and performantly. This talk focuses on the performance as- pect of BSP, in particular its interoperability with ease-of- MS30 programming and resiliance, both aspects of BSP which Parallel Block Preconditioning Framework and Ef- PP16 Abstracts 99

ficient Iterative Solution of Discrete Multi-Physics [email protected] Problems

MS31 In this presentation we discuss software engineering is- sues and parallel performance of the block precondition- Designing a New Masters Degree Program in Com- ing framework - a software middleware designed for simple putational and Data Science design and implementation of block preconditioners that arise in common discretisations of multi-physics problems. The Friedrich Schiller University Jena, founded in 1558, is a classical German university with around 20,000 stu- dents distributed over ten different departments. In win- Milan D. Mihajlovic ter 2014/2015, the university started a new master’s pro- Centre for Novel Computing, Department of Computer gram in Computational and Data Science that is designed Science to exploit the synergies between computational science for The University of Manchester large-scale problems and data-driven approaches to ana- [email protected] lyze massive data sets. The fundamental methodological basis of this interdisciplinary program consists of parallel processing and scalable algorithms. MS30 Martin Buecker Simulations for Optimizing Lightweight Structures Mathematics and Computer Science Department on Heterogeneous Compute Resources Friedrich Schiller University Jena [email protected] For lightweight structures in mechanical manufacturing, several parameters have to be determined, which is done in an optimization before production. The optimization is a MS31 complex simulation code comprising several CFD and FEM Supporting Data-Driven Businesses Through Tal- simulation tasks. The challenge is to execute this complex ented Students in KnowledgeEngineering@Work multi-modular code on parallel or distributed hardware en- vironments. Besides others, this leads to a specific schedul- To go beyond the traditional learning experience, Knowl- ing problem of the FEM tasks, which is the main topic of edgeEngineering@Work offers its young and talented stu- this talk. dents in Knowledge Engineering the opportunity to face real and challenging problems on the interface of applied Computer Science and Mathematics, while doing their Robert Dietze bachelor. This allows them to become professionals that TU Chemnitz, Germany are tailor made for the fast growing, highly competitive [email protected] and demanding market in which knowledge of how to han- dle large data sets is more and more important. Michael Hofmann Department of Computer Science Carlo Galuzzi Chemnitz University of Technology, Germany Dept. of Knowledge Engineering [email protected] Maastricht University [email protected] Gudula R¨unger TU Chemnitz, Germany [email protected] MS31 Blending the Vocational Training and University Education Concepts in the Mathematical Technical MS31 Software Developer (matse) Program Exploiting Synergies with Non-University Partners The challenging problems in natural and engineering sci- in CSE Education ences require a close cooperation between mathematicians, computer scientists and end-users. This talk presents The chances and obstacles to involving non-university part- the blend of a vocational training as Mathematical Tech- ners, in our case the federal research center in Jlich, in nical Software Developer (MATSE) and the training- graduate CSE education will be discussed. Although sig- accompanying bachelor program Scientific Programming. nificant flexibility may be required by both the univer- The graduates acquire the skills to handle mathematical sity and non-university partners in order to chomp with models as well as to find and implement solutions of simu- a joint working educational program, the benefits include lation, optimization and visualization problems. a sustainable communication platforms, strengthening of Benno Willemsen the research collaborations, and more opportunities for the IT Center graduates. One challenge is dealing with diverse back- RWTH Aachen University grounds of incoming students. Depending on the desired [email protected] graduate profile, a significant amount of curriculum flexi- bility may be necessary in order to harmonize the various skill sets. In addition, integrative instruction elements that MS32 combine various components of CSE are called for, includ- Parallel Decomposition Methods for Structured ing custom-designed laboratory modules and projects. MIP

Marek Behr Large-scale optimization with integer variables is a chal- RWTH Aachen University lenging problem with many scientific and engineering ap- Chair for Computational Analysis of Technical Systems plications, because of the complexity, nonlinearity and size 100 PP16 Abstracts

of the problems. However, certain classes of the prob- [email protected], [email protected] lems (e.g., stochastic and/or network optimization) have structures that allow decomposition of the problems. We present decomposition techniques and implementations ca- MS32 pable of running on high performance computing systems. How to Solve Open MIP Instances by Using Su- Numerical examples are provided using large-scale stochas- percomputers with Over 80,000 Cores tic unit commitment problems. In this talk, we introduce the Ubiquity Generator (UG) Kibaek Kim Framework, which is a software framework to parallelize Department of Industrial Engineering and Management a branch-and-bound based solver on a variety of parallel Sciences computing environments. UG gives us a systematic way Northwestern University to develop a parallel solver that can run on large-scale [email protected] distributed memory computing environments. A success story will be presented where we solve 14 open MIP in- stances from MIPLIB2003 and MIPLIB2010 using ParaS- MS32 CIP, a distributed-memory instantiation of UG using SCIP Parallel Optimization of Complex Energy Systems as MIP solver, on up to 80,000 cores of supercomputers. on High-Performance Computing Platforms Yuji Shinano Zuse Institute Berlin We present scalable algorithms and PIPS suite of parallel Takustr. 7 14195 Berlin implementations for the stochastic economic optimization [email protected] of complex energy systems and parameter estimation for power grid systems under dynamic transients due to un- foreseen contingencies. We will also discuss HPC simula- Tobias Achterberg tions related to the stochastic optimization of electricity Gurobi GmbH, Germany dispatch in the State of Illinois’ grid under large penetra- [email protected] tion of wind power. Timo Berthold, Stefan Heinz Cosmin G. Petra Fair Issac Europe Ltd, Germany Argonne National Laboratory timoberthold@fico.com, stefanheinz@fico.com Mathematics and Computer Science Division [email protected] Thorsten Koch Zuse Institute Berlin [email protected] MS32 PIPS-SBB: A Parallel Distributed Memory Solver Stefan Vigerske for Stochastic Mixed-Integer Optimization GAMS Software GmbH, Germany [email protected] Stochastic Mixed-Integer Programming (MIP) formula- tions of optimization problems from real-world applica- Michael Winkler tions such as unit commitment can exceed available mem- Gurobi GmbH, Germany ory on a single workstation. To overcome this limitation, [email protected] we have developed PIPS-SBB, a parallel distributed mem- ory Branch-and-Bound based stochastic MIP open-source solver that can leverage HPC architectures. In this talk, MS33 we discuss ongoing work on PIPS-SBB, and illustrate its Improving Resiliency Through Dynamic Resource performance on large unit commitment problem instances Management that cannot be solved to guaranteed optimality using serial implementations. Application recovery techniques for exascale systems target quick recovery from node failures using spare nodes. Static Deepak Rajan resource management assigns a fixed number of spare nodes Lawrence Livermore National Laboratory to an individual job, leading to poor resource utilization [email protected] and possibly also to insufficiency. Our dynamic resource management scheme allows the execution of non-rigid jobs Lluis Munguia and can perform on-the-fly replacement of a failed node. Georgia Institute of Technology Results show better job turnaround times under high fail- [email protected] ure rates when compared to static resource management.

Geoffrey M. Oxberry Suraj Prabhakaran, Felix Wolf Lawrence Livermore National Laboratory Technische Universit¨at Darmstadt [email protected] [email protected], [email protected] darmstadt.de Cosmin G. Petra Argonne National Laboratory Mathematics and Computer Science Division MS33 [email protected] Operation Experiences of the K Computer

Pedro Sotorrio, Thomas Edmunds The K computer is one of the largest supercomputer sys- Lawrence Livermore National Laboratory tems in the world, and it has been operated for more than PP16 Abstracts 101

three years. Since its release, we have solved a significant shown to 256K cores and 16K GPUs. number of operation issues such as fixing and improving the system software components, customizing the opera- Alan Humphrey tion rules according to users needs, and so on. In my talk, Scientific Computing and Imaging Institute I shall talk what we have experienced through the opera- University of Utah tion of the K computer. [email protected]

Atsuya Uno Todd Harman RIKEN Advanced Institute of Computing Science Department of Mechanical Engineering [email protected] University of Utah [email protected] MS33 Daniel Sunderland The Influence of Data Center Power and Energy Sandia National Laboratories Constraints on Future HPC Scheduling Systems [email protected] Reducing the power and energy consumption of HPC data centers is one of the essential areas of exascale research. Martin Berzins Currently, PC data centers are built and provisioned for Scientific Computing and Imaging Institute expected peak power. But the standard workload power University of Utah consumption can be as low as 50% of peak power. There- [email protected] fore, future HPC data centers would then over-provision or right size their data center; meaning that power and cooling is only sufficient for normal operation. Additional MS34 power restrictions might come from the power providers Resolving a Vast Range of Spatial and Dynamical (maximum power change in a specific time frame). Other Scales in Cosmological Simulations factors could include, for example, peak power vs. energy consumption over the year, real time power consumption Solving the equations of ideal hydrodynamics accurately adjustments based on renewable energy generation, and and efficiently is a key feature of state-of-the-art cosmolog- moving from CPU hour based accounting to energy con- ical codes. The AREPO code achieves a high accuracy by sumption. Since scheduling systems are currently the main utilizing a second-order finite volume method on a mov- way to allocate resources for applications, it seems clear ing mesh. Recently, we have developed a second operating that power and energy might need to become a scheduling mode by adopting a discontinuous Galerkin (DG) scheme resource/parameter. This talk will discuss possible data on a Cartesian grid, which can be adaptively refined. DG center policies concerning power and energy and their im- is a higher-order method with a small stencil and thus very pact on scheduling decisions. suitable for parallel systems.

Torsten Wilde Kevin Schaal, Volker Springel LRZ: Leibniz Supercomputing Centre Heidelberg Institute for Theoretical Studies [email protected] [email protected], [email protected]

MS34 MS34 Resiliency in AMR Using Rollback and Redun- A Simple Robust and Accurate a Posteriori Sub- dancy cell Finite Volume Limiter for the Discontinuous Galerkin Method on Space-Time Adaptive Meshes Many resiliency strategies introduce redundancy in the ap- plications. Some others use roll-back and recovery. Struc- We present a class of high order numerical schemes, which tured AMR has built-in redundancy and hierarchical struc- include finite volume and discontinuous Galerkin (DG) fi- ture, both of which can be exploited to build a cost-effective nite element methods for the solution of hyperbolic con- resiliency strategy for AMR. We present an example of a servation laws in multiple space dimensions. For the DG strategy that provides coverage for a variety of faults and scheme, we employ an ADER-WENO a-posteriori sub-cell the errors they manifest in the context of patch-based AMR finite volume limiter, which is activated only in those re- for solving hyperbolic PDE’s. gions where strong oscillations are produced. The resulting schemes have been implemented within space-time adap- Anshu Dubey tive Cartesian meshes and MPI parallelization. Several Lawrence Berkeley National Laboratory relevant numerical tests are shown. [email protected] Olindo Zanotti MS34 Universita degli Studi di Trento [email protected] Adaptive Mesh Refinement in Radiation Problems

Radiative heat transfer is an important mechanism in a Michael Dumbser class of challenging engineering and research problems. A University of Trento direct all-to-all treatment of these problems is prohibitively Italy expensive on large core counts. Here, we show that radi- [email protected] ation calculations can be made to scale within the Uintah framework through a novel combination of reverse Monte Carlo ray tracing techniques combined with adaptive mesh MS35 refinement. Strong scaling results from DOE Titan are MPI+MPI: Using MPI-3 Shared Memory As a 102 PP16 Abstracts

MultiCore Programming System of loop level parallelism will be discussed. This abstrac- tion has been implemented in the OCCA library, including MPI is widely used for parallel programming and enables front-end support for C, Fortran, C++, Julia and backend programmers to achieve good performance. MPI also has support for OpenMP, CUDA, and OpenCL. Algorithms ex- a one-sided or put/get programming model. In MPI-3, pressed in the OCCA Kernel Languages are portable across this one-sided model was greatly expanded, including new the multiple threading backends. Support is provided for atomic read-modify-write operations. In addition, MPI-3 automatic data movement between host and compute de- adds direct access to memory shared between processes. vices. Performance of OCCA based finite-element solvers This talk discusses some of the opportunities and chal- will be presented. lenges in making use of this feature to program multicore processors. Tim Warburton Virginia Tech William D. Gropp [email protected] University of Illinois at Urbana-Champaign Dept of Computer Science David Medina [email protected] Rice University [email protected] MS35 How I Learned to Stop Worrying and Love MPI: MS36 Prospering with an MPI-Only Model in PFLO- Revisiting Asynchronous Linear Solvers: Provable TRAN + PETSc in the Manycore Era Convergence Rate Through Randomization The recent rise of ”manycore” processors has led to spec- Asynchronous methods for solving systems of linear equa- ulation that developers must move from MPI-only mod- tions have been researched since Chazan and Miranker’s els to hybrid programming with MPI+threads to maintain pioneering 1969 paper on chaotic relaxation. The under- scalability on leadership-class machines. However, careful lying idea of asynchronous methods is to avoid processor consideration of communication patterns and judicious use idle time by allowing the processors to continue to make of MPI-3 features presents a viable alternative that retains progress even if not all progress made by other processors a simpler programming model. We explore this alternative has been communicated to them. Historically, the applica- in the open-source, massively parallel environmental flow bility of asynchronous methods for solving linear equations reactive transport code PFLOTRAN and the underlying was limited to certain restricted classes of matrices, such PETSc numerical framework it is built upon. as diagonally dominant matrices. Furthermore, analysis of Richard T. Mills these methods focused on proving convergence in the limit. Oak Ridge National Laboratory Comparison of the asynchronous convergence rate with its and The University of Tennessee, Knoxville synchronous counterpart and its scaling with the number [email protected] of processors were seldom studied, and are still not well understood. In this talk, we discuss a randomized shared- memory asynchronous method for general symmetric pos- MS35 itive definite matrices. We will present a rigorous analysis MPI+X: Opportunities and Limitations for Het- of the convergence rate that proves that it is linear, and erogeneous Systems is close to that of the method’s synchronous counterpart if the processor count is not excessive relative to the size and We investigate the use of node-level parallelization ap- sparsity of the matrix. Our work suggests randomization as proaches such as CUDA, OpenCL, and OpenMP coupled a key paradigm to serve as a foundation for asynchronous with MPI for data communication across nodes and present methods. comparisons with a conventional flat MPI approach. Our results demonstrate what can be considered an abstraction Haim Avron penalty of MPI+X over flat MPI in the strong scaling limit. Business Analytics & Mathematical Sciences IBMT.J.WatsonResearchCenter [email protected] Karl Rupp, Josef Weinbub, Florian Rudolf, Andreas Morhammer, Tibor Grasser Alex Druinsky Institute for Microelectronics Computational Research Division Vienna University of Technology Lawrence Berkeley National Laboratory [email protected], [email protected], [email protected] [email protected], [email protected], [email protected] Anshul Gupta IBM T J Watson Research Center Ansgar Juengel [email protected] Institute for Analysis and Scientific Computing Vienna University of Technology [email protected] MS36 Task and Data Parallelism Based Direct Solvers and Preconditioners in Manycore Architectures MS35 The Occa Abstract Threading Model: Implemen- Sparse direct or incomplete factorizations are important tation and Performance for High-Order Finite- building blocks in scalable solvers, for example, domain Element Computations decomposition and multigrid. Due to irregular data access these are difficult to parallelize, especially on manycore ar- An abstract threading model that simplifies the expression chitectures. We give an overview of our ongoing parallel PP16 Abstracts 103

solver projects that exploit either data or task parallelism, imity through a suitable vertex partitioning. The effects of or both. partitioning are investigated experimentally.

Siva Rajamanickam Rob H. Bisseling Sandia National Laboratories Utrecht University [email protected] Mathematical Institute [email protected] Erik G. Boman, Joshua D. Booth Center for Computing Research Sandia National Labs MS37 [email protected], [email protected] From Cilk to MPI to BSP: The Value of Structured Parallel Programming Kyungjoo Kim Parallel programming languages have still not reached the Sandia National Laboratories same maturity, clarity and stability as sequential lan- [email protected] guages. This situation has a strong negative effect on the industry of parallel software. We explain three important styles of parallel programming: Cilk, MPI, and BSP. We MS36 explain how their basic features balance expressive power Parallel Ilu and Iterative Sparse Triangular Solves with simplicity, scalability and predictable performance. We also investigate complex language constructs such as We consider hybrid preconditioners that are highly parallel types, modules, objects, and patterns. and efficient. Therefore, in a first step, we consider station- ary iterative solvers for triangular linear equations based Gaetan Hains on sparse approximate inverse preconditioners. These it- Huawei Technologies France erative methods are used for solving triangular systems [email protected] resulting from parallel ILU preconditioners. For solving general symmetric or nonsymmetric systems we combine sparse approximative inverses with Gauss-Seidel-like split- MS37 ting methods and the above triangular solvers. This leads Systematic Development of Functional Bulk Syn- to efficient and parallel preconditioners. chronous Parallel Programs Thomas K. Huckle With the current generalisation of parallel architectures Institut fuer Informatik and increasing requirement of parallel computation arises Technische Universitaet Muenchen the concern of applying formal methods, which allow spec- [email protected] ifications of parallel and distributed programs to be pre- cisely stated and the conformance of an implementation to Juergen Braeckle be verified using mathematical techniques. However, the TU Muenchen complexity of parallel programs, compared to sequential [email protected] ones, makes them more error-prone and difficult to ver- ify. This calls for a strongly structured form of parallelism, which should not only be equipped with an abstraction MS36 or model that conceals much of the complexity of parallel Aggregation-Based Algebraic Multigrid on Mas- computation, but also provides a systematic way of devel- sively Parallel Computers oping such parallelism from specifications for practically nontrivial examples. Program calculation is a kind of pro- To solve large linear systems, it is nowadays standard to use gram transformation based on the theory of constructive algebraic multigrid methods, although their parallel imple- algorithms. An efficient program is derived step-by-step mentations often suffer on large scale machines. We con- through a sequence of transformations that preserve the sider an aggregation-based method whose associated soft- meaning and hence the correctness. With suitable data- ware code (AGMG) has several hundreds of users around structures, program calculation can be used for writing par- the world. To improve parallel performance, some criti- allel programs. This talk presents the SyDPaCC system, cal components are redesigned, in a relatively simple yet a set of libraries for the proof assistant Coq. SyDPaCC not straightforward way. Excellent weak scalability results allows to write naive (i.e. inefficient) functional programs are reported on three petascale machines among the most then to transform them into efficient versions that could be powerful today available. automatically parallelised within the framework before be- ing extracted from Coq as code in the functional language Yvan Notay OCaml plus calls to the parallel functional programming Universite Libre de Bruxelles library Bulk Synchronous Parallel ML. [email protected] Fr´ed´eric Loulergue Laboratoire d’Informatique Fondamentale d’Orl´eans MS37 Universit´ed’Orl´eans Scalable Parallel Graph Matching for Big Data [email protected] The BSP style of programming is popular in graph com- putations, especially in the Big Data field, where routinely MS37 graphs of billions of vertices are processed. A fundamental Parallel Data Distribution for a Sparse Eigenvalue problem here is to match vertices in a large edge-weighted Problem Based on the BSP Model graph. We present a bulk-synchronous parallel approxima- tion algorithm based on local edge dominance, with partial We discuss data distribution based on the BSP model to sorting of vertex neighbours and exploitation of data prox- load balance a generalised eigenvalue computation arising 104 PP16 Abstracts

from machine learning. Using CG to solve a linear system [email protected] with a matrix that is the product of three sparse matri- ces, the main operations are sparse matrix-vector prod- ucts. The matrices are partitioned based on hypergraph MS38 partitioning with net amalgamation and the vectors are Online Auto-Tuning for Parallel Solution Methods partitioned using an (extended) local-bound method. Var- for ODEs ious variants of these partitioning methods are analysed and compared. In this talk, we consider auto-tuning of parallel solution methods for initial value problems of systems of ordinary Dirk Roose differential equations (ODEs). The performance of ODE KU Leuven solvers is sensitive to the characteristics of the input data. Dept. of Computer Science Therefore, the best implementation variant cannot be de- [email protected] termined at compile or installation time, and offline auto- tuning is not sufficient. Taking this into account, we devel- Karl Meerbergen oped an online auto-tuning approach, which exploits the K. U. Leuven time-stepping nature of ODE solvers. [email protected] Thomas Rauber Joris Tavernier Universit¨at Bayreuth, Germany KU Leuven [email protected] Dept. of Computer Science [email protected] Natalia Kalinnik, Matthias Korch University Bayreuth Germany MS38 [email protected], [email protected] Efficient Parallel Algorithms for New Unidirec- tional Models of Nonlinear Optics MS38 We are solving the unidirectional Maxwell equation simu- Optimizing the Use of GPU and Intel Xeon Phi lating very fast movement of solitons in laser physics Clusters to Accelerate Stencil Computations

3 ∂zE + DE + ∂τ E =0, MPDATA is a stencil-based kernel of EULAG geophysical model for simulating thermo-fluid flows across a wide range here E is the field component, D describes the linear dis- of scales/physical scenarios. In this work, we investigate persion operator. The problem is solved by applying the methods of optimal adaptation of MPDATA to modern splitting method, the dispersion step is approximated by hetereogeneous clusters with GPU and Intel Xeon Phi ac- using the spectral method, the nonlinear term is approxi- celerators. The achieved performance results are presented mated by using finite difference method. Parallelization of and discussed. In particular, the sustained performance of the numerical algorithm is done by applying the domain 138 Gflop/s is achieved for a single NVIDIA K20x GPU, decomposition method, the algorithm is implemented by which scales up to 16.1 Tflop/s for 128 GPUs. using MPI. Roman Wyrzykowski Raimondas Ciegas Czestochowa University of Technology, Department of Mathematical Modelling Poland University Vilnius, Lithuani [email protected] [email protected] Krzysztof Rojek, Lukasz Szustak Czestochowa University of Technology MS38 Poland Parallelization of Numerical Methods for Time [email protected], [email protected] Evolution Processes, Described by Integro- Differential Equations MS39 We deal with the iterative solution of a large scale viscoelas- Parallel Implementations of Ensemble Kalman Fil- tic problem arising in geophysics when studying Earth dy- ters for Huge Geophysical Models namics under load. In addition to the standard equilib- rium equation, the problem poses the challenge to handle The Data Assimilation Research Testbed (DART) is a com- a constitutive relation between stress and strains that is munity software facility for ensemble Kalman filter data of integral form. An iterative solution method is applied assimilation. The largest DART applications employ en- during each time step. The focus is on efficient and paral- sembles of size O(100) with models with up to one bil- lelizable preconditioning, related also to the methods used lion variables. DART combines ensemble information from to advance in time. structured model grids with observations from unstruc- tured networks. A scalable implementation of DART must Ali Dorostkar provide parallel algorithms that load balance computation Uppsala University while efficiently moving massive amounts of data. Newly Department of Information Technology developed algorithms will be presented along with scaling [email protected] results.

Maya Neytcheva Jeffrey Anderson Department of Information Technology National Center for Atmospheric Research Uppsala University Institute for Math Applied to Geophysics PP16 Abstracts 105

[email protected] dimensional variational data assimilation. The assimila- tion window is divided into multiple subintervals to facili- Helen Kershaw tate parallel cost function and gradient computations. The NCAR scalability of the method is tested by performing data as- [email protected] similation with increasing problem sizes (i.e. the assim- ilation window). The data assimilation is performed for Jonathan Hendricks Lorenz-96 and the shallow water model. We also explore National Center for Atmospheric Research the feasibility of using a hybrid-method (a combination of [email protected] serial and parallel 4D-Vars) to perform the variational data assimilation.

Nancy Collins Adrian Sandu National Center for Atmospheric Research Virginia Polytechnic Institute Boulder, CO and State University [email protected] [email protected]

MS39 MS40 Development of 4D Ensemble Variational Assimi- Parallel Processing for the Square Kilometer Array lation at Mto-France Telescope The 4D Ensemble Variational assimilation (4DEnVar) has Abstract not available received a considerable attention at Numerical Weather Prediction centres in the very recent years. Indeed, this David De Roure formalism allows, in such very large size estimation prob- University of Oxford lems related to Ensemble Kalman Filter, consistent error [email protected] covariances, while being potentially highly parallel on new computer architectures. Such a system is being developed at Mto-France. We will present the different parallelization MS40 opportunities offered by this new system and its overall performance. High Performance Computing for Astronomical In- strumentationonLargeTelescopes Gerald Desroziers Meteo France Exploiting the full capabilities of the next generation of ex- [email protected] tremely large ground based optical telescopes will rely on the technology of adaptive optics. This technique requires multiple wave-front sensing cameras and deformable mir- MS39 rors running at frame rates of 1 KHz. Meeting these re- Parallel 4D-Var and Particle Smoother Data As- quirements is a significant challenge and requires high per- similation for Atmospheric Chemistry and Meteo- formance computational facilities capable of operating with rology very low latency. The proposed solution requires massively parallel systems based on emerging multi-core technology. Parallel implemetation strategies building on two differ- ent assimilation methods are considered: Firstly, the four Nigel Dipper dimensional variational technique applying parallel pre- Centre for Advanced Instrumentation conditioning of the square root of the forecast error co- Department of Physics, Durham University variance matrix by the diffusion paradigm. This will [email protected] be presented in the context of an extended advection- diffusion-reaction equation for atmospheric chemistry mod- elling with aerosols. Secondly, a particle filter/smoother MS40 by an ultra-large (up to O(1000) member) ensemble of Adaptive Optics Simulation for the European Ex- a mesoscale meteorological model will be given, with a tremely Large Telescope on Multicore Architec- two-level approach on its parallel implementation for short tures with Multiple Gpus range prediction, where the role of particle weighting algo- rithms proofs critical. We present a high performance comprehensive implemen- tation of a multi-object adaptive optics (MOAO) simula- Hendrik Elbern tion on multicore architectures with hardware accelerators Rhenish Institute for Environmental Research in the context of computational astronomy. This imple- at the University of Cologne, Cologne, Germany mentation will be used as an operational testbed for sim- [email protected] ulating the design of new instruments for the European Extremely Large Telescope project (E-ELT), one of Eu- Jonas Berndt, Philipp Franke rope’s highest priorities in ground-based astronomy. The Institute for Energy and Climate Research simulation corresponds to a multi-step multi-stage proce- FZ Juelich dure, which is fed, near real-time, by system and turbu- [email protected], [email protected] lence data coming from the telescope environment. Using modern multicore architectures associated with the enor- mous computing power of GPUs, the resulting data-driven MS39 compute-intensive simulation of the entire MOAO applica- Parallel in Time Strong-Constraint 4D-Var tion, composed of the tomographic reconstructor and the observing sequence, is capable of coping with the afore- This talk discusses a scalable parallel-in-time algorithm mentioned real-time challenge and stands as a reference based on augmented Lagrangian approach to perform four implementation for the computational astronomy commu- 106 PP16 Abstracts

nity. unusable leadership computing systems. Previous predic- tions were false. We argue that unacceptable failure rates Hatem Ltaief will not emerge soon, but will remain at acceptable levels King Abdullah University of Science & Technology until the computing community is prepared to handle fail- (KAUST) ures via algorithms and applications. In this presentation [email protected] we provide evidence and arguments for our hypothesis and discuss how resilience research is still essential and timely. Damien Gratadour Observatoire de Paris Michael Heroux [email protected] Sandia National Laboratories [email protected] Ali Charara KAUST [email protected] MS41 Memory Errors in Modern Systems Eric Gendron Observatoire de Paris Several recent publications have shown that hardware [email protected] faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future David E. Keyes systems that contain orders of magnitude more DRAM KAUST and SRAM than found in current memory subsystems. [email protected] These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high- performance computing systems and data centers contain- MS40 ing tens of thousands of nodes. In this talk, I will focus on TheSKA(SquareKilometreArray)andItsCom- learnings about hardware reliability gathered from systems putational Challenges in the field, and use our data to project to whether hard- ware reliability will become a larger problem at exascale. The currently being designed will enable radio as- tronomers to study the Universe at levels of detail and sen- Steven Raasch, Vilas Sridharan sitivity well beyond current instruments. The SKA will be AMD Inc. truly a software driven instrument, and its computational [email protected], [email protected] demands will be unprecedented and will pose challenges that will need to be tackled and overcome. After introduc- ing briefly the aims and overall design of the SKA, this talk MS41 will show how computational power will be needed in sev- Characterizing Faults on Production Systems eral areas such as data transport, storage and analysis, and numerical computation in order to achieve the SKA scien- Understanding and characterizing faults on production sys- tific aims: it would indeed be stretching the boundaries of tems is key to generate realistic fault models in order to computational capabilities. drive flexible fault management across hardware and soft- ware components and to optimize the cost-benefit trade- Stefano Salvini offs among performance, resilience, and power consump- OeRC, the University of Oxford tion. In this talk, we will discuss recent efforts to extract [email protected] fault data from several production systems at LLNL and to store them in a central data repository. In addition to easy access to fault data, this will also enable correlations with MS41 other data (incl. power and performance data) to gain a The Missing High-Performance Computing Fault complete picture and to deduce root causes for observed Model faults.

The path to exascale computing poses several research Martin Schulz challenges. Resilience is one of the most important chal- Lawrence Livermore National Laboratory lenges. This talk will present recent work in developing the [email protected] missing high-performance computing (HPC) fault model. This effort identifies, categorizes and models the fault, er- ror and failure properties of today’s HPC systems. It devel- MS42 ops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current systems and From Titan to Summit: Hybrid Multicore Com- extrapolates this knowledge to exascale HPC systems. puting at ORNL

Christian Engelmann The Titan supercomputer at Oak Ridge National Labora- Oak Ridge National Laboratory tory was one of the first generally available, leadership com- [email protected] puting systems to deploy a hybrid CPU-GPU architecture. In this talk I will describe the Titan system, talk about the lessons we have learned in more than three years of success- MS41 ful operation, and how we are applying those lessons in our Resilience Research is Essential but Failure is Un- planning for the Summit system that will be available to likely our users in 2018.

Once again the high performance computing community Arthur S. Bland is faced with stark predictions of unreliable, failing and Oak Ridge National Laboratory PP16 Abstracts 107

[email protected] ordering, and graph coloring. Part of the Trilinos frame- work, it is our research and development testbed for ge- ometric and graph-based algorithms on emerging high- MS42 performance computing platforms. In this talk, we describe Hazelhen Leading HPC Technology and its Impact Zoltan2 and highlight its partitioning and task placement on Science in Germany and Europe algorithms. We demonstrate its benefits in the HOMME atmospheric climate model and in particle-based methods. The new HLRS flagship system Hazelhen (a Cray XC40) is Europes fastest system in the HPCG benchmarking list. Designed to provide maximum sustainable performance to Mehmet Deveci the users of HLRS ignoring peak performance numbers is The Ohio State University has even climbed to number 8 in the TOP500 list and is [email protected] one of the fastest systems in the world used both by re- search and industry. This talk will present insight into the Karen D. Devine strategy that was at the heart of the HLRS decision. It Sandia National Laboratories will further show how Hazelhen can and has already im- [email protected] pacted research and industry by providing high sustained performance for real world applications. Erik G. Boman Thomas B¨onisch Center for Computing Research Univerity of Stuttgart Sandia National Labs The High Performance Computing Center Stuttgart [email protected] tba Vitus Leung, Siva Rajamanickam, Mark A. Taylor Michael Resch Sandia National Laboratories HLRS, University of Stuttgart, Germany [email protected], [email protected], [email protected] [email protected]

MS42 MS43 Lessons Learned from Development and Operation Lessons Learned in the Development of FASTMath of the K Computer Numerical Software for Current and Next Genera- tion HPC The K computer is one of the most powerful supercomput- ers in the world. The design concept of the K computer The FASTMath team has experimented with different pro- includes not only huge computing power but also usabil- gramming models, communication reducing strategies, and ity, reliability, availability and serviceability. In this talk, adaptive load balancing schemes to handle both inter-node we introduce some functions of the K computer to realize and intra-node concurrency on deep NUMA architectures. the concept in hardware and software aspects and discuss I will give an overview of our experience to date and about lessons learned from the development and operation thoughts on the additional developments needed to achieve experiences of the K computer. performance portability across a range of architectures and software libraries. Fumiyoshi Shoji RIKEN Lori A. Diachin [email protected] Lawrence Livermore National Laboratory [email protected] MS42 Sequoia to Sierra: The LLNL Strategy MS43 Lawrence Livermore National Laboratory (LLNL) has a Scalable Nonlinear Solvers and Preconditioners in long history of leadership in large-scale computing. Our Climate Applications current platform, Sequoia, is a 96 rack BlueGene/Q sys- High resolution, long-term climate simulations are essential tem that is currently number three on the Top 500 list. to understanding regional climate variation within global Our next platform, Sierra, will be a heterogeneous system change on the decade scale. Implicit time stepping methods delivered by a partnership between IBM, NVIDIA and Mel- provide a means to increase efficiency by taking time steps lanox. While the platforms are diverse, they represent a commensurate with the physical processes of interest. We carefully considered strategy. In this talk, we will compare present results for a fully implicit time stepping method and contrast these platforms and the applications that run utilizing FASTMath Trilinos solvers and an approximate on them, and provide a glimpse into LLNL’s strategy be- block factorization preconditioner in the hydrostatic spec- yond Sierra. tral element dynamical core of the Community Atmosphere Bronis R. de Supinski Model (CAM-SE). Livermore Computing Lawrence Livermore National Laboratory David J. Gardner [email protected] Lawrence Livermore National Laboratory [email protected]

MS43 Richard Archibald Partitioning and Task Placement with Zoltan2 Computational Mathematics Group Oak Ridge National Labratory Zoltan2 is a library for partitioning, task placement, data [email protected] 108 PP16 Abstracts

Katherine J. Evans Oak Ridge National Laboratory [email protected] Evrim Acar University of Copenhagen Paul Ullrich [email protected] University of California, Davis [email protected] MS44 Carol S. Woodward Parallel Tensor Compression for Large-Scale Scien- Lawrence Livermore Nat’l Lab tific Data [email protected] Today’s computational simulations often generate more data than can be stored or analyzed. For multi-dimensional Patrick H. Worley data, we can use tensor approximations like the Tucker Oak Ridge National Laboratory decomposition to expose low-rank structure and compress [email protected] the data. This talk presents a distributed-memory paral- lel algorithm for computing the Tucker decomposition for general dense tensors. We show performance results and MS43 demonstrate large compression ratios (with small error) for Parallel Unstructured Mesh Genera- data generated by applications of combustion science. tion/Adaptation Tools Used in Automated Simulation Workflows Woody N. Austin University of Texas - Austin [email protected] This presentation will overview recent advances in the de- velopment of a set of unstructured mesh tools that sup- port all steps in the simulation workflow to be executed Grey Ballard, Tamara G. Kolda in parallel on the same massively parallel computer sys- Sandia National Laboratories tem. These tools to be discussed address the parallel mesh [email protected], [email protected] generation/adaptation, parallel mesh structures and dy- namic load balancing capabilities needed. The in-memory MS44 coupling of these tools with multiple unstructured mesh analysis code will be overviewed. Parallelizing and Scaling Tensor Computations

Mark S. Shephard Tensor decomposition is a powerful data analysis tech- Rensselaer Polytechnic Institute nique for extracting information from large-scale multidi- Scientific Computation Research Center mensional data and is widely used in diverse applications. [email protected] Scaling tensor decomposition algorithms on large comput- ing systems is key for realizing the applicability of these powerful techniques on large datasets. We present our Cameron Smith tool - ENSIGN Tensor Toolbox (ETTB) - that has op- Scientific Computation Research Center timized tensor decompositions through the application of Rensselaer Polytechnic Institute novel techniques for improving the concurrency, data lo- [email protected] cality, load balance, and asynchrony in the execution of tensor computations that are critical for scaling them on Dan A. Ibanez large systems. Rensselaer Polytechnic Institute SCOREC Muthu M. Baskaran, Benoit Meister, Richard Lethin [email protected] Reservoir Labs [email protected], [email protected], Brian Granzow, Onkar Sahni [email protected] Rensselaer Polytechnic Institute [email protected], [email protected] MS44 Kenneth Jansen Parallel Algorithms for Tensor Completion and University of Colorado at Boulder Recommender Systems [email protected] Low-rank tensor completion is concerned with filling in missing entries in multi-dimensional data. It has proven MS44 its versatility in numerous applications, including context- aware recommender systems and multivariate function Coupled Matrix and Tensor Factorizations: Mod- learning. To handle large-scale datasets and applications els, Algorithms, and Computational Issues that feature high dimensions, the development of dis- tributed algorithms is central. In this talk, we will discuss In the era of data science, it no longer suffices to analyze novel, highly scalable algorithms based on a combination a single data set if the goal is to unravel the dynamics of of the canonical polyadic (CP) tensor format with block a complex system. Therefore, data fusion, i.e., extracting coordinate descent methods. Although similar algorithms knowledge through the fusion of complementary data sets, have been proposed for the matrix case, the case of higher is a topic of interest in many fields. We formulate data dimensions gives rise to a number of new challenges and fusion as a coupled matrix and tensor factorization prob- requires a different paradigm for data distribution. The lem, and discuss extensions of the model as well as various convergence of our algorithms is analyzed and numerical algorithmic approaches for fitting such data fusion models. experiments illustrate their performance on distributed- PP16 Abstracts 109

memory architectures for tensors from a range of different over conventional BSP model and enjoy provable guaran- applications. tees.

Lars Karlsson Eric Xing Ume˚a University, Dept. of Computing Science School of Computer Science [email protected] CMU [email protected] Daniel Kressner EPFL Lausanne MS45 Mathicse daniel.kressner@epfl.ch Panel: The Future of BSP Computing Bulk Synchronous Parallel has had a large influence on par- Andre Uschmajew allel computing, both in classical high-performance com- University of Bonn puting and, more recently, in Big Data computing. Chal- uschmajew.ins.uni-bonn.de lenges to BSP are increasing latency costs, non-uniform memory hierarchies and network interconnects, the ten- sion between structured/convenient programming versus MS45 high performance, and the cost of fault-tolerance and tail- Multi-ML: Functional Programming of Multi-BSP tolerance. Can BSP be relaxed to unify opposing points Algorithms of view, or will the future hold tens or even hundreds of domain-specific BSP-style frameworks? The Multi-BSP model is an extension of the well known BSP bridging model. It brings a tree based view of nested Albert-Jan N. Yzelman components to represent hierarchical architectures. In the KU Leuven past, we designed BSML for functional programming BSP Intel ExaScience Lab (Intel Labs Europe) algorithms. We now propose the Multi-ML language as [email protected] an extension of BSML for programming Multi-BSP algo- rithms. MS46 Victor Allombert, Julien Tesson Linear and Non-Linear Domain Decomposition Laboratoire d’Algorithmique, Complexit´e et Logique Methods for Coupled Problems Universit´eParis-estCr´eteil [email protected], [email protected] In this talk, we consider fast and parallel solution meth- ods for linear and non-linear systems of equations, which Fr´ed´eric Gava arise from the discretization of coupled PDEs. We discuss Laboratoire d’Algorithmique, Complexit´e et Logique multigrid and domain decomposition based methods for Universit´edeParisEst biomechanical applications, including the coupled electro- [email protected] mechanical simulation of the human heart, as well as for fluid-structure interaction. We focus in particular on the volume coupling of different PDEs, for which we propose MS45 an approach based on variational transfer between non- New Directions in BSP Computing matching meshes.

BSP now powers parallel computing, big data analytics, Rolf Krause graph computing and machine learning at many leading- Institute of Computational Science edge companies. It is used at web companies for their anal- University of Lugano ysis of big data from massive graphs and networks with [email protected] billions of nodes and trillions of edges. BSP is also at the heart of new projects such as Petuum, for machine learn- Erich Foster ing with billions of parameters. At Huawei Paris Research Institute of Computational Science Center we are developing new BSP algorithms and software USI Lugano platforms for Telecomms and IT. [email protected]

Bill McColl Marco Favino Huawei Technologies France Institute of Computational Science [email protected] University of Lugano [email protected]

MS45 Patrick Zulian Parallelization Strategies for Distributed Machine Institute of Computational Science Learning USI Lugano [email protected] A key issue of both theoretical and engineering interest in effectively building distributed systems for large-scale machine learning (ML) programs underlying a wide spec- MS46 trum of applications, is the bridging models which spec- Improving Strong Scalability of Linear Solver for ify how parallel workers should be coordinated. I discuss Reservoir Simulation the unique challenges and opportunities in designing these models for ML, and present a stale synchronous parallel Reservoir simulation plays an important role in defining model that can speed up parallel ML orders of magnitudes and optimizing the production strategies for oil and gas 110 PP16 Abstracts

reservoirs. As mesh sizes, complexity and nonlinearity of KAUST the models grow highly parallel and scalable linear solvers [email protected] are of paramount importance for realistic design and op- timization workflows. The industry standard Constrained Pressure Residual (CPR) preconditioner consists of various MS47 components that are not well tailored for strong scalabil- Title Not Available ity. In this talk we will discuss the fundamental underlying problems and review several solution methods on how to Abstract not available overcome this performance barrier. Mike Fisher Tom B. Jonsthovel ECMWF Schlumberger mike.fi[email protected] [email protected]

MS47 MS46 From Data to Prediction with Quantified Uncer- Solving Large Systems for the Elastodynamics tainties: Scalable Parallel Algorithms and Applica- Equation to Perform Acoustic Imaging tions to the Flow of Ice Sheets

The field of numerical modeling of sound propagation in the Given a model containing uncertain parameters, (noisy) ocean, non destructive testing of materials, and of seismic observational data, and a prediction quantity of interest, waves in geological media following earthquakes has been we seek to (1) infer the model parameters from the data the subject of increasing interest of the underwater acous- via the model, (2) quantify the uncertainty in the inferred tics and geophysics communities for many years. Comput- parameters, and (3) propagate the resulting uncertain pa- ing an accurate solution, taking into account a complex rameters through the model to issue predictions with quan- environment, is still the subject of active research focus- tified uncertainties. We present scalable parallel algorithms ing on simulations in the time domain for forward and also for this data-to-prediction problem, in the context of mod- adjoint/inverse tomography problems in three dimensions eling the flow of the Antarctic ice sheet. We show that the (3D). Nowadays, with the stunning increase in computa- work, measured in number of forward (and adjoint) model tional power, the full waveform numerical simulation of solves, is independent of the state dimension, parameter complex wave propagation problems in the time domain is dimension, and data dimension. becoming possible. The purpose of the talk is to present some illustrative examples that emphasize the high poten- Omar Ghattas tial of a high-order finite-element method for forward and The University of Texas at Austin adjoint/inverse problems in geophysics as well as in un- [email protected] derwater acoustic and non destructive testing applications. We will thus show examples of numerical modeling of seis- Tobin Isaac mic wave propagation at very high resolution in large ge- The University of Chicago ological models, and summarize recent results for inverse [email protected] problems in this area. The spectral-element method, be- cause of its ability to handle coupled fluid-solid regions, is Noemi Petra also a natural candidate for performing wave propagation University of California, Merced simulations in underwater acoustics in the time domain. [email protected] Moreover, it is also known for being accurate to model surface and interface waves. In this presentation, we will show configurations in which such types of waves, called Georg Stadler Stoneley-Scholte waves, can be generated. Courant Institute for Mathematical Sciences New York University Dimitri Komatitsch,PaulCristini [email protected] Laboratory of Mechanics and Acoustics CNRS, Marseille, France [email protected], [email protected] MS47 Using Adjoint Based Numerical Error Estimates and Optimized Data Flows to Construct Surrogate MS46 Models PCBDDC: Robust BDDC Preconditioners in PETSc Surrogate based uncertainty quantification(UQ) using a suitably chosen ensemble of simulations is the method for A novel class of preconditioners based on Balancing Do- choice for most large scale simulations with UQ. We have main Decomposition by Constraints (BDDC) methods is developed novel methods and workflows to inform the con- available in the PETSc library. Such methods provide so- struction of these surrogates using adjoint based methods phisticated and almost black-box Domain Decomposition for error estimation in a scalable and parallel way while preconditioners of the non-overlapping type which are able managing the large data flows. to deal with elliptic PDEs with coefficients’ heterogeneities and discretized by means of non-standard finite element Abani Patra discretizations. The talk introduces the BDDC algorithm SUNY Buffalo and its current implementation in PETSc. Large scale nu- [email protected]ffalo.edu merical results for different type of PDEs and finite ele- ments discretizations prove the effectiveness of the method. MS47 Stefano Zampini Parallel Aspects of Diffusion-Based Correlation PP16 Abstracts 111

Operators in Ensemble-Variational Data Assimila- of the latest SPH developments and its applications will tion be presented. The challenges of the SPH real engineering simulations and a vision for the future are discussed. Ensemble-variational data assimilation requires multiple evaluations of large correlation matrix-vector products. Benedict Rogers This presentation describes methods to define such prod- School of Mechanical, Aerospace and Civil Engineering ucts using implicit diffusion-based operators that can be University of Manchester parallelized in the pseudo-time dimension. A solution [email protected] method based on the Chebyshev iteration is proposed: it does not require global communications and can be em- ployed to ensure the correlation matrix preserves its SPD MS48 attribute for an arbitrary convergence threshold. Examples An Hybrid OpenMP-MPI Parallelization for Vari- from a global ocean assimilation system are given. able Resolution Adaptive SPH with Fully Object- Oriented SPHysics Anthony Weaver, Jean Tshimanga CERFACS Starting from restructuring the original open-source [email protected], [email protected] SPHysics, a new SPH programming framework has been developed employing the full functionality of object- oriented Fortran (OOF) to create a new code that is both MS48 powerful, highly adaptable to the computing architectures Parallel Pic Simulation Based on Fast Electromag- of emerging technology. The novelties of this approach in- netic Field Solver clude: (i) Highly modular, program interface (API), (ii) A novel way to generate the neighbour list for particle inter- In this talk, a parallel symplectic PIC program for solving actions is proposed, (iii) Hybrid OpenMP-Message Passing non-relativistic Vlasov-Maxwell system on unstructured Interface (MPI) parallelization. grids for realistic applications will be introduced. The implicit scheme is used in our code. We solve Maxwell Ranato Vacondio equations on tetrahedral mesh by finite element method. University of Parma An efficient and scalable parallel algebraic solver is used [email protected] for solving the discrete system. Numerical results will be given to show that our method and parallel program is robust and scalable. MS49 Tao Cui Fault-Tolerant Multigrid Algorithms Based on Er- LSEC, the Institute of Computational Mathematics ror Confinement and Full Approximation Chinese Academy of Sciences We introduce a novel fault-tolerance scheme to detect and [email protected] repair soft faults in multigrid solvers, which essentially combines checksums for linear algebra with multigrid- MS48 specific algorithm-based fault-tolerance. To improve ef- ficiency we split the algorithm into two parts. For the Parallel Application of Monte Carlo Particle Sim- smoothing stage, we exploit properties of the full approx- ulation Code JMCT in Nuclear Reator Analysis imation scheme to check and repair the results. Our re- Massive data for full-core Monte Carlo simulation leads to sulting method significally increases the fault-tolerance of memory overload for a single core processor. Tree-based the multigrid algorithm and has only a small impact in the domain decomposition algorithm for combinatorial geom- non-faulty case. etry is developed on JCOGIN. An efficient, robust asyn- chronous transport algorithm for domain decomposition Mirco Altenbernd Monte Carlo is introduced. Combination of domain decom- Universit¨at Stuttgart position and domain replication is designed and provided [email protected] to improve parallel efficiency. Results of two full-core mod- els are presented using the Monte Carlo particle transport MS49 code JMCT, which is developed on JCOGIN. Soft Error Detection for Finite Difference Codes on Gang Li Regular Grids Institute of Applied Physics and Computational Mathematics, We investigate a strategy to detect soft errors in iterative Beijing, 100094, P.R. China or time stepping methods where the iterates live on regu- li [email protected] lar grids. The principle idea is to complement iterates by redundant information that allows to determine if memory copies or messages have arrived correctly at the destina- MS48 tion. We test our approach by means of the 2D heat equa- Applications of Smoothed Particle Hydrodynam- tion. Errors are injected asynchronously into the memory ics (SPH) for Real Engineering Problems: Particle or caches for testing. Parallelisation Requirements and a Vision for the Future Peter Arbenz ETH Zurich Recent and likely future HPC developments of SPH are Computer Science Department making massively parallel computations viable; conse- [email protected] quently there is a real possibility of achieving a step change in the level of complexity and realism in the SPH applica- Manuel Kohler tions with huge impact on industry and society. A review ETH Zurich Computer Science Department 112 PP16 Abstracts

[email protected] Duke University [email protected]

MS49 Bert J. Debusschere Resilience for Multigrid at the Extreme Scale Energy Transportation Center Sandia National Laboratories, Livermore CA We consider scalable geometric multigrid schemes to ob- [email protected] tain fast and fault-robust solvers. The lost unknowns in the faulty region are re-computed with local multigrid cy- cles. Especially effective are strategies in which the recov- MS50 ery in the faulty region is executed in parallel with global Progress in Performance Portability of the XGC iterations and when the local recovery is additionally ac- Plasma Microturbulence Codes: Heterogeneous celerated. This results in an asynchronous multigrid iter- and Manycore Node Architectures ation that can fully compensate faults. Excellent parallel performance is demonstrated for systems up to a trillion The XGC plasma microturbulence particle-in-cell code unknowns. models transport in the edge region of a tokamak fusion Ulrich J. Ruede reactor. XGC scales to over 18,000 nodes/GPUs on Cray University of Erlangen-Nuremberg XK7 Titan and over 64000 cores on Blue Gene/Q Mira. We Department of Computer Science (Simulation) present our progress and lessons learned in achieving per- [email protected] formance portability to heterogeneous GPU and manycore Xeon Phi architectures by using compiler directives, such as nested OpenMP parallelization for manycores, OpenMP Markus Huber SIMD vectorization, and OpenACC for porting to Nvidia TU Muenchen GPUs. [email protected] Eduardo F. D’Azevedo Bjoern Gmeiner Oak Ridge National Laboratory University Erlangen-Nuremberg Computer Science and Mathematics Division [email protected] [email protected]

Barbara Wohlmuth Mark Adams Technical University of Munich Lawrence Berkeley Laboratory [email protected] [email protected]

Choong-Seock Chang, Stephane Ethier, Robert Hager, MS49 Seung-Hoe Ku, Jianying Lang A Soft and Hard Faults Resilient Solver for 2D El- Princeton Plasma Physics Laboratory liptic PDEs Via Server-Client-Based Implementa- [email protected], [email protected], [email protected], tion [email protected], [email protected] We present a new approach to solve 2D elliptic PDEs that incorporates resiliency to soft and hard faults. Resiliency Mark Shepard to hard faults is obtained at the implementation level by Scientific Computing Research Center leveraging a fault-tolerant MPI implementation (ULFM- Rensselaer Polytechnic Institute MPI) within a task-based server-client model. Soft faults [email protected] are handled at the algorithm level by recasting the PDE as a sampling problem followed by a resilient regression-based Patrick H. Worley manipulation of the data to obtain a new solution. Oak Ridge National Laboratory [email protected] Francesco Rizzi, Karla Morris Sandia National Laboratories Eisung Yoon [email protected], [email protected] Rensselaer Polytechnic Institute [email protected] Paul Mycek Duke University [email protected] MS50 Research Applications and Analysis of a Fully Ed- Olivier P. Le Matre dying Virtual Ocean. LIMSI-CNRS [email protected] We will describe the mathematical and computational tech- niques used to realize a global kilometer scale ocean model configuration that includes representation of sea-ice and Andres Contreras tidal excitation, and spans scales from planetary gyres to Duke University internal tides. Computationally we exploit high-degrees of [email protected] parallelism in both numerical evaluation and in recording model state to persistent disk storage. Using this capabil- Khachik Sargsyan, Cosmin Safta ity we create a first of its kind multi-petabyte trajectory Sandia National Laboratories archive; sampled at unprecedented spatial and temporal [email protected], [email protected] frequency.

Omar M. Knio Chris Hill PP16 Abstracts 113

Earth, Atmospheric and Physical Sciences Rensselaer Polytechnic Institute M.I.T. SCOREC [email protected] [email protected]

MS50 MS51 Challenges in Adjoint-Based Aerodynamic Design Scaling Unstructured Mesh Computations for Unsteady Flows An unstructured grid, stabilized finite element flow solver An overview of the use of adjoint-based methods for a (PHASTA) has proven to be a very effective vehicle for de- broad range of large-scale aerodynamic and multidisci- velopment, testing, and scaling demonstrations for a wide plinary design applications is presented. The general mo- variety of parallel software components. These include un- tivation for the approach is outlined, along with a sum- structured grid meshing, adaptivity, load balancing, linear mary of the implementation and experiences gained over equation solution, and in situ data analytics. In this talk, two decades of concerted research effort. Specific issues as- we discuss how each of these components are integrated to sociated with applying these methods to unsteady simula- keep unstructured grid simulations on a productive path tions are highlighted, including next-generation computing to exascale partial differential equation simulation. challenges currently faced by the community.

Eric Nielsen Kenneth Jansen NASA Langley University of Colorado at Boulder [email protected] [email protected] Jed Brown MS50 Mathematics and Computer Science Division Extreme-Scale Genome Assembly Argonne National Laboratory and CU Boulder [email protected] The rate and cost of genome sequencing continues to im- prove at an exponential rate, making their analysis a signif- Michel Rasquin, Benjamin Matthews icant computational challenge. In this talk we present Hip- University of Colorado Boulder Mer, a high performance de novo assembler, which is a crit- [email protected], ical first step in many genome analyses. Recent work has [email protected] shown that efficient distributed memory algorithms using the UPC programming model allows unprecedented scaling Cameron Smith to thousands of cores, reducing the human genome assem- Scientific Computation Research Center bly time to only a few minutes. Rensselaer Polytechnic Institute [email protected] Leonid Oliker, Aydin Buluc, Rob Egan Lawrence Berkeley National Laboratory [email protected], [email protected], [email protected] Dan A. Ibanez Rensselaer Polytechnic Institute SCOREC Evangelos Georganas [email protected] EECS UC Berkeley [email protected] Mark S. Shephard Rensselaer Polytechnic Institute Scientific Computation Research Center Steven Hofmeyr [email protected] Lawrence Berkeley National Laboratory [email protected] MS51 Daniel Rokhsar, Katherine Yelick Lawrence Berkeley National Laboratory Comparing Global Link Arrangements for Dragon- University of California at Berkeley fly Networks [email protected], [email protected] High-performance computing systems are shifting away from traditional interconnect topologies to exploit new MS51 technologies and to reduce interconnect power consump- tion. The Dragonfly topology is one promising candidate Unstructured Mesh Adaptation for MPI and Ac- for new systems, with several variations already in produc- celerators tion. It is hierarchical, with local links forming groups and Conformal simplex meshes can be adapted to improve global links joining the groups. At each level, the inter- discretization error and enable Lagrangian simulations of connect is a clique. This presentation shows that the in- deforming geometry problems. We present new algo- tergroup links can be made in meaningfully different ways. rithms that perform mesh modifications in batches, allow- ing OpenMP and CUDA parallelism inside supercomputer nodes and making good use of MPI 3.0 neighborhood col- Vitus Leung lectives between supercomputer nodes. The mesh is stored Sandia National Laboratories in a simple array-based structure, allowing low-overhead [email protected] coupling with physics codes inside device memory. David Bunde Dan A. Ibanez Knox College 114 PP16 Abstracts

[email protected] trary dimension. Whereas conventional implementations of TTM rely on explicitly converting the input tensor operand into a matrixin order to be able to use any available and MS51 fast general matrix-matrix multiply (GEMM) implemen- An Extension of Hypres Structured and Semi- tation our frameworks strategy is to carry out the TTM Structured Matrix Classes in-place, avoiding this copy. As the resulting implementa- tions expose tuning parameters, this paper also describes The hypre software library provides parallel high perfor- a heuristic empirical model for selecting an optimal con- mance preconditioners and solvers. One of its attrac- figuration based on the TTMs inputs. When compared tive features is the provision of a structured and a semi- to widely used single-node TTM implementations that structured interface. We describe a new structured-grid are available in the Tensor Toolbox and Cyclops Tensor matrix class that supports rectangular matrices and con- Framework (CTF), InTensLi’s in-place and input-adaptive stant coefficients and a semi-structured-grid matrix class TTM implementations achieve 4 and 13 speedups, showing that builds on the new class. We anticipate that an effi- GEMM-like performance on a variety of input sizes. cient implementation of these new classes will lead to bet- ter performance on current and future architectures than Jiajia Li hypres current matrix classes. College of Computing Georgia Institute of Technology Ulrike M. Yang [email protected] Lawrence Livermore National Laboratory [email protected] Casey Battaglino, Ioakeim Perros, Jimeng Sun, Richard Vuduc Robert D. Falgout Georgia Institute of Technology Center for Applied Scientific Computing [email protected], [email protected], ji- Lawrence Livermore National Laboratory [email protected], [email protected]. edu [email protected]

MS52 MS52 High Performance Parallel Sparse Tensor Decom- Efficient Factorization with Compressed Sparse positions Tensors We investigate an efficient parallelization of the itera- Computing the Canonical Polyadic Decomposition is bot- tive algorithms for computing CANDECOMP/PARAFAC tlenecked by multiplying a sparse tensor by several dense (CP) and Tucker sparse tensor decompositions on shared matrices (MTTKRP). MTTKRP algorithms either use a and distributed memory systems. The key operations of compressed tensor specific to each dimension of the data each iteration of these algorithms are matricized tensor or a single uncompressed tensor at the cost of additional times Khatri-Rao product (MTTKRP) and n-mode tensor- operations. We introduce the compressed sparse fiber data matrix product. These operations amount to element-wise structure and several parallel MTTKRP algorithms that vector multiplication or vector outer product, followed by a need only a single compressed tensor. We reduce memory reduction, depending on the sparsity of the tensor. We first by 58% while achieving 81% of the parallel performance. investigate a fine and a coarse-grain task definition for these operations, and propose methods to achieve load balance and reduce the communication requirements for their par- Shaden Smith allel execution on distributed memory systems. Then we Dept. Computer Science investigate efficient parallelization issues on shared mem- Univ. Minnesota ory systems. Based on these investigations, we implement [email protected] the alternating least squares method for both decomposi- tions in a software library. The talk will contain results for George Karypis both decomposition schemes on large computing systems University of Minnesota / AHPCRC with real-world sparse tensors. We have a full-featured [email protected] implementation for the CP-decomposition scheme and de- velopment for the Tucker decomposition in progress. MS52 Oguz Kaya Inria and ENS Lyon Low Rank Bilinear Algorithms for Symmetric Ten- France sor Contractions [email protected] We demonstrate algebraic reorganizations that reduce by Bora Ucar two the number of scalar multiplications (bilinear algo- LIP, ENS-Lyon, France. rithm rank) needed for matrix-vector multiplication with [email protected] a symmetric matrix and the rank-two symmetric update. We then introduce symmetry-preserving algorithms, which lower the number of scalar multiplications needed for most MS52 types of symmetric tensor contractions. By being nested An Input-Adaptive and In-Place Approach to with matrix multiplication, these algorithms reduce the to- Dense Tensor-Times-Matrix Multiply tal number of operations needed for squaring matrices and contracting partially-symmetric tensors. This talk describes a novel framework, called InTensLi (“intensely”), for producing fast single-node implementa- Edgar Solomonik tions of dense tensor-times-matrix multiply (TTM) of arbi- ETH Zurich PP16 Abstracts 115

[email protected] [email protected]

MS53 MS53 BigDFT: Flexible DFT Approach to Large Systems Wavelets as a Basis Set for Electronic Structure Using Adaptive and Localized Basis Functions Calculations: The BigDFT Project

Density functional theory is an ideal method for the study Dauchechies wavelets are the only smooth orthogonal ba- of a large range of systems due to its balance between ac- sis set with compact support. These properties make them curacy and computational efficiency, however the applica- ideally suited as a basis set for electronic structure calcu- bility of standard implementations is limited to systems of lations, combining advantages from plane wave and Gaus- around 1000 atoms or smaller due to their cubic scaling sian basis sets. I will explain the basic principles of wavelet with respect to the number of atoms. In order to overcome based multi-resolution analysis and the wavelet based algo- these limitations, a number of methods have been devel- rithms used in our BigDFT electronic structure code both oped which use localized orbitals to take advantage of the for the solution the Schroedinger’s and Poisson’s equation. nearsightedness principle and reformulate the problem in a In addition I will present methods implemented in BigDFT manner which can be made to scale linearly with the num- to explore the potential energy surface ber of atoms, paving the way for calculations on systems of 10,000 atoms or more. We have recently implemented Stefan Goedecker such a method in the BigDFT code (http://bigdft.org), Univ Basel, Switzerland which uses an underlying wavelet basis set that is ideal [email protected] for linear-scaling calculations due to it exhibiting both or- thogonality and compact support. This use of wavelets also has the distinct advantage that calculations can be MS53 performed in a choice of boundary conditions - either free, DGDFT: A Massively Parallel Electronic Structure wire, surface or periodic. This distinguishes it from other Calculation Tool linear-scaling codes, which are generally restricted to peri- odic boundary conditions, and thus allows calculations on We describe a massively parallel implementation of the re- charged systems using open boundary conditions without cently developed discontinuous Galerkin density functional the need for adding a compensating background charge. theory (DGDFT) method, for efficient large-scale Kohn- We will first outline the method and implementation within Sham DFT based electronic structure calculations. The BigDFT, after which we will present results showing both DGDFT method uses adaptive local basis (ALB) func- the accuracy and the flexibility of the approach, includ- tions generated on-the-fly during the self-consistent field ing a demonstration of the improved scaling compared to (SCF) iteration to represent the solution to the Kohn-Sham standard BigDFT calculations. equations. DGDFT can achieve 80% parallel efficiency on 128,000 high performance computing cores when it is used Thierry Deutsch to study the electronic structure of 2D phosphorene sys- Institute of Nanosciences and Cryogeny tems with 3,500-14,000 atoms. This high parallel efficiency CEA results from a two-level parallelization scheme that we will [email protected] describe in detail.

Chao Yang MS53 Lawrence Berkeley National Lab Design and Implementation of the PEXSI Solver [email protected] Interface in SIESTA

We have implemented in the SIESTA code an interface to MS54 a new electronic structure solver based on the Pole Expan- sion and Selected Inversion (PEXSI) technique, whose com- Preconditioning of High Order Edge Elements plexity scales at most quadratically with size and can be Type Discretizations for the Time-Harmonic applied with good accuracy to all kinds of systems, includ- Maxwells Equations ing metals. We describe the design principles behind the Abstract not available interface and the heuristics employed to integrate PEXSI efficiently in the SIESTA workflow to enable large-scale Victorita Dolean simulations. University of Strathclyde Alberto Garcia [email protected] ICMAB [email protected] MS54 Block Iterative Methods and Recycling for Im- Lin Lin proved Scalability of Linear Solvers. University of California, Berkeley Lawrence Berkeley National Laboratory On the one hand, block iterative methods may be use- [email protected] ful when solving systems with multiple right-hand sides, for example when dealing with time-harmonic Maxwells Georg Huhs equations. They indeed offer higher arithmetic intensity, Barcelona Supercomputing Center and typically decrease the number of iterations of Krylov [email protected] solvers. On the other hand, recycling also provides a way to decrease the time to solution of successive linear solves, Chao Yang when all right-hand sides are not available at the same Lawrence Berkeley National Lab time. I will present some results using both approaches, as 116 PP16 Abstracts

well as their implementation inside the open-source frame- prototype for brain imaging. work HPDDM (https://github.com/hpddm/hpddm). Pierre-Henri Tournier Pierre Jolivet University Paris VI CNRS [email protected] [email protected] Pierre Jolivet Pierre-Henri Tournier CNRS University Paris VI [email protected] [email protected] Fr´ed´eric Nataf CNRS and UPMC MS54 [email protected] Reliable Domain Decomposition Methods with Multiple Search MS55 Domain Decomposition methods are a family of solvers Scaling Up Autotuning tailored to very large linear systems that require parallel Autotuning has been successful at the node level. As we computers. They proceed by splitting the computational move towards larger machines, a challenge will be to scale domain into subdomains and then approximating the in- up autotuning systems to handle full applications at the verse of the original problem with local inverses coming scale of tens to hundreds of thousands of nodes. In this from the subdomains. I will present some classical domain talk, I will outline some of the challenges of running at decomposition methods and show that for realistic simu- scale and describe how we have been working to enhance lations (with heterogeneous materials for instance) conver- the Active Harmony system to handle these challenges. gence usually becomes very slow. Then I will explain how this can be fixed by injecting more information into the Jeffrey Hollingsworth solver. In particular I will show how using multiple search University of Maryland directions within the conjugate gradient algorithm makes [email protected] the algorithm more reliable. Efficiency is also taken into account since our solvers are adaptive. MS55 Nicole Spillane Auto-Tuning of Hierarchical Computations with Centre de Math´ematiques Appliqu´ees ppOpen-AT Ecole Polytechnique [email protected] We are now developing ppOpen-AT, which is a directive- base Auto-tuning (AT) language to specify fundamental Christophe Bovet AT functions, i.e., varying values of parameters, loop trans- LMT, ENS Cachan, CNRS, Universit´e Paris-Saclay, formations, and code selection. Considering with expected 94235 Cachan hardware of Post Moores era, we focus on optimization [email protected] for computations with deep hierarchy of 3D memory stack. ppOpen-AT provides code selection to optimize code with Pierre Gosselet respect to layers of the memory. Performance evaluation LMT, ENS Cachan of AT with a code of FDM will be shown by utilizing the [email protected] Xeon Phi. Takahiro Katagiri Daniel J. Rixen The University of Tokyo TU Munich [email protected] [email protected] Masaharu Matsumoto Fran¸cois-Xavier Roux Information Technology Center ONERA The University of Tokyo Universit´e Pierre et Marie Curie [email protected] [email protected] Satoshi Ohshima The University of Tokyo MS54 Information Technology Center Parallel Preconditioners for Problems Arising from [email protected] Whole-Microwave System Modelling for Brain Imaging MS55 Brain imaging techniques using microwave tomography in- Auto-Tuning for the SVD of Bidiagonals volve the solution of an inverse problem. Thus, solving the electromagnetic forward problem efficiently in the inversion It is well known that the SVD of a matrix A, A = USV T , loop is mandatory. In this work, we use parallel precon- can be obtained from the eigenpairs of the matrix CV = T T ditioners based on overlapping Schwarz domain decompo- A A or CU = AA . Alternatively, the SVD can be sition methods to design fast and robust solvers for time- obtained from the eigenpairs of the augmented matrix T harmonic Maxwell´ s equations discretized with a N´ed´elec CUV =[0A; A 0]. This formulation is particularly at- edge finite element method. We present numerical results tractive if A is a bidiagonal matrix and if only a partial for a whole-system modeling of a microwave measurement SVD (a subset of singular values and vectors) is desired. PP16 Abstracts 117

This presentation focuses on CUV in the context of bidi- is simple to use, yet does not compromise performance agonal matrices, and discusses how tridiagonal eigensolvers can be applied and tuned in that context. The presentation Achim Gsell will also discuss opportunities for parallelism, accuracy and PSI implementation issues. [email protected]

Osni A. Marques Lawrence Berkeley National Laboratory MS56 Berkeley, CA Towards a Highly Scalable Particles Method Based [email protected] Toolkit: Optimization for Real Applications A large number of industrial physical problems can be sim- MS55 ulated using particle-based methods. In this talk, we will present a toolkit for solving a complex, highly nonlinear User-Defined Code Transformation for High Per- and distorted flow using Incompressible Smoothed Particle formance Portability Hydrodynamics in very large scale. This toolkit is imple- Our research project is developing a code transformation mented cache friendly and specially designed to preserve framework, Xevolver, to allow users to define their own the data locality. With unstructured communication mech- code transformation rules and also to customize typical anism in parallel, this toolkit can also be used with the rules for individual applications and systems. Use of user- other type of particle methods, such as PIC. defined code transformations is effective to achieve a high Xiaohu Guo performance while avoiding specialization of an HPC ap- Scientific Computing Department plication code for a particular computing system. As a Science and Technology Facilities Council result, without overcomplicating the code, we can achieve [email protected] high performance portability across diverse HPC systems.

Hiroyuki Takizawa, Shoichi Hirasawa MS56 Tohoku University/JST CREST Weakly Compressible Smooth Particle Hydrody- [email protected], [email protected] namics Performance on the Intel Xeon Phi

Hiroaki Kobayashi DualSPHysics is a Smoothed Particle Hydrodynamics code Tohoku University, Japan to study free-surface flow phenomena where Eulerian meth- [email protected] ods can be difficult to apply, such as waves or impact of dam-breaks on off-shore structures.The IPCC at the Hartree Centre has been porting and optimising Dual- MS56 SPHysics for native execution on the Xeon Phi processor. Parallelization of Smoothed Particle Hydrodynam- Threading, vectorization and arithmetic efficiency was in- ics for Engineering Applications with the Hetero- creased resulting in speed ups for both the Xeon CPUs and geneous Architectures the Xeon Phi.

Smoothed Particle Hydrodynamics (SPH) method is a Sergi-Enric Siso powerful technique used to simulate complex free-surface The Hartree Centre flows. However, one of the main drawbacks of the method Science and Technology Facilities Council is its high computational cost and the large number of par- [email protected] ticles needed in 3D simulations. High Performance Com- puting (HPC) becomes essential to accelerate SPH codes MS57 and perform large simulations. In this study, paralleliza- tion using MPI and GPUs (Graphics Processing Units) is Resilience for Asynchronous Runtime Systems and applied for executions on heterogeneous clusters of CPUs Algorithms and GPUs. Parallel asynchronous algorithms have shown to be easy to couple with different kinds of low-overhead forward recov- Jose M. Dom´ınguez Alonso ery techniques based on tasks. To manage the asynchrony EPHYSLAB Environmental Physics Laboratory of the parallel workload, a runtime system is typically used. Universidade de Vigo Since its activity may also be sensitive to errors, some run- [email protected] time specific resilience techniques are required under cer- tain circumstances. This talk will provide a vulnerability MS56 comparison between runtime and algorithmic data struc- tures and also suggest low overhead resilience techniques H5hut: a High Performance Hdf5 Toolkit for Par- for them. ticle Simulations Marc Casas Particle-based simulations running on large high- Barcelona Supercomputing Center performance computing systems can generate an enormous [email protected] amount of data. Post-processing, effectively managing the data on disk and interfacing it with analysis and visu- alization tools can be challenging, especially for domain MS57 scientists without expertise in high-performance I/O and Experiments with FSEFI: The Fine-Grained Soft data management. We present the H5hut library, an Error Fault Injector implementation of several data models for particle-based simulations that encapsulates the complexity of HDF5 and In this talk we will discuss FSEFI, the Fine-Grained Soft 118 PP16 Abstracts

Error Fault Injector and experiments we have conducted [email protected] with FSEFI to evaluate resiliency, fault-tolerance, and vul- nerability of applications and application kernels. FSEFI is a processor emulator-based fault injector where full appli- MS58 cations, operating system kernels, drivers, and middleware Silicon Photonics for Extreme Scale Computing run. Users can create experiments and conduct trials of hundreds of thousands of fault injections, collect the re- Silicon Photonics Integrated Circuits will soon enable the sults, and build a vulnerability/resiliency profile to gain direct integration of optical connectivity onto the NICs. more information about an application. FSEFI has been Photonic IO through high data rates and wavelength mul- in construction since 2011 and has users at two govern- tiplexing is expected to support Tb/s bandwidth per link ment agencies, several national laboratories, and a handful thereby substantially increasing the available throughput of universities. It is expected to be open sourced by the to the node. In the last 5 years, bandwidth per node in Department of Energy in the near future. dominant HPC systems has only doubled while node com- pute power increased by a factor of 5 and global computer Nathan A. DeBardeleben power by a factor of 13. By suddenly allowing drastically Los Alamos National Lab wider bandwidths, and by reducing the IO area and power [email protected] dissipation, Integrated Photonics is poised to revolution- ize inter-node I/O and trigger in depth changes in HPC Qiang Guan architectures. In this talk, we show how the main charac- Los Alamos National Laboratory teristics of these upcoming photonic links can be estimated, [email protected] how these performance estimates would influence the over- all shape of an Exascale-class interconnect, and finally, how these predictions may influence global architectures. MS57 Keren Bergman, Sebastien Rumley What Error to Expect When You Are Expecting a Columbia University Bit Flip. [email protected], [email protected]

We present a statistical analysis of the expected absolute and relative error given a bit flip in an IEEE-754 scalar. MS58 We model the expected value agnostic of the standard used, Quantum Computing: Opportunities and Chal- and discuss the implications of using different standards, lenges e.g., double vs quad or single vs double when a bit flip is expected. The core of many real-world problems consists of extremely difficult combinatorial optimization challenges that cannot James Elliott be solved accurately by traditional supercomputers in a re- North Carolina State University alistic time frame. A key reason is that classical computing [email protected] systems are not designed to handle the exponential growth in computational complexity as the number of variables Mark Hoemmen increases. In contrast, quantum computers use a physi- Sandia National Laboratories cal evolution process that harnesses quantum mechanical [email protected] effects. In this talk, we will present NASAs experiences with a 1097-qubit D-Wave 2X system that utilizes a pro- Frank Mueller cess known as quantum annealing to solve this class of hard North Carolina State University optimization problems. [email protected] Rupak Biswas NASA Ames Research Center MS57 [email protected] Silent Data Corruption and Error Propagation in Sparse Matrix Computations MS58 Task-Based Execution Models and Co-Design In this talk we highlight silent data corruption in several sparse matrix computations. Error rates are expected to Future HPC machines carry requirements that force impact next generation machines and silent data corrup- changes to hardware and software models. The tension be- tion (SDC) remains a concern as simulations grown size tween these models mandates a co-design approach for jug- and complexity. Here we investigate the propagation of gling compromises. In order to effectively address variabil- SDC in sparse matrix operations such as AMG and other ity, dynamic behavior, power, and resiliency, task-based sparse solvers. We detail the injection library used and execution models are a compromise candidate that neither show how errors propagate through different operations, architecture nor software groups particularly desire – but suggesting different algorithm-based mitigation strategies lack viable alternatives. This talk will showcase some ten- to help reduce the impact of SDC. sion points, how task models address them, and provide results based on OCR. Luke Olson, Jon Calhoun Department of Computer Science Romain Cledat University of Illinois at Urbana-Champaign Intel Corporation [email protected], [email protected] [email protected]

Marc Snir Joshua Fryman University of Illinois at Urbana-Champaign Intel PP16 Abstracts 119

[email protected] Martin Lanser Mathematical Institute Universit¨at zu K¨oln, Germany MS58 [email protected] Emerging Hardware Technologies for Exascale Sys- tems Oliver Rheinbach Technische Universit¨at Bergakademie Freiberg To achieve exascale computing, fundamental hardware ar- [email protected] chitectures must change. While many details of the exas- cale architectures are undefined it is clear that this machine will be a significant departure from current systems. De- MS59 velopers will face increasing parallelism, decreasing single Convergence Estimates for Composed Iterations threaded performance as well as significant changes to the memory hierarchy. This talk will explore the continuum I will define nonlinear preconditioning, a composition tech- of emerging hardware technologies, evaluate their viability nique used to accelerate an iterative solution process, in and briefly discuss the impact on the application program- analogy with the linear case, and demonstrate its effective- mer. ness for representative nonlinear PDE. I will outline the beginning of a convergence theory for these composed it- David Donofrio erations. Lawrence Berkeley National Laboratory [email protected] Matthew G. Knepley Rice University [email protected] *Preferred MS59 Additive and Multiplicative Nonlinear Schwarz MS59 Scaling Nonlinear FETI-DP Domain Decomposi- Multiplicative Schwarz preconditioned inexact Newton tion Methods to 786 452 Cores (MSPIN) is naturally based on partitioning degrees of free- dom in a nonlinear PDE system by field rather than by Nonlinear FETI-DP (Finite Element Tearing and Intercon- subdomain, where a modest factor of concurrency is sac- necting) domain decomposition methods are parallel solu- rificed for robustness. ASPIN (2002), by subdomain, is tion methods for nonlinear implicit problems discretized natural for high concurrency and reduction of global syn- by finite elements. A recent nonlinear FETI-DP method chronization. We augment ASPINs convergence theory for is combined with an approach that allows an inexact solu- the multiplicative case and show that MSPIN can be sig- tion of the FETI-DP coarse problem by a parallel Algebraic nificantly more robust than a global Newton method and Multigrid Method. Weak scalability for two TOP 10 super- than ASPIN. computers, i.e., JUQUEEN at FZ J¨ulich and the complete Mira at Argonne National Laboratory (786,432 cores) are David E. Keyes shown. KAUST [email protected] Axel Klawonn Universit¨at zu K¨oln Lulu Liu [email protected] King Abdullah University of Science and Technology [email protected] Martin Lanser Mathematical Institute Universit MS59 [email protected] A Framework for Nonlinear Feti-Dp and Bddc Do- main Decomposition Methods Martin Lanser Mathematical Institute Solving nonlinear problems, e.g., in material science, re- Universit¨at zu K¨oln, Germany quires fast and highly scalable parallel solvers. FETI-DP [email protected] and BDDC domain decomposition methods meet these re- quirements for linearized implicit finite element problems. Oliver Rheinbach Recently, nonlinear versions have been proposed where the Technische Universit¨at Bergakademie Freiberg nonlinear problem is decomposed before linearization. This [email protected] can be viewed as a strategy to further localize compu- tational work and to extend parallel scalability towards extreme-scale supercomputers. A framework for different MS60 nonlinear FETI-DP and BDDC algorithms will be pre- NUMA-Aware Hessenberg Reduction in the Con- sented. text of the QR Algorithm Axel Klawonn Hessenberg reduction appears in the aggressive early defla- Universit¨at zu K¨oln tion bottleneck in parallel multishift QR algorithms. The [email protected] reduction is memory-bound and performance relies on ef- ficient use of caches and memory buses. We applied the Martin Lanser parallel cache assignment technique recently proposed by Mathematical Institute Castaldo and Whaley and obtained a cache-efficient and Universit NUMA-aware multithreaded algorithm. Numerous tun- [email protected] able parameters were identified and their impact on per- 120 PP16 Abstracts

formance evaluated. The goal is to incorporate runtime for Parallel Sparse Matrix-Vector Multiplication auto-tuning capabilities. The performance of parallel sparse matrix-vector multipli- Mahmoud Eljammaly cation (SpMV for short, calculating y ← Ax,whereA is a Ume˚aUniversity sparse matrix, x and y are dense vectors) highly depends Dept. of Computing Science on the storage format of the input matrix. Recently, many [email protected] formats have been proposed for many-core processors such as GPUs and Xeon Phi. This talk will give a comprehensive Lars Karlsson comparison for SpMV performance and format conversion Ume˚a University, Dept. of Computing Science cost (from the CSR to a new format) of recently developed [email protected] formats including HYB (SC ’09), Cocktail (ICS ’12), ESB (ICS ’13), BCCOO (PPoPP ’14), BRC (ICS ’14), ACSR (SC ’14), CSR-Adaptive (SC ’14, HiPC ’15) and CSR5 MS60 (ICS ’15). Low Rank Approximation of Sparse Matrices for Weifeng Liu Lu Factorization Based on Row and Column Per- Niels Bohr Institute mutations University of Copenhagen [email protected] In this talk we present an algorithm for computing a low rank approximation of a sparse matrix based on a trun- cated LU factorization with column and row permutations. MS61 We present various approaches for determining the column and row permutations that show a trade-off between speed New Building Blocks for Scalable Electronic Struc- versus deterministic/probabilistic accuracy. We show that ture Theory if the permutations are chosen by using tournament pivot- We highlight two central building blocks for scalable ing based on QR factorization, then the obtained truncated electronic structure theory on current high-performance LU factorization with column/row tournament pivoting, computers:(i) New methodological approaches beyond LU CRTP, satisfies bounds on the singular values which semilocal density-functional theory based on a local- have similarities with the ones obtained by a communi- ized resolution-of-indentity approach for the two-electron cation avoiding rank revealing QR factorization. We also Coulomb operator; (ii) a new, general open-source library compare the computational complexity of our algorithm platform ”ELSI” that will provide a seamless interface to with randomized algorithms and show that for sparse ma- address or circumvent the Kohn-Sham eigenvalue problem. trices and certain accuracy, our approach is faster. Current work focuses on the massively parallel eigenvalue Laura Grigori solver ELPA, the orbital minimization method (OMM), INRIA and the PEXSI library. France Volker Blum [email protected] Duke University Department of Mechanical Engineering and Materials Sebastien Cayrols Science INRIA [email protected] [email protected]

James W. Demmel MS61 University of California MOLGW: A Parallel Many-Body Perturbation Division of Computer Science Theory Code for Finite Systems [email protected] The many-body perturbation theory was mostly developed for condensed matter. The successes of the GW approx- MS60 imation to the electron self-energy for the band structure The Challenge of Strong Scaling for Direct Meth- and of the Bethe-Salpeter equation for the absorption spec- ods trum are uncountable for solids. However, very little is known about the performance of many-body perturbation We present an analysis of the problems of getting near theory for atoms and molecules. The code molgw ad- full performance for sparse direct methods, particularly for dresses the efficient and accurate calculation within this symmetric systems, on complex modern hardware. Our framework using modern parallelization, based on MPI and discussion is informed by experience on both modern CPU SCALAPACK. and GPU hardware. We focus on getting high performance on small problems that are still important in many com- Fabien Bruneval munities. The algorithms and techniques described can be CEA, DEN, SRMP used as building blocks to address the solution of larger [email protected] systems on HPC resources. Samia M. Hamed, Tonatiuh Rangel-Gordillo, Jeffrey B. Jonathan Hogg Neaton STFC Rutherford Appleton Lab, UK UC Berkeley (USA) [email protected] [email protected], [email protected], [email protected]

MS60 MS61 Assessing Recent Sparse Matrix Storage Formats GPU-Accelerated Simulation of Nano-Devices PP16 Abstracts 121

from First-Principles [email protected]

This talk will introduce a newly developed algorithm called SplitSolve to take care of the Schr¨odinger equation with MS62 open boundary conditions, as encountered in quantum Recent Advances in the Task-Based StarPU Run- transport calculations. The strengths of SplitSolve are the time Ecosystem significant reduction of the computational time it offers and its ability to fully leverage hybrid nodes with CPUs The task paradigm has allowed the StarPU dynamic run- and GPUs. It thus allows for the transport simulation of time system to achieve scalable performance for various ap- nanostructures composed of thousands of atoms at the ab- plications over heterogeneous systems and clusters thereof. initio (from first-principles) level. We will here discuss the latest advances which allow us to proceed a step further: controlling the memory footprint Mathieu Luisier, Mauro Calderara, Sascha Brueck (notably of applications with compressed data), employing Integrated Systems Laboratory local disks, achieving perfectly reproducible experiments ETH Zurich through simulation, and showing how far we are from per- [email protected], [email protected], fect execution. [email protected] Samuel Thibault University of Bordeaux 1, Inria Hossein Bani-Hashemian, Joost VandeVondele [email protected] Nanoscale Simulations ETH Zurich [email protected], Marc Sergent [email protected] INRIA, France [email protected]

MS61 Suraj Kumar INRIA High-Performance Real-Time TDDFT Excited- France States Calculations [email protected]

From atoms and molecules to nanostructures, we discuss Luka Stanisic how the FEAST solver has considerably broaden the per- INRIA Grenoble Rhˆone-Alpes spectives for enabling reliable and high-performance large- [email protected] scale first-principle DFT and real-time TDDFT calcula- tions. Our TDDFT simulations can operate a broad range of electronic spectroscopy: UV-ViS, X-ray (using MS62 our all-electron method), and near IR (using our TDDFT- Task Dataflow Scheduling and the Quest for Ex- Ehrenfest approach). Various numerical results will be pre- tremely Fine-Grain Parallelism sented and discussed. Task dataflow programming models provide a useful pro- Eric Polizzi gramming abstraction but the underpinning dynamic University of Massachusetts, Amherst, USA schedulers carry a burden that imposes limits on strong [email protected] scaling. This work proposes methods to assess the inher- ent overhead of schedulers and applies them to assess the burden of industry-standard schedulers. We demonstrate MS62 that the scheduler burden explains the performance scala- bility of a set of memory-intensive and compute-intensive Composition and Compartmentalisation As En- codes. Based on this we develop a novel scheduler that abling Features for Data-Centric, Extreme Scale demonstrates increased scalability. Applications: An MPI-X Approach Hans Vandierendonck,MahwishArif Next generation extreme scale programming model is of- Queen’s University Belfast tenenvisionedasMPI+X,wherethesecondlayerXisone Electronics, Electrical Engineering, and Computer Science of OpenMP, OpenACC, etc. Conversely, we advocate a [email protected], [email protected] MPI-X approach, where MPI is only used as communica- tion library and applications are composed assembling in streaming fashion data-parallel kernels described as plain MS62 C++11 classes. They are designed to run on compositional A Single Interface for Multiple Task-Based Parallel accelerators, which can be implemented as cluster of de- Programing Paradigms/Schemes vices, e.g. multi-cores, GPUs, FPGAs or a heterogeneous combination of them. Task-based parallel programming schemes have proven ad- vantageous regarding ease of programming, efficiency and Marco Aldinucci, Maurizio Drocco, Claudia Misale performance. By employing a hierarchical definition of University of Torino tasks and data, locality and bandwidth utilization can be Italy maximized for a specific underlying hardware. However, [email protected], [email protected], [email protected] explicitly encoding each level of parallelism becomes cum- bersome. We present a unified programming interface that Massimo Torquati allows an algorithm written in terms of non-hierarchical University of Pisa tasks to be scheduled hierarchically across different shared Italy and distributed memory environments by a task-based run- 122 PP16 Abstracts

time. [email protected]

Afshin Zafari Uppsala University MS63 Dept. of Information Technology Auto-Tuning for Eigenvalue Solver on the Post [email protected] Moore’s Era Elisabeth Larsson Dense Real-Symmetric eigenvalue problem consists of three Uppsala University, Sweden main procedures. Each of three procedures has a unique Department of Information Technology aspect from the viewpoint of the insensitivity of memory [email protected] access or floating point operations. In the Post-Moores era, the present tuning strategy, which we reduce data traffic cost by CA (Communication Avoidance), SR (Syn- MS63 chronous Reduction), and AS (Asynchronous algorithm), Hardware/Algorithm Co-Optimization and Co- may change back into or mix up with the era of vector Tuning in the Post Moore Era computer. Automatic-tuning helps to find the sub-optimal combination of these strategies. In the Post Moore Era performance and power efficiency is Toshiyuki Imamura obtained from better resource usage, since no free speed up RIKEN Advance Institute for Computational Science due to technology scaling is helping along any more. In this [email protected] talk we will present how hardware specialization and design space exploration in conjunction with software synthesis and autotuning provides a technological basis for contin- ued performance and power gains in the Post Moore Era. MS63 Users write their programs against well-established math Automatic Tuning for Parallel FFTs on Intel Xeon library interfaces like the BLAS and FFTW interface, and Phi Clusters behind the scenes code synthesis and hardware/algorithm co-design leverages dark silicon, 3D stacking, and emerg- In this talk, we propose an implementation of a parallel ing memory technologies to translate substantial functional fast Fourier transform (FFT) with automatic performance blocks into highly optimized hardware constructs that can tuning on Intel Xeon Phi clusters. Parallel FFTs on Intel be software configured. This is work in progress; how- Xeon Phi clusters require intensive all-to-all communica- ever, we have demonstrated the power of this approach tion. An automatic tuning facility for selecting the opti- with space time adaptive processing, an FFT and linear mal parameters of overlapping between computation and algebra heavy kernel. communication, the all-to-all communication, the radices, and the block size is implemented. Performance results of Franz Franchetti FFTs on an Intel Xeon Phi cluster are reported. Department of Electrical and Computer Engineering Carnegie Mellon University Daisuke Takahashi [email protected] Faculty of Engineering, Information and Systems University of Tsukuba [email protected] MS63 On Guided Installation of Basic Linear Algebra Routines in Nodes with Manycore Components MS64 Thermal Comfort Evaluation for Indoor Airflows Most computational systems are composed of multicore Using a Parallel CFD Approach nodes together with several GPUs or Xeon Phi, which might be of different types. This work analyses the adapta- The thermal comfort prediction of occupants in rooms tion of auto-tuning techniques used for hybrid CPU+GPU and vehicles is of great interest during planning and op- systems to these more complex configurations. The study eration phases. With the evermore increasing computa- is conducted on basic linear algebra kernels. Satisfactory tional power available, a great variety of problems can be results are obtained, with experimental execution times solved which were unthinkable until recently. This talk will close to the lowest experimentally obtained. present and discuss coupling procedures and strategies to integrate human thermoregulation based on Fialas model Javier Cuenca into an in-house CFD code specifically designed with par- Departamento de Ingeniera y Tecnologa de Computadores allel systems in mind to investigate indoor airflows. University of Murcia [email protected] J´erˆome Frisch Institute of Energy Efficiency and Sustainable Building Luis-Pedro Garc´ıa E3D Servicio de Apoyo a la Investigaci´on Cient{´ı}fica RWTH Aachen University Technical University of Cartagena [email protected] [email protected] Ralf-Peter Mundani Domingo Gim´enez TUM, Faculty of Civil, Geo and Environmental Departamento de Inform´atica y Sistemas Engineering University of Murcia Chair for Computation in Engineering PP16 Abstracts 123

[email protected] Chair for computation in engineering TU M¨unchen [email protected] MS64 Fast Contact Detection for Granulates J´erˆome Frisch Institute of Energy Efficiency and Sustainable Building We study the problem of contact detection between large E3D numbers of triangles resulting from meshed objects. This RWTH Aachen University problem arises in FSI and DEM, e.g., and classic algorithms [email protected] from computational geometry come along with many case distinctions. We propose an iterative, weak solver that Ralf-Peter Mundani vectorises but may fall back to a classic formulation if the TUM, Faculty of Civil, Geo and Environmental Newton steps do not converge quickly. Based upon a tree Engineering domain decomposition, it furthermore is parallelised via Chair for Computation in Engineering MPI. [email protected] Tomasz Koziara Durham University Ernst Rank [email protected] Computation in Engineering Technische Universit¨at M¨unchen Konstantinos Krestenitis [email protected] Durham University Durham, UK MS65 [email protected] A Flexible Area-Time Model for Analyzing and Jon Trevelyan, Tobias Weinzierl Managing Application Resilience Durham University Traditional error recovery systems such as checkpoint- [email protected], to- restart fix key aspects of the error model, detection timing, [email protected] and recovery approach. As a result, these methods have limited ability to adapt to different error types, statistics, MS64 or algorithm behavior. We have designed a new model, area-time, a general, flexible framework for reasoning about LB Particle Coupling on Dynamically-Adaptive errors, detection, and response. We apply this model to Cartesian Grids several specific application examples to demonstrate its Dynamically-adaptive, tree-structured grids are a well- utility. known way of solving partial differential equations with Andrew A. Chien less computational effort than regular Cartesian grids. We The University of Chicago focus on a minimally-invasive approach using p4est, a well- Argonne National Laboratory known framework for such grids, which allows reusing huge [email protected] parts of the existing code. We integrated p4est as a new mesh in ESPResSo, a large MD-simulation package. In our contribution, we present implementational details and Aiman Fang performance data for various background flow scenarios. Department of Computer Science University of Chicago Michael Lahnert [email protected] University of Stuttgart [email protected] MS65 Miriam Mehl Local Recovery at Extreme Scales with FenixLR: Universit¨at Stuttgart Implementation and Evaluation [email protected] This talk explores how local recovery can be used for certain classes of applications to reduce overheads due MS64 to resilience. Specifically, it is centered on the develop- High-Performance Computing in Free Surface ment of programming support and scalable runtime mecha- Fluid Flow Simulations nisms to enable online and transparent local recovery under the FenixLR framework. We integrate these mechanisms Nowadays, more and more questions originate from the so- with the S3D combustion simulation, and experimentally called emerging sciences such as medicine, sociology, biol- demonstrate (using Titan) the ability to tolerate high fail- ogy, virology, chemistry, climate or geo-sciences. Here, we ure rates (every 5 seconds) at scales up to 262144 cores. present an advanced simulation and (visual) exploration concept for various environmental engineering applications Marc Gamell such as porous media flows and high water simulations. NSF Cloud and Autonomic Computing Center, Rutgers Based on complex 3D building information and environ- Discovery mental models, we can perform multi-resolution simula- Informatics Institute, Rutgers University tions on massive parallel systems combined with an inter- [email protected] active exploration of the computed data during runtime. Keita Teranishi, Michael Heroux Sandia National Laboratories Nevena Perovic [email protected], [email protected] 124 PP16 Abstracts

Manish Parashar ence Challenges Department of Computer Science Rutgers University Abstract not available [email protected] Venkatram Vishwanath Argonne Leadership Computing Facility, Argonne MS65 National Lab. Spare Node Substitution of Failed Node [email protected]

Spare node substitution for a failed node is not a new idea, but its behavior is not well-studied so far. Spare node sub- MS66 stitution may preserve the communication pattern and the Complex In Situ and In Transit Workflows number of nodes of a user program. Thus, it can drasti- cally reduce the effort of fault handling at the user pro- Abstract not available gram. However, actual communication topology at physi- cal network cannot be the same and this can result in the Matthew Wolf severe degradation of communication performance. In this Georgia Institute of Technology talk, recent research result focusing on this topic will be [email protected] presented.

Atsushi Hori, Kazumi Yoshinaga, Yutaka Ishikawa MS67 RIKEN Advanced Institute for Computational Science Addressing the Future Needs of Uintah on Post- [email protected], [email protected], yu- Petascale Systems [email protected] The need to solve ever-more complex science and engineer- ing problems on computer architectures that are rapidly MS65 evolving poses considerable challenges for modern simula- Will Burst Buffers Save Checkpoint/Restart? tion frameworks and packages. While the decomposition of software into a programming model that generates a task- Exascale systems are projected to fail frequently and par- graph for execution by a runtime system make portability allel file system performance is not expected to match the and performance possible, it also imposes considerable de- increase in computational speed. Burst buffers can po- mands on that runtime system. The challenge of meeting tentially solve this problem by smoothing out bursty I/O those demands for near-term and longer term systems in requests, and reducing load on parallel file systems. In this the context of the Uintah framework is considered in key talk, we will explore the tradeoffs of using burst buffers as areas such as support for data structures, tasks on het- part of storage systems, and how fault tolerance libraries erogeneous architectures, performance portability, power can exploit burst buffers to provide increased performance management and designing for resilience by the use of repli- to user applications. cation based upon adaptive mesh refinement processes is addressed. Kathryn Mohror Lawrence Livermore National Laboratory Martin Berzins [email protected] Scientific Computing and Imaging Institute University of Utah [email protected] MS66 In Situ Processing Overview and Relevance to the Hpc Community MS67 Progress on High Performance Programming Abstract not available Frameworks for Numerical Simulations

E. Wes Bethel This talk reports the progress of high performance numeri- Lawrence Berkeley National Laboratory cal simulation frameworks in China. These frameworks are [email protected] designed to separate parallel algorithm design from parallel programming, which eases challenging application develop- ment even with the rapid growth of supercomputers. The MS66 domain-specific software architectures and components are Overview of Contemporary In Situ Infrastructure introduced on top of general parallel programming systems, Tools and Architectures namely MPI and OpenMP. A variety of multi-petaflops ap- plications are developed on top of these frameworks, illus- Abstract not available trating the success of our approaches.

Julien Jomier Zeyao Mo Kitware Laboratory of Computational Physics, IAPCM [email protected] P.O. Box 8009, Beijing 100088, P.R. China zeyao [email protected] Patrick O’Leary Kitware, Inc., USA [email protected] MS67 Can We Really Taskify the World? Challenges for Task-Based Runtime Systems Designers MS66 In Situ Algorithms and Their Application to Sci- To Fully tap into the potential of heterogeneous manycore PP16 Abstracts 125

machines, the use of runtime systems capable of dynam- Climate Networks ically scheduling tasks over the pool of underlying com- puting resources has become increasingly popular. Such Over the last decade, the techniques of complex network runtime systems expect applications to generate a graph analysis have found application in climate research. Here of tasks of sufficient ”width” so as to keep every process- the observation locations serve as nodes and edges (links) ing unit busy. However, not every application can exhibit are based on statistical measures of similarity, e.g. a cor- enough task-based parallelism to occupy the tremendous relation coefficient, between pairwise time series of climate number of processing units of upcoming supercomputers. variables at these different locations. Many studies in Nor can they generate tasks of appropriate granularity for the field focused on investigating correlation patterns in CPUs and accelerators. In this talk, I will briefly present the atmospheric surface temperature and teleconnections the main features provided by the StarPU task-based run- as well as the synchronization between climate variabil- time system, and I will go through the major challenges ity patterns. In most studies above so called interaction that runtime systems’ designers have to face in the upcom- networks were used. Up to now, the data sets (both obser- ing years. vational and model based) used to reconstruct and analyze climate networks have been relatively small due to compu- Raymond Namyst tational limitations. As a consequence, coarse-resolution University of Bordeaux 1, Inria observational and model data have been used with a fo- [email protected] cus only on large-scale properties of the climate system. This system is, however, known for its multi-scale interac- tions and hence one would like to explore the interaction MS67 of processes over the different scales. Data are available Exascale Software through high-resolution ocean/atmosphere/climate model simulations but they lead to networks with at least 105ˆ Exascale software requries innovation on many levels, such nodes and hence imposing a compelling need for new al- as scalability to beyond a million cores, energy awareness, gorithms and parallel computational tools that can enable and algorithm based resilience. Future supercomputers will high-performance graph reconstruction and analysis. In only deliver their full potential when applications, mathe- this talk we introduce the application of complex graph matical models, and numerical algorithms are adapted to tools in climate research as well as our complete toolbox ever more complex architectures in a co-design process that ParGraph designed for parallel computing platforms, which is based on the systematic engineering for performance in is capable of the preprocessing of large number of climate all stages of the the computational science pipeline. time series and the calculation of pairwise statistical mea- sures, leading to the construction of massive node climate Hans-Joachim Bungartz networks. In addition, ParGraph is provided with a set of Technische Universit¨at M¨unchen, Department of high-performance network analyzing algorithms designed Informatics for symmetric multiprocessing machines (SMPs). Chair of Scientific Computing in Computer Science [email protected] Hisham Ihshaish UWE Bristol Ulrich J. Ruede Faculty of Environment & Technology University of Erlangen-Nuremberg [email protected] Department of Computer Science (Simulation) [email protected] Alexis Tantet IMAU Philipp Neumann University of Utrecht Technische Universit¨at M¨unchen, Germany [email protected] Scientific Computing in Computer Science [email protected] Johan Dijkzeul VORtech BV Delft, The Netherlands MS68 [email protected] The GraphBLAS Effort: Kernels, API, and Paral- lel Implementations Henk A Dijkstra Department of Physics and Astronomy, Utrecht I will first give an overview of the GraphBLAS effort for University standardizing graph algorithmic building blocks in the lan- The Netherlands guage of linear algebra, including its scope, the set of ker- [email protected] nels involved and the proposed API. I will then report on recent results in developing high-performance implemen- tations of some of the most important GraphBLAS ker- MS68 nels such as sparse matrix-matrix multiplication and sparse Generating and Traversing Large Graphs in matrix-vector multiplication in clusters of multicore pro- External-Memory. cessors. Large graphs arise naturally in many real world applica- Aydin Buluc tions. The actual performance of simple RAM model algo- Lawrence Berkeley National Laboratory rithms for traversing these graphs (stored in external mem- [email protected] ory) deviates significantly from their linear or near-linear predicted performance because of the large number of I/Os they incur. In order to alleviate the I/O bottleneck, many MS68 external memory graph traversal algorithms have been de- Parallel Tools to Reconstruct and Analyze Large signed with provable worst-case guarantees. In the talk I 126 PP16 Abstracts

highlight some techniques used in the design and engineer- [email protected] ing of such algorithms and survey the state-of-the-art in I/O-efficient graph traversal algorithms. I will also report on recent work concerning the generation of massive scale MS69 free networks under resource constraints. Real-Valued Technique for Solving Linear Systems Arising in Contour Integral Eigensolver Ulrich Meyer Goethe University Frankfurt am Main Contour integral eigenvalue solvers such as the Sakurai- [email protected] Sugiura method and the FEAST algorithm have been de- veloped in the last decade. The main advantage of the con- tour integral solvers is their inherent parallelism. However MS68 they require linear system solving with complex matrices even if the matrices of the problem are real. In this study, (Generating And) Analyzing Large Networks with we propose a real-valued technique for solving linear sys- NetworKit tems arising in the contour integral methods, which only involves factorizations of real matrices. Complex networks are equally attractive and challenging targets for data mining. Novel algorithmic solutions, in- Yasunori Futamura cluding parallelization, are required to handle data sets University of Tsukuba containing billions of connections. NetworKit is a grow- [email protected] ing open-source software package for the analysis of such large complex networks on shared-memory parallel com- Tetsuya Sakurai puters. It is designed to deliver the best of two worlds: the Department of Computer Science performance of C++ algorithms and data structures (with University of Tsukuba parallelism by OpenMP) and a convenient Python frontend [email protected] for interactive workflows. The current feature set includes not only various fundamental analytics kernels and a col- lection of basic graph generators. Several novel algorithms MS69 for scalable network analysis have been realized with Net- worKit as well recently. Of those, we will present network Parallel Eigensolvers for Density Functional The- profiling, fast algorithms for computing centrality values ory (importance of nodes/edges) in different scenarios, and the Density functional theory (DFT) aims to solve the first generator of hyperbolic random graphs that takes sub- Schr¨odinger equation by modelling electronic correlation as quadratic time. 3 a function of density. Its relatively modest O(N ) scaling makes it the standard method in electronic structure com- Henning Meyerhenke putation for condensed phases containing up to thousands Karlsruhe Institute of Technology (KIT) of atoms. Computationally, its bottleneck is the partial di- Institute of Theoretical Informatics agonalisation of a Hamiltonian operator, which is usually [email protected] not formed explicitely. Using the example of the Abinit code, I will discuss the challenges involved in scaling plane- Elisabetta Bergamini wave DFT computations to petascale supercomputers, and Institute of Theoretical Informatics show how the implementation of a new method based on Karlsruhe Institute of Technology (KIT) Chebyshev filtering results in good parallel behaviour up [email protected] to tens of thousands of processors. I will also discuss some open problems in the numerical analysis of eigensolvers and Moritz von Looz, Roman Prutkin, Christian L. Staudt extrapolation methods used to accelerate the convergence Karlsruhe Institute of Technology (KIT) of fixed point iterations. Institute of Theoretical Informatics [email protected], [email protected], Antoine Levitt [email protected] Ecole des Ponts ParisTech [email protected]

MS69 MS69 Advancing Algorithms to Increase Performance of A Parallel Algorithm for Approximating the Sin- Electronic Structure Simulations on Many-Core gle Particle Density Matrix for O(N) Electronics Architectures Structure Calculations

Effective utilization of many-core architectures for large- We present two strategies for approximating the entries of scale electronic structure calculations requires fast, com- the single particle density matrix that are needed for O(N) putationally scalable and efficient computational tools uti- solution of the DFT equations. These strategies are based lizing novel algorithmic approaches on the latest hardware on the use of diagonalization or a Chebyshev polynomial technologies. In this presentation we will discuss results, approximation to compute the matrix exponential function lessons learned as well as pitfalls resulting from our al- using a principal submatrix of the global Hamiltonian ma- gorithmic efforts and code optimizations to transform the trix. We discuss pros and cons of each strategy within the NWChem computational chemistry software to utilize sys- context of developing a parallel and scalable algorithm for tems with Intel based multicore architectures connected by electronics structure calculations. fast communication networks. Daniel Osei-Kuffuor Wibe A. De Jong Lawrence Livermore National Laboratory Lawrence Berkeley National Laboratory oseikuff[email protected] PP16 Abstracts 127

Jean-Luc Fattebert how to extend the OpenMP task-model to support dis- Lawrence Livermore National Lab. tributed computing through GPUs and how the model fur- [email protected] ther can be used to drive hardware generation for FPGAs.

Artur Podobas MS70 KTH Royal Institute of Technology Resilience in Distributed Task-Based Runtimes (2) [email protected]

Abstract not available MS71 George Bosilca PFLOTRAN: Practical Application of High Per- University of Tennessee - Knoxville formance Computing to Subsurface Simulation [email protected] As subsurface simulation becomes more complex with the addition of physical and chemical processes, careful soft- MS70 ware engineering must be employed to preserve parallel Developing Task-Based Applications with Py- performance. This presentation will discuss a recent refac- compss tor of PFLOTRANs high-level process model coupling that has facilitated the integration of new processes (e.g. geo- COMP Superscalar (COMPSs) is a task-based program- physics, spent nuclear fuel degradation/dissolution) while ming model which aims to ease the development of parallel maintaining scalability. Examples of PFLOTRAN and applications for distributed infrastructures, such as Clus- high performance computing being applied to simulate ters, Grids and Clouds, by exploiting the inherent paral- practical, real-world subsurface problem scenarios will be lelism of applications at execution time. In particular, it included. is able to handle Python applications through its Python binding (PyCOMPSs), which provides an API and dec- Glenn Hammond orators that can be used in order to adapt and manage Sandia National Laboratories sequential python applications, hiding the complexities of [email protected] parallelization.

Javier Conejero MS71 Barcelona Supercomputing Centre Parallel Performance of a Reservoir Simulator in Spain the Context of Unstable Waterflooding [email protected] We consider the parallel performance of a reservoir sim- MS70 ulator in the context of water injection in viscous oils. Such problems usually require very fine grids and high- DuctTeip: A Framework for Distributed Hierarchi- order schemes in order to capture the complex fingering cal Task-Based Parallel Computing patterns induced by viscous instabilities. In this context We present a dependency-aware task parallel programming parallel performance is very important and the scalability framework for hybrid distributed-shared memory paral- of the linear solver becomes critical. We compare different lelization. Both data and tasks in the framework are hi- strategies for the Algebraic Multigrid method among which erarchically organized in order to exploit the underlying a new hybrid aggregation/point-wise coarsening strategy. heterogeneous hardware architecture. A data centric view Romain De Loubens based on a versioning system is employed. By executing an Total application code in simulation mode, the choice of data dis- [email protected] tribution and blocking scheme can be tuned a priori. The performance of the framework is evaluated against other similar efforts. Pascal Henon TOTAL E&P Elisabeth Larsson [email protected] Uppsala University, Sweden Department of Information Technology Pavel Jiranek [email protected] CDSP [email protected] Afshin Zafari Uppsala University Dept. of Information Technology MS71 [email protected] Ongoing Development and HPSC Applications with the ParFlow Hydrologic Model

MS70 ParFlow is a three-dimensional variably saturated Heterogeneous Computing with GPUs and FPGAs groundwater-surface water flow code that is currently ap- Through the OpenMP Task-Parallel Model plied at space and time scales up to continents (Europe, USA) and decades. This is made possible by the efficient The task-based programming model has become one the parallelism and, e.g. an advanced octree space-partitioning prominent models for parallelizing software on homoge- algorithm implemented in ParFlow. Ongoing advance- neous shared-memory architectures. However, because ments that will be illustrated with subsurface applications Dennard’s scaling no longer holds, we have to leave ho- include software infrastructure for coupled energy trans- mogeneity behind and move towards a more heterogeneous port; integration of the p4est meshing library; and inte- (possibly distributed) solution. In this talk, we will show gration of ParFlow in TerrSysMP for terrestrial systems 128 PP16 Abstracts

modeling. [email protected]

Stefan J. Kollet Meteorological Institute MS72 University of Bonn Title Not Available [email protected] Abstract not available Klaus Goergen J¨ulich Supercomputing Centre Thierry Coupez Research Centre J¨ulich MINES ParisTech [email protected] [email protected]

Carsten Burstedde MS72 Universit¨at Bonn [email protected] Bb-Amr and Numerical Entropy Production Abstract not available Jose Fonseca Institute for Numerical Simulation Fr´ed´eric Golay Bonn University Universit´e de Toulon, France fonseca.ins.uni-bonn.de [email protected]

Fabian Gasper Agrosphere Research Center J¨ulich MS72 [email protected] Title Not Available

Reed M. Maxwell Abstract not available Department of Geology and Geological Engineering and IGWMC Pierre Kestener Colorado School of Mines maisson de la simulation [email protected] [email protected]

Prabhakar Shresta, Mauro Sulis MS73 Meterological Institute Bonn University How Will Application Developers Navigate the [email protected], [email protected] Cost/Peformance Trade-Offs of New Multi-Level Memory Architectures?

MS71 Data movement and memory bandwidth are becoming the limiting performance factors in many applications. Hybrid Hybrid Dimensional Multiphase Compositional memory cubes (HMC), high-bandwidth memory (HBM), Darcy Flows in Fractured Porous Media and Par- and cheap capacity NVARM memories are now becoming allel Implementation in Code ComPASS available. Performance must balance cost since high band- width memory will be far too expensive at high capac- We consider the Hybrid dimensional multiphase composi- ity and NVRAM will not provide needed bandwidth, re- tional Darcy flows in fractured porous media. The model quiring both technologies to be leveraged. Managing com- accounts for the coupling of the mass balance of each com- plex memory hierarchies will prove a serious programming ponent with the pore volume conservation and the ther- burden for domain science. Here we present a simulation modynamical equilibrium and dynamically manages phase framework and profiling tools for analyzing memory access appearance and disappearance. We implement this model profiles to enable runtime tools and algorithm refactoring in the framework of code ComPASS (Computing Parallel to fully exploit new memory architectures. Architecture to Speed up Simulation), which focuses on parallel high performance simulation (MPI). Arun Rodrigues Sandia National Laboratories Feng Xing [email protected] University of Nice Inria [email protected] MS73 Performance-Energy Trade-Offs in Reconfigurable Roland Masson Optical Data-Center Networks Using Silicon Pho- University of Nice Sophia Antipolis tonics [email protected] Optical circuits can provide large bandwidth improvements relative to electrical-switched networks in HPC systems. MS72 Optical circuit setup, however, can have high latencies. Title Not Available Application-guided, latency-avoiding circuit management techniques for dynamic reconfiguration have shown poten- Abstract not available tial for better achieving the full throughput of photonics- based networks. We extend these ideas to dynamically Donna Calhoun reconfiguring topologies like dragonfly to match applica- Boise State University tion communication patterns. We further explore where PP16 Abstracts 129

runtime analysis tools can automatically manage circuits tree as our data structure. To address the load-imbalance without changing application code. problem for distributed-memory Lagrangian schemes, we introduce a novel and robust partitioning scheme. Sebastien Rumley,KerenBergman Columbia University Arash Bakhtiari [email protected], [email protected] Technische Universitaet Muenchen [email protected] MS73 Dhairya Malhotra Helping Applications Reason About Next- Institute of Computational Engineering and Sciences Generation Architectures Through Abstract The University of Texas at Austin Machine Models [email protected] To achieve performance under increasing energy con- straints, next-generation architectures will undergo funda- George Biros mental changes to include more heterogeneous processors, University of Texas at Austin network-on chip topologies, or even multi-level memories. [email protected] These architectures will inevitably change how programs are written. Detailed specifications do not enable commu- nity engagement or design space exploration with applica- MS74 tion developers. Here we present abstract machine mod- A Volume Integral Equation Solver for Elliptic els (AMMs) and corresponding proxy architectures that PDEs in Complex Geometries provide useful starting points for reasoning about appli- cations on extreme-scale hardware. Where appropriate, We present a new framework for solving volume inte- architecture details are either ignored, abstracted, or fully gral equation (VIE) formulations of Poisson, Stokes and exposed in the AMMs. We survey the programming impli- Helmholtz equations in complex geometries. Our method cations of different AMMs and illusrate their usefulness in is based on a high-order discretization of the data on adap- co-designing applications. tive octrees and uses a kernel independent fast multipole method along with precomputed quadratures for evaluat- John Shalf, David Donofrio ing volume integrals efficiently. We will present new per- Lawrence Berkeley National Laboratory formance optimizations and demonstrate scalability of our [email protected], [email protected] method on distributed memory platforms with thousands of compute nodes. MS73 Dhairya Malhotra When Is Energy Not Equal to Time? Understand- Institute of Computational Engineering and Sciences ing Application Energy Scaling with EAudit The University of Texas at Austin Under future power caps, improving energy efficiency will [email protected] become a critical component of sustaining improvements in time to solution. However, due to fundamental differences George Biros between how energy and time scale with parallelism, time University of Texas at Austin is not a good proxy for the distribution of energy in an [email protected] application. This talk covers energy analogs to Amdahls law to understand weak/strong scaling of application en- ergy and its use in eAudit a tool for energy debugging of MS74 applications. Efficient Parallel Simulation of Flows, Particles, Sudhakar Yalamanchili and Structures The School of Electrical and Computer Engineering Georgia Institute of Technology The efficient simulation of flow scenarios involving the [email protected] transport of particles and the interaction with other phys- ical phenomena such as electrostatics or elastic struc- tures usually requires high grid resolutions due to accu- Eric Anger racy requirements and different spatial/temporal scales. Georgia Institute of Technology We present several application cases for the efficient use [email protected] of octree-like adaptive computational grids in large exist- ing simulation environments. Such grids are an ideal tool MS74 in this context due to their minimal memory requirements and flexible adaptivity. A Distributed-Memory Advection-Diffusion Solver with Dynamic Adaptive Trees Miriam Mehl We present a numerical method to solve the Advection- Universit¨at Stuttgart Diffusion problem. In our approach, we solve the ad- [email protected] vection equation by using the semi-Lagrangian method. The solution of the diffusion equation is computed by Michael Lahnert applying an integral transform: a convolution with the University of Stuttgart fundamental solution of the modified Laplace PDE. For [email protected] time-marching, the Stiffly-Stable method is used. These methods are implemented on top of a dynamic spatially- Benjamin Uekermann adaptive distributed-memory parallelized Chebyshev oc- TU Munich 130 PP16 Abstracts

[email protected] calculation of electronic structure of materials and simula- tion of fluid-structure interaction.

MS74 Deng Lin Communication-Reducing and Communication- Laboratory of Scientific and Engineering Computing, Avoiding Techniques for the Particle-in-Dual-Tree Chinese Academy of Sciences, Beijing, China Storage Scheme [email protected]

We study particle-in-cell with suprathermal particles Wei Leng driven by a PDE. The particles do not interact but some Laboratory of Scientific and Engineering Computing jump over cells. It is a HPC worst-case for its low arith- Chinese Academy of Sciences, Beijing, China metic intensity and re-sorting which makes it latency- and [email protected] bandwidth-sensitive. Three ideas help: A tree grammar predicts movements and allows us to skip communication. Tao Cui Run-length encoding for integer data along domain bound- LSEC, the Institute of Computational Mathematics aries reduces the memory footprint. A hierarchical floating Chinese Academy of Sciences point representation reduces the byte count further. [email protected] Tobias Weinzierl Durham University MS75 [email protected] Cello: An Extreme Scale Amr Framework, and Enzo-P, An Application for Astrophysics and Cos- MS75 mology Built on Cello Dune - A Parallel Pde Framework We present Cello, an extremely scalable AMR software in- The Distributed and Unified Numerics Environment frastructure for post-petascale architectures, and Enzo-P, (DUNE, www.dune-project.de) is a software framework for a multiphysics application for astrophysics and cosmology the numerical solution of partial differential equations with simulations built on Cello. Cello implements a forest-of- grid-based methods started by several groups in Germany octrees spatial mesh on top of the CHARM++ parallel ob- in 2002. Its main goals are combining (1) flexibility on the jects framework developed at UIUC. Leaf nodes of the oc- one hand with (2) efficiency on the other hand while allow- trees are blocks of fixed size, and are represented as chares ing also the (3) reuse of existing finite element software. A in a hierarchical chare array. Cello supports particle and central aspect of DUNE is its abstract mesh interface al- field classes and operators for the construction of the ex- lowing the formulation of generic algorithms operating on plicit finite-difference/volume methods, N-body methods, a large variety of different types of meshes (any-d, simpli- and sparse linear system solvers called from Enzo-P. We cial, cuboid, hierarchical, conforming and non-conforming, present initial performance and scaling results on a large different refinement rules, ...). Within the framework of cosmological test problem. the german priority programme 1648 ‘Software for exa- scale computing’ DUNE is currently prepared for future Michael L. Norman exa-scale hardware. The talk will highlight the progress in University of California at San Diego exploiting parallelism on all different levels (MPI, thread- [email protected] ing, SIMD) and present results on high-order discontinuous Galerkin methods. MS75 Christian Engwer Peta-scale Simulations with the HPC Software INAM, University of M¨unster Framework Walberla [email protected] We present the HPC software framework waLBerla that Steffen M¨uthing has demonstrated excellent parallel performance on differ- Heidelberg University ent peta-scale supercomputers for simulations based on the steff[email protected] lattice Boltzmann as well as the phase-field method. The presentation focuses on the distributed core data structures Peter Bastian of the framework that enable block adaptive mesh refine- Interdisciplinary Center for Scientific Computing ment. We will discuss the performance-driven development University of Heidelberg process and the continuous integration pipeline that guar- [email protected] antees compatibility with a large number of different com- pilers and systems through fully automatic testing.

MS75 Florian Schornbaum University of Erlangen-Nuremberg PHG: A Framework for Parallel Adaptive Finite fl[email protected] Element Method

PHG (Parallel Hierarchical Grid) is a general framework for Martin Bauer developing parallel adaptive finite element method appli- University of Erlangen-Nuremberg cations. The key feature of PHG includes: bisection based Department of Computer Science conforming adaptive tetrahedral meshes, various finite ele- [email protected] ment bases support and hp adaptivity, finite element code automatic generation, etc. We also introduce a few appli- Christian Godenschwager, Ulrich R¨ude cations based on PHG, including earthquake simulation, University of Erlangen-Nuremberg PP16 Abstracts 131

[email protected], [email protected] responding vectors, amplifying the focus on the largest- magnitude entries. Experimental results indicate that di- amond sampling is orders of magnitude faster than direct MS76 computation and requires far fewer samples than any com- Communication Avoiding Ilu Preconditioner peting approach. We also apply diamond sampling to the special case of maximum inner product search, and get sig- In this talk, we present a new parallel preconditioner called nificantly better results than the state-of-the-art hashing CA-ILU. It performs ILU(k) factorization in parallel and is methods. applied as preconditioner without communication. Block Jacobi and Restrictive Additive Schwartz preconditioners Grey Ballard are compared to this new one. Numerical results show Sandia National Laboratories that CA-ILU(0) outperforms BJacobi and is close to RAS [email protected] in term of iterations whereas CA-ILU, with complete LU on each block, does for both of them but uses lot of mem- C. Seshadhri ory. Thus CA-ILU(k) takes advantages of both in term of Sandia National Labs memory consumption and iterations. [email protected] Sebastien Cayrols INRIA Tamara G. Kolda [email protected] Sandia National Laboratories [email protected] Laura Grigori INRIA Ali Pinar France Sandia National Labs [email protected] [email protected]

MS76 MS76 Parallel Approximate Graph Matching Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs We study approximate matching heuristics for undirected graphs by extending our recent work which focused on bi- Graph-structured data in network security, social networks, partite graphs. The approach uses a judiciously sampled finance, and other applications not only are massive but subgraph of the input graph and search for a maximum also under continual evolution. The changes often are scat- matching in it. The approach is highly parallelizable and tered across the graph, permitting novel parallel and incre- we are currently working on the theoretical approximation mental analysis algorithms. We discuss analysis algorithms guarantee. It can be used in practice to avoid worst cases for streaming graph data to maintain both local and global while constructing an initial matching. Our experiments metrics with low latency and high efficiency. demonstrate parallel speed-ups and show that the approx- imation guarantees do not deteriorate with the increased Jason Riedy degree of parallelization. Georgia Institute of Technology School of Computational Science and Engineering Fanny Dufoss´e [email protected] Inria Lille, Nord Europe, 59650, Villeneuve dAscq, France David A. Bader [email protected] Georgia Institute of Technology [email protected] Kamer Kaya Sabanci University, Faculty of Engineering and Natural Sciences MS77 [email protected] A Block Low-Rank Multithreaded Factorization for Dense BEM Operators. Bora Ucar LIP, ENS-Lyon, France. We consider a Block Low-Rank multithreaded LDU fac- [email protected] torization for BEM electromagetics applications. In this context, both the computation and accumulation of up- dates appear to be critical to ensure good performance. MS76 In both kernels of the factorization, each matrix may be Diamond Sampling for Approximate Maximum dense or low rank. First is a matrix-matrix multiply, where All-pairs Dot-product (MAD) Search outer-orthogonality of the representation is key. Second is a matrix-matrix merge, where there are many strategies in Given two sets of vectors A and B, our problem is to find a multithreaded environment. the top-t dot products among all possible pairs. This is a fundamental mathematical problem that appears in numer- Julie Anton ous data applications involving similarity search, link pre- LSTC diction, and collaborative filtering. We propose a sampling- [email protected] based approach that avoids direct computation of all dot products. We select diamonds (i.e., four-cycles) from the Cleve Ashcraft weighted tripartite representation of A and B. The prob- Livermore Software Technology ability of selecting a diamond corresponding to pair (i,j) Corporation is proportional to the square of the product of the cor- [email protected] 132 PP16 Abstracts

Cl´ement Weisbecker [email protected] LSTC [email protected] Alfredo Buttari CNRS-IRIT-Universit´e de Toulouse, France [email protected] MS77 Complexity and Performance of Algorithmic Vari- Pieter Ghysels ants of Block Low-Rank Multifrontal Solvers Lawrence Berkeley National Laboratory Computational Research Division Matrices coming from elliptic Partial Differential Equa- [email protected] tions (PDEs) have been shown to have a low-rank property: well defined off-diagonal blocks of their Schur complements can be approximated by low-rank products and this prop- Jean-Yves L’Excellent erty can be efficiently exploited in multifrontal solvers to INRIA-LIP-ENS Lyon provide a substantial reduction of their complexity. Among [email protected] the possible low-rank formats, the Block Low-Rank for- mat (BLR) is easy to use in a general purpose multifrontal Xiaoye Sherry Li solver and has been shown to provide significant gains com- Computational Research Division pared to full-rank on practical applications. However, un- Lawrence Berkeley National Laboratory like hierarchical formats, such as H and HSS, its theoretical [email protected] complexity was unknown. In this talk, we present several variants of the BLR multifrontal factorization, depending Theo Mary on the strategies used to perform the updates in the frontal Universite de Toulouse, INPT ENSEEIHT matrices and on the approaches to handle numerical pivot- [email protected] ing. We extend the theoretical work done on hierarchical matrices in order to compute the theoretical complexity of Cl´ement Weisbecker, Cl´ement Weisbecker each BLR variant. We then provide an experimental study LSTC with numerical results to support our complexity bounds. [email protected], [email protected] Finally, we compare and analyze the performance of each variant in a parallel setting. MS77 Patrick Amestoy Hierarchical Matrix Arithmetics on Gpus INP(ENSEEIHT)-IRIT [email protected] GPU implementations of hierarchical matrix operations are challenging as they require irregular data structures and Alfredo Buttari multilevel access patterns. Careful design, data layout, CNRS-IRIT-Universit´e de Toulouse, France and tuning are needed to allow the effective mapping of [email protected] the rank-structured representations and associated opera- tions to manycore architectures. In this talk, we describe Jean-Yves L’Excellent the design of GPU-hosted, tree-based, representations and INRIA-LIP-ENS Lyon algorithms that can compress dense matrices into a hier- [email protected] archical format and perform linear algebra operations di- rectly on the compressed representations in a highly effi- Theo Mary cient manner. Universite de Toulouse, INPT ENSEEIHT [email protected] Wajih Halim Boukaram KAUST [email protected] MS77 A Comparison of Parallel Rank-Structured Solvers Hatem Ltaief King Abdullah University of Science & Technology We present a direct comparison of different rank-structured (KAUST) solvers: STRUMPACK (that implements HSS representa- [email protected] tions, by the authors), MUMPS (Block Low-Rank repre- sentations, Amestoy et al.), and HLIBPro (H-arithmetic, George M Turkiyyah Kriemann et al.). We report on results for large dense and American University of Beirut sparse problems from academic and industrial applications, [email protected] in a distributed-memory context. David E. Keyes Francois-Henry Rouet KAUST Lawrence Berkeley National Laboratory [email protected] [email protected]

Patrick Amestoy MS78 ENINPT-IRIT, Universit´e of Toulouse, France Traversing a BLR Factorization Task Dag Based on [email protected] or [email protected] a Fan-All Wraparound Map

Cleve Ashcraft We consider the factorization of a BLR matrix partitioned Livermore Software Technology by block rows and columns. The atomic factor, solve and Corporation MMM tasks merge with submatrices to create a task DAG, PP16 Abstracts 133

which we embed in a 3D Cartesian mesh. Then the atomic Fr´ed´eric Vivien tasks are scheduled to processors block wise using a 3D LIP, ENS Lyon, France wraparound map. Using p processors, both message counts [email protected] and volume grow as p1/3. Oliver Sinnen Julie Anton University of Auckland LSTC [email protected] [email protected]

Cleve Ashcraft MS78 Livermore Software Technology Impact of Blocking Strategies for Sparse Direct Corporation Solvers on Top of Generic Runtimes [email protected] Among the preprocessing steps of a sparse direct solver, Cl´ement Weisbecker reordering and block symbolic factorization are two ma- LSTC jor steps to reach a suitable granularity for Blas ker- [email protected] nels efficiency and runtime management. In this talk, we present a reordering strategy to increase off-diagonal block sizes. It enhances Blas kernels and allows to handle larger MS78 tasks, reducing runtime overhead. Finally, we will com- Scheduling Sparse Symmetric Fan-Both Cholesky ment the resulting gain in the PaStiX solver implemented Factorization over StarPU and Parsec.

In this work, we study various scheduling approaches for Gregoire Pichon Sparse Cholesky factorization. We review different algo- INRIA rithms based on a distributed memory implementation of [email protected] the Fan-Both algorithm. We also study different commu- nication mechanisms and dynamic scheduling techniques. Mathieu Faverge IPB Bordeaux - Labri - Inria Mathias Jacquelin [email protected] Lawrence Berkeley National Lab [email protected] Pierre Ramet Bordeaux University - INRIA - LaBRI Esmond G. Ng, Yili Zheng Bordeaux University Lawrence Berkeley National Laboratory 701868 [email protected], [email protected] Jean Roman Katherine Yelick INRIA Lawrence Berkeley National Laboratory [email protected] University of California at Berkeley [email protected] MS79 Silent Data Corruption in BLAS Kernels MS78 Optimizing Memory Usage When Scheduling Tree- We summarize different types of fault tolerant parallel al- Shaped Task Graphs of Sparse Multifrontal Solvers gorithms for selected BLAS kernels and compare their be- havior under the influence of silent data corruption (bit- We study the problem of scheduling task trees with large flips). In some scenarios, randomized gossip-based algo- data, which arise in the multifrontal method of sparse ma- rithms tend to exhibit a much higher degree of resilience trix factorization. Under limited memory, changing the than deterministic methods. Nevertheless, existing gossip- traversal of the tree impacts the needed amount of mem- based algorithms cannot handle all bit-flips in their data ory. Optimal algorithms exist in the sequential case. On structures, either. We present novel algorithms, both de- parallel systems, the objective is to minimize the processing terministic as well as randomized, with improved resilience time with bounded memory, which is much more complex. against silent data corruption. Heuristics from the literature may be improved to better use the available memory. Wilfried N. Gansterer Department of Computer Science Guillaume Aupy University of Vienna ENS Lyon [email protected] [email protected]

Lionel Eyraud-Dubois MS79 INRIA Fault Tolerant Sparse Iterative Solvers University of Bordeaux [email protected] Silent data corruption is expected to become an increas- ing problem as total data set and memory sizes grow in Loris Marchal future supercomputers. Application-based fault tolerance CNRS (ABFT) techniques are one promising approach to mit- France igating this problem. In this presentation we will show [email protected] some promising early results from our work to design a 134 PP16 Abstracts

fault-tolerant sparse iterative solver based on the Conju- Fatih Yetkin gate Gradient (CG) method. Inria [email protected] Simon McIntosh-Smith University of Bristol [email protected] MS80 Code Verification in Cfd

MS79 Abstract not available Algorithm-Based Fault Tolerance of the Sakurai- Sugiura Method Daniel Chauveheid CEA DAM DIF The Sakurai-Sugiura (SS) method is a contour integral Bruy`eres-le-Chtel , 91297 Arpajon Cedex, France eigensolver, which have been developed and carefully ana- [email protected] lyzed in the last decade. The SS method computes eigen- subspace by solutions of independent linear systems and this characteristic leads coarse-grained parallel computa- MS80 tion. In this presentation we show the algorithm-based Performance Study of Industrial Hydrocodes: fault tolerance of the method and verify the error resilience Modeling, Analysis and Aftermaths on Innovative by several numerical experiments. Computational Methods

Tetsuya Sakurai,AkiraImakura This talk is the high-performance numerics companion Department of Computer Science talk given by Mathieu Peybernes in this mini-symposium. University of Tsukuba From conclusions on the performance of a legacy staggered [email protected], [email protected] Lagrange+remap hydrocode, we set a list of requirements for a new numerical solver. It has been possible to de- Yasunori Futamura rive a second-order Eulerian multimaterial finite volume University of Tsukuba scheme with collocated variables and geometrical elements [email protected] defined only from the reference (Eulerian) grid. The result- ing scheme shows both SIMD property and strong scala- bility. MS79 Florian De Vuyst Soft Errors in PCG: Detection and Correction Ecole Normale Sup´erieure Cachan [email protected] Soft errors that are not detected by hardware mechanisms may affect convergence of iterative methods. In this pre- sentation, we discuss the possibility of a self-correction MS80 technique for classical Preconditioned Conjugate Gradient Checking the Floating Point Accuracy of Scientific (PCG) algorithm in case of soft error in the SpMV cal- Code Through the Monte Carlo Arithmetic culation. To be able to correct the convergence behavior, the soft errors should be detected immediately. Checksum Exascale supercomputers will provide the computational based detection methods are good candidates for it. But power required to perform very large scale and high reso- their defect, arising from round-off effects, decreases their lution digital simulations. One of the main exascale chal- reliability. Hence we combine this technique with the well- lenges is to control their numerical uncertainties. Indeed, known results on deviation between the true and computed the floating-point arithmetic is only an approximation of residuals in finite precision calculation. The combination exact arithmetic The aim of this talk is to present a new of these two approaches gives us an opportunity to de- tool called which can be used for the automatic assessment tect the soft error before the algorithm get stuck in an of the numerical accuracy of large scale digital simulations unrecoverable state. After detection, the convergence be- by using the Monte-Carlo Arithmetic. havior can be corrected by applying residual replacement approach. The discussion is illustrated with several fault Christophe Denis injection scenarios on different matrices. CMLA - ENS Cachan - University Paris Saclay UMR 8536 Emmanuel Agullo [email protected] INRIA [email protected] MS80 Siegfried Cools Code Modernization by Mean of Proto- Antwerp University Applications [email protected] With the shift in HPC system paradigm, applications have Luc Giraud to be re-designed to efficiently address large numbers of Inria manycores. It requires deep code modification that cannot Joint Inria-CERFACS lab on HPC be practicaly applied at full-application scale. We pro- [email protected] pose proto-applications as proxy for code-modernization. Our first example features unstructured mesh computa- Wim Vanroose tion using task-based parallization. The second demon- Antwerp University strates load-balancing for combustion simulation. Both [email protected] proto-applications are open-source and can serve to the PP16 Abstracts 135

developement of genuine HPC application. Todd Gamblin Lawrence Livermore National Laboratory Eric Petit [email protected] PRiSM - UVSQ 45 avenue des Etats-Unis, 78035 Versailles cedex [email protected] MS81 Future Architecture Influenced Algorithm Changes and Portability Evaluation Using BLAST MS81 On the Performance and Productivity of Abstract BLAST is an arbitrary high-order finite element code for C++ Programming Models? shock hydrodynamics, originally developed for CPUs using MPI. In this talk, we describe the algorithmic and struc- Porting production-scale scientific applications to emerging tural changes we made to BLAST to efficiently support next-generation computing platforms is a significant chal- multiple emerging architectures, in particular GPU and lenge. Application developers must not only ensure strong Xeon Phi. We analyze the key differences between our performance, they must also ensure code is portable across CPU, GPU, and Xeon Phi versions and our efforts to com- the wide diversity of architectures available. In order to bine them into a single performance portable implementa- tackle the significant scale of these codes the community tion to simplify future maintenance. must also take into account developer productivity. In this talk we present results from a recent DOE study which Christopher Earl evaluated directive solutions against our Kokkos C++ pro- Lawrence Livermore National Laboratory gramming model. [email protected]

Simon D. Hammond Barna Bihari Scalable Computer Architectures Lawrence Livermore National Lab Sandia National Laboratories [email protected] [email protected] Veselin Dobrev, Ian Karlin Christian Trott, H. Carter Edwards, Jeanine Cook, Lawrence Livermore National Laboratory Dennis Dinge [email protected], [email protected] Sandia National Laboratories [email protected], [email protected], Tzanio V. Kolev [email protected], [email protected] Center for Applied Scientific Computing Lawrence Livermore National Laboratory Mike Glass [email protected] Sandia National Labs [email protected] Robert Rieben, Vladimir Tomov Lawrence Livermore National Laboratory Robert J. Hoekstra, Paul Lin, Mahesh Rajan, Courtenay [email protected], [email protected] T. Vaughan Sandia National Laboratories [email protected], [email protected], mra- MS81 [email protected], [email protected] Kokkos and Legion Implementations of the SNAP Proxy Application

MS81 SNAP is a proxy application which simulates the compu- tational motion of a neutral particle transport code. In Demonstrating Advances in Proxy Applications this work, we have adapted parts of SNAP separately; we Through Performance Gains and/or Performance have re-implemented the iterative shell of SNAP in the Portable Abstractions: CoMD and Kripke with task-model runtime Legion, showing an improvement to RAJA the original schedule, and we have created multiple Kokkos implementations of the computational kernel of SNAP, dis- RAJA is a programming model and C++ implementation playing similar performance to the native Fortran. thereof which exposes fine-grained parallelism in both loop and task-based fashions. In this talk, we present successful Geoff Womeldorff,BenBergen,JoshuaPayne results from a recent study wherein we applied RAJA to Los Alamos National Laboratory enhance the performance of two codes outside of its original [email protected], [email protected], [email protected] design space.

Richard Hornung, Olga Pearce MS82 Lawrence Livermore National Lab Recent Developments on Forests of Linear Trees [email protected], [email protected] Forest-of-octrees AMR is an approach that offers both ge- Adam Kunen, Jeff Keasler, Holger Jones ometric flexibility and parallel scalability and is being used Lawrence Livermore National Laboratory in various finite element codes and libraries. The core [email protected], [email protected], [email protected] functionality is defined by refinement, size balance, the as- sembly of ghost layers, and a parallel topology iteration. Rob Neely More general applications, such as semi-Lagrangian and Lawrence Livermore National Lab patch-based methods, require additional AMR functionali- [email protected] ties. Recently, a tentative tetrahedral variant has emerged. 136 PP16 Abstracts

This talk will summarize the latest theory and applications. standard time step restriction incurred by capillary forces. The solver is validated numerically in two and three spatial Carsten Burstedde, Johannes Holke dimensions and challenging numerical examples are pre- Universit¨at Bonn sented to illustrate its capabilities. [email protected], [email protected] Maxime Theillard Mechanical and Aerospace Engineering MS82 UCSD Parallel Level-Set Algorithms on Tree Grids [email protected]

We present results from parallelizing the semi-Lagrangian Frederic Gibou and Reinitialization algorithms for the level-set method on UCSB adaptive Quadtree (2D) and Octree (3D) grids. These [email protected] algorithms are directly implemented on top of the p4est library which implements high quality parallel refine- ment/coarsening and partitioning operations. Initial scal- David Saintillan ing results indicate good scalability up to 4096 cores, our UCSD current allocation limit on the Stampede supercomputer. [email protected] Some applications of our algorithms to modeling the so- lidification phenomenon and crystal growth are discussed. Arthur Guittet UCSB [email protected] Mohammad Mirzadeh Dept. Mechanical Engineering UCSB MS83 [email protected] Optimal Domain Decomposition Strategies for Par- allel Geometric Multigrid Arthur Guittet UCSB A high level cache-directed analysis is used to derive a flexi- [email protected] ble cache-oblivious heuristic that prunes the process topol- ogy search space to obtain families of high-performing do- Carsten Burstedde main decompositions for structured 3-D domains. This Universit¨at Bonn concept is used to minimize cache-misses in smoothing [email protected] phases and is combined with process-to-hardware mapping for a subset of active processes at the coarsest correction phase. Our conclusion discusses the impact on domain de- Frederic G. Gibou composition strategies resulting from current, and future, UC Santa Barbara multi-core architectures. [email protected] Gaurav Saxena University of Leeds MS82 [email protected] Exploiting Octree Meshes with a High Order Dis- continuous Galerkin Method Peter K. Jimack, Mark A. Walkley High-order discontinuous Galerkin methods are especially School of Computing attractive for parallel computation due to their high local- University of Leeds ity. However, the sheer size of modern computing systems [email protected], [email protected] requires also the scalable treatment of mesh information. This need drives the design of our Octree mesh implemen- MS83 tation TreElM. We will present the concept of Ateles, our modal discontinuous Galerkin solver based on TreElM. In Design and Implementation of Adaptive Spmv Li- this context, the focus of our presentation will be the high- brary for Multicore and Manycore Architecture order representation of geometries. Sparse matrix vector multiplication (SpMV) is an impor- Sabine P. Roller, Harald Klimach, Jens Zudrop tant kernel in both traditional high performance comput- Universitaet Siegen ing and emerging data-intensive applications. Previous [email protected], [email protected], SpMV libraries are optimized by either application-specific [email protected] or architecture-specific approach, and they’re complicated to be used in real applications. In this work we develop an auto-tuning system (SMATER) to bridge the gap between MS82 specific optimizations and general-purpose use. SMATER A Sharp Numerical Method for the Simulation of provides programmers a unified interface based on the com- Diphasic Flows pressed sparse row (CSR) format by implicitly choosing the best format and the fastest implementation for any input We present a numerical method for modeling incompress- sparse matrix in the runtime. SMATER leverages a ma- ible non-miscible fluids, in which both the interface and chine learning model and a retargetable backend library the interfacial continuity equations are treated in a sharp to quickly predict the best combination. Performance pa- manner. Our approach is implemented on adaptive Oc- rameters are extracted from 2386 matrices in UF sparse tree/Quadtree grids, which are highly effective in captur- matrix collection. The experiments show that SMATER ing different length scales. By using a modified pressure achieves the impressive performance while being portable correction projection method, we were able to alleviate the on the state-of-the-art x86 multi-core processors, NVIDIA PP16 Abstracts 137

GPU and Intel Xeon Phi accelerators. Compared with Intel tectural features such as pipeline parallelism and sliding MKL library, SMATER runs faster by more than 3 times. windows. Our results using an Altera Stratix V FPGA We demonstrate its adaptivity in an algebraic multigrid demonstrates significantly higher power efficiency in com- solver from hypre library and report above 20% perfor- parison to GPUs. mance improvement. Naoya Maruyama Guangming Tan RIKEN Advanced Institute for Computational Science Institute of Computing Technology [email protected] Chinese Academy of Sciences [email protected] Hamid Zohouri Tokyo Institute of Technology [email protected] MS83 Towards Efficient Data Communications for Frame- Aaron Smith works on Post-petascale Numa-Multicore Systems Microsoft Research [email protected] Performance of halo-exchange communications is essen- tial to the scalability of parallel programming frameworks. Motohiko Matsuda In this talk, we present our effort on optimizing halo- RIKEN Advanced Institute for Computational Science exchange communications for the JASMIN framework on [email protected] post-peta scale NUMA multicore systems. We propose a parallelism-exposing communication abstraction, design a scalable scheduling algorithm, and utilize widely avail- Satoshi Matsuoka able hardware features. Our effort improves typical halo- Tokyo Insitute of Technology exchange communication performance by up to 78%. This [email protected] in turn improves the overall performance of JEMS-FDTD application by 48%. MS84 Zhang Yang Architectural Possibilities of Fpga-Based High- Institute of Applied Physics and Computational Performance Custom Computing Mathematics China Academy of Engineering Physics Recent advancement of FPGA technology provides various yang [email protected] possibilities to high-performance custom computing archi- tectures. As a solution to higher scalability and efficiency than the present massively-parallel HPC systems, we focus Aiqing Zhang on data-centric computing architectures which is different Institute of Applied Physics and Computational from conventional control-centric ones. This talk intro- Mathematics duces their microarchitecture to system level approaches zhang [email protected] for latency-tolerant and decentralized computing, which include bandwidth compression techniques and an overlay MS84 architecture for system-wide data-flow computing. Novo-G#: Strong Scaling Using Fpga-Centric Kentaro Sano Clusters with Direct and Programmable Intercon- Tohoku University nects [email protected] FPGA-centric clusters with direct and programmable in- terconnects address fundamental limits on HPC perfor- MS84 mance by maximizing computational density, minimizing Are Datacenters Ready for Reconfigurable Logic? power density, removing the bottleneck between process- ing and communication, and facilitating intelligent and At Microsoft we have designed and deployed a composable, application-aware processing in the network itself. For ex- reconfigurable fabric to accelerate portions of large-scale ample, knowing the routing pattern a priori often enables software services beyond what commodity server designs congestion-free communication. We describe our 100-node provide. This talk will describe a prototype deployment of publically available cluster, Novo-G#, its software and IP this fabric on a bed of 1,632 servers, and discuss its efficacy infrastructure, and strong scaling of communication-bound in accelerating the Bing web search engine. applications such as long time-scale Molecular Dynamics. Aaron Smith Martin Herbordt Microsoft Research Boston University [email protected] [email protected]

MS85 MS84 Parallel Performance of the Inverse FMM and Ap- Benchmarking FPGAs with Opencl plication to Mesh Deformation for Fluid-Structure Problems We evaluate the performance of OpenCL benchmarks, in- cluding the Rodinia benchmark suite, on FPGAs using Al- In this talk, we present the inverse fast multipole method tera’s OpenCL SDK. We first evaluate the performance of as an approximate fast direct solver for dense linear sys- the original, thread-parallelized OpenCL kernels targeted tems that performs extremely well as a preconditioner in for GPUs on FPGAs. We study how those kernels can be an iterative solver. The computational cost scales linearly significantly optimized by exploiting FPGA-specific archi- with the problem size. The method uses low-rank approx- 138 PP16 Abstracts

imations to represent well-separated interactions; this is Bordeaux University - INRIA - LaBRI done in a multi-level fashion. Its parallel performance is Bordeaux University assessed, and applications related to mesh deformation us- 701868 ing radial basis function interpolation are discussed. Jean Roman Pieter Coulier INRIA Stanford [email protected] [email protected]

Chao Chen MS85 Stanford University Algebraic and Geometric Low-Rank Approxima- [email protected] tions

Eric F. Darve We perform a direct comparison between a geometric Stanford University (FMM) and algebraic (HSS) matrix-vector multiplication Mechanical Engineering Department of compressed matrices for Laplace, Helmholtz, and Stokes [email protected] kernels. Both methods are implemented with distributed memory, thread, and SIMD parallelism, and the runs are performed on large Cray systems such as Edison (XC30, MS85 133,824 cores) at NERSC and Shaheen2 (XC40, 197,568 Fast Updating of Skeletonization-Based Hierarchi- cores) at KAUST. We compare the absolute execution time cal Factorizations of both codes, along with their parallel scalability and memory consumption. We present a method for updating certain hierarchical fac- torizations for solving linear systems arising from the dis- Rio Yokota cretization of physical problems (e.g., integral equations or Tokyo Institute of Technology PDEs) in 2D. Given a factorization, we can locally perturb [email protected] the geometry or coefficients and update the factors to re- flect this change with complexity poly-logarithmic in the Fran¸cois-Henry Rouet, Xiaoye S. Li total number of unknowns. We apply our method to the Lawrence Berkeley National Laboratory recursive skeletonization factorization and hierarchical in- [email protected], [email protected] terpolative factorization and demonstrate results on some examples. David E. Keyes KAUST Victor Minden [email protected] Stanford [email protected] MS86 Anil Damle, Ken Ho, Lexing Ying Architecture-Agnostic Numerics in GridTools Stanford University [email protected], [email protected], lex- The rapid evolution of the HPC hardrware has produced [email protected] major changes in programming techniques and algorithms employed to solve several classical problems, such as the compressible Euler equations. Overcoming the lag between MS85 numerical schemes and their efficient implementations on Exploiting H-Matrices in Sparse Direct Solvers specific hardware requires cross-disciplinary competencies and is often challenging. We propose a set of libraries In this talk, we describe a preliminary fast direct solver us- called GridTools which will provide domain scientists with ing HODLR library to compress large blocks appearing in techniques to specify, discretize and solve (explicitly) PDE the symbolic structure of the PaStiX sparse direct solver. problems taking full advantage of modern hardware archi- We will present our general strategy before analyzing the tectures. We present GridTools and several prototype PDE practical gains in terms of memory and floating point op- solvers. erations with respect to a theoretical study of the problem. Finally, we will discuss ways to enhance the overall perfor- Paolo Crosetto mance of the solver. Ecole Politechnique Federale de Lausanne paolo.crosetto@epfl.ch Gregoire Pichon INRIA [email protected] MS86 Generating High Performance Finite Element Ker- Eric F. Darve nels Using Optimality Criteria. Stanford University We tackle the problem of automatically generating opti- Mechanical Engineering Department mal finite element integration routines given a high level [email protected] specification of arbitrary multilinear forms. Optimality is defined in terms of floating point operations given a mem- Mathieu Faverge ory bound. We show an approach to explore the trans- IPB Bordeaux - Labri - Inria formation space of loops and expressions typical of inte- [email protected] gration routines and discuss the conditions for optimality. Extensive experimentation validates the approach, show- Pierre Ramet ing systematic performance improvements over a number PP16 Abstracts 139

of alternative code generation systems. [email protected]

Fabio Luporini Department of Computing MS87 Imperial College London From System-Level Checkpointing to User-Level [email protected] Failure Mitigation in MPI-3

David Ham In this presentation, I will talk about transparent, system- Department of Mathematics and Department of level fault tolerance and present some (old) results that Computing show the overhead of such mechanisms. Then I will talk Imperial College London about how the user can handle fault tolerance directly [email protected] within the parallel application, using the (old, deprecated) interface FT-MPI and using a current effort made by a group of the MPI forum in the MPI-3 standard. Finally, Paul Kelly I will present how the middleware can handle failures and Imperial College London resist beyond them. [email protected] Camille Coti Universit´e Paris 13 MS86 [email protected] Source-to-Source Translation for Code Validation and Acceleration of the Icon Atmospheric General Circulation Model MS87 Reinit: A Simple and Scalable Fault-Tolerance The ICON atmospheric model port to accelerators with Model for MPI Applications OpenACC directives illustrates that these are easily in- serted and can yield performance similar to CUDA pro- In this talk, we present a global-exception programing gramming, but introduce subtelties which are difficult to model to tolerate failures in MPI applications. The model optimize and debug. Thus we propose a source-to-source allows a clean way to reinitialize MPI state after failures translator to generate accelerator directives based on an and a simple mechanism to clean up application and li- abstract description of the accelerator data and kernels, braries state via a stack of error handlers. We present an as well as validation code for intermediate results. In this implementation of the model in MVAPICH (a widely used presentation we report our findings and outline possible MPI library) and Slurm (an open-source workload man- extensions. ager), and show an initial performance evaluation in HPC benchmarks. William Sawyer Swiss Centre of Scientific Computing (CSCS) Ignacio Laguna Manno, Switzerland Lawrence Livermore National Laboratory [email protected] [email protected]

MS86 MS87 Kernel Optimization for High-Order Dynamic FA-MPI: Fault-Aware and QoS-Oriented Scalable Rupture Simulation - Flexibility Vs. Performance Message Passing A Fault-Aware Approach for non-blocking MPI-3.x APIs SeisSol simulates seismic wave propagation and dynamic is presented (FA-MPI), enabled by weak collective trans- rupture. Its innermost kernels consist of small matrix mul- actions This novel model localizes faults, provides tunable tiplications (SMMs) where all sparsity patterns are known fault-free overhead, allows for multiple kinds of faults, en- a-priori. The SMMs are generated in an offline process and ables hierarchical recovery, and is data-parallel relevant. make heavy use of vectorization in order to maximize per- Fault modeling of underlying networks is mentioned as are formance. We present our recent efforts to further optimize proposed additions to MPI-4 fully to support resilient par- and automate the code generator. Particularly, we deter- allel programs with basic QoS. Performance and scalability mine the optimal matrix multiplication order, automati- results of prototype FA-MPI are given, keyed to compact cally exploit zero blocks, and enable block sparse SMMs. applications (e.g., MiniFE, LULESH).

Carsten Uphoff Anthony Skjellum TU Munich Auburn University uphoff@in.tum.de [email protected] Michael Bader Technische Universit¨at M¨unchen MS88 [email protected] Topology-Aware Performance Modeling of Parallel SpMVs

MS87 Parallel SpMVs are a key component of many linear solvers. Bosilca ¡TBA¿ Increasingly large HPC systems yield costly communica- tion; specifically large messages sent long distances. A Abstract not available performance model can associate an approximate cost with the various messages communicated during each SpMV. George Bosilca The common alpha-beta model can be improved through University of Tennessee - Knoxville the use of topology-aware methods, allowing for a distance 140 PP16 Abstracts

parameter to be added to the cost of each message, as well 2 rue de paris, 92190 Meudon, France as a rough prediction of network contention. [email protected]

Amanda Bienz University of Illinois at Urbana-Champaign PP1 [email protected] Improving Scalability of 2D Sparse Matrix Parti- tioning Models Via Reducing Latency Cost Luke Olson Department of Computer Science Two-dimensional (2D) sparse matrix partitioning models University of Illinois at Urbana-Champaign offer better opportunities in reducing multiple communica- [email protected] tion cost metrics compared to one-dimensional partitioning models. This work makes use of a two-phase partitioning William D. Gropp methodology to address three important communication University of Illinois at Urbana-Champaign cost metrics: (i) total message volume, (ii) total message [email protected] count, and (iii) maximum message volume. The first phase only considers the message volume metric, while the sec- ond phase considers the other two metrics. We show that MS88 the scalability of the existing 2D models can significantly Cere: Llvm Based Codelet Extractor and Replayer be improved with our methodology. for Piecewise Benchmarking and Optimization Cevdet Aykanat,OguzSelvitopi Codelet Extractor and REplayer (CERE) is an LLVM- Bilkent University based framework that finds and extracts hotspots from an [email protected], [email protected] application as isolated fragments of code. Codelets can be modified, compiled, run, and measured independently from PP1 the original application. Through performance signature clustering, CERE extracts a minimal but representative Hybrid Parallel Simulation of Helicopter Rotor Dy- codelet set from applications, which can significantly re- namics duce the cost of benchmarking and iterative optimization. We present a software for the hybrid parallel calculation Codelets have proved successful in autotuning target archi- of rotor forces and control angles using a coupled simula- tecture, compiler optimization or amount of parallelism. tion of rotor dynamics and rotor wake. A big part of the Pablo De Oliveira Castro computation time is required to simulate the airflow in the PRiSM & ECR - UVSQ wake using a vortex method. For node-level paralleliza- 45 avenue des Etats-Unis, 78035 Versailles cedex tion, we use OpenACC on GPUs and OpenMP on CPUs, [email protected] and MPI for parallelization between compute nodes. We focus on the coupling scheme and discuss the performance obtained. MS88 Melven Roehrig-Zoellner, Achim Basermann Performance Modeling of a Compressible Hydro- German Aerospace Center (DLR) dynamics Solver on Multicore Cpus Simulation and Software Technology [email protected], Performance study of a reference Lagrange-Remap algo- [email protected] rithm for compressible gas dynamics obtained using the Roofline model and the Execution Cache Memory (ECM) model has been performed. As a result, we are able to Johannes Hofmann predict the whole performance on modern processors with German Aerospace Center (DLR) a mean error of 6.5%, which is actually very accurate in Institute of Flight Systems this context. The results of these analyses also gave us [email protected] highlights for designing a new full hydrodynamics solver (Lagrange+flux) with expected better performance. PP1 Mathieu Peybernes Poisson Solver for large-scale Simulations with one CEA Saclay, 91191 Gif-sur-Yvette cedex, France Fourier Diagonalizable Direction [email protected] We present a Poisson solver for problems with one Fourier diagonalizable direction, such as airfoils. The diagonaliza- MS88 tion decomposes the original 3D-system into a set of un- Using Uncertainties in High Level Performance coupled 2D-subsystems. Our method focuses on optimizing Prediction memory allocations and transactions by considering redun- dancies on such subsystems. It also takes advantage of the For rapid evaluation of future hardware, we are using high uniformity through the periodic direction for its vectoriza- level methodology based on a first approximation of the tion, and automatically manages the workload distribution application behavior, and on several hardware specifica- and the preconditioner choice. Altogether constitutes a tions (frequency, efficiency, bandwidth, nb of cores, etc.. highly efficient and scalable Poisson solver. ) that are usually known only a few months before the official launch. The introduction of uncertainties on these Ricard Borrell parameters is then allowing us to get some tolerance on the Universitat Polit´ecnica de Catalunya (Barcelona Tech) prediction results. [email protected]

Philippe Thierry Guillermo Oyarzun, Jorje Chiva, Oriol Lehmkuhl, Ivette PP16 Abstracts 141

Rodr´ıguez, Assensi Oliva [email protected], [email protected] Universitat Polit`ecnica de Catalunya (BarcelonaTech) [email protected], [email protected], [email protected], [email protected], PP1 [email protected] Parallel Programming Using Hpcmatlab

HPCmatlab is a new, flexible and scalable parallel pro- PP1 gramming framework for Matlab/Octave applications to High Performance Algorithms and Software for scale on distributed memory cluster architectures using Seismic Simulations the Message Passing Interface (MPI). This includes point- to-point, collective, one sided communication and paral- The understanding of earthquake dynamics is greatly sup- lel I/O. It supports shared memory programming within ported by highly resolved coupled simulations of the rup- a node using Pthreads. We present scientific application ture process and seismic wave propagation. This grand performance results showing the ease of programming for challenge of seismic modeling requires an immense amount fast prototyping or for large scale numerical simulations. of supercomputing resources, thus optimal utilization by software is imperative. Driven by recent hardware devel- Mukul Dave opments the increasing demand for parallelism and data ASU Research Computing locality often requires replacing major software parts to Arizona State University bring efficient numerics and machine utilization closely to- [email protected] gether. In this poster we present a new computational core for the seismic simulation package SeisSol. For mini- Xinchen Guo mal time-to-solution the new core is designed to maximize Purdue University value and throughput of the floating point operations per- [email protected] formed in the underlying ADER discontinuous Galerkin discretization method. Included are auto-tuned sparse and Mohamed Sayeed dense matrix kernels, hybrid parallelization from many- ASU Research Computing core nodes up to machine-size and a novel high performance Arizona State University clustered local time stepping scheme. The presented com- [email protected] putational core reduces time-to-solution of SeisSol by sev- eral factors and scales beyond 1 million cores. At machine- size the new core enabled a landmark simulation of the 1992 PP1 Landers earthquake. For the first time this simulation al- Using the Verificarlo Tool to Assess the Floating lowed the analysis of the complex rupture behavior result- Point Accuracy of Large Scale Digital Simulations ing from the non-linear interaction of frictional sliding and seismic wave propagation at high geometric complexity. Exascale supercomputers will provide the computational power required to perform very large scale and high reso- Alexander Breuer lution digital simulations. One of the main exascale chal- University of California, San Diego lenges is to control their numerical uncertainties. Indeed, [email protected] the floating-point arithmetic is only an approximation of exact arithmetic. This poster will present a new tool called Alexander Heinecke verificarlo which can be used for the automatic assessment Parallel Computing Laboratory of the numerical accuracy of large scale digital simulations Intel Corporation, Santa Clara, CA, USA by using the Monte-Carlo Arithmetic. [email protected] Christophe Denis Michael Bader CMLA - ENS Cachan - University Paris Saclay Technische Universit¨at M¨unchen UMR 8536 [email protected] [email protected]

Pablo De Oliveira Castro PP1 PRiSM & ECR - UVSQ Large Scale Parallel Computations in R Through 45 avenue des Etats-Unis, 78035 Versailles cedex Elemental [email protected]

R-Elemental is a package for the statistical software R Eric Petit that enables distributed computations through the parallel PRISM - UVSQ dense linear algebra library Elemental. R-Elemental offers [email protected] to the users a straightforward transition from single node to compute clusters, preserving native R syntax and methods; at the same time, it delivers Elementals full performance PP1 without any overhead. We discuss the interface design and A Parallel Fast Sweeping Method on Nonuniform present several examples common in statistical analysis. Grids

Rodrigo Canales We present a novel parallel fast sweeping method on RWTH Aachen University nonuniform grids. By constructing a graph of the dis- [email protected] cretization, we are able to determine the causally inde- pendent updates. Each of these independent steps is then Elmar Peise, Paolo Bientinesi computed in parallel. We demonstrate the method on tree AICES, RWTH Aachen based grids and triangulated surfaces. This talk will briefly 142 PP16 Abstracts

describe the algorithm, show an example application, and [email protected] present parallel scaling results.

Miles L. Detrixhe PP1 University of California Santa Barbara An Impact of Tuning the Kernel of the Structured [email protected] QR Factorization in the TSQR

In the TSQR algorithm, the so-called structured QR factor- PP1 ization is required for reducing two upper triangular ma- Chase: A Chebyshev Accellerated Subspace Itera- trix into one and repeated depending on the number of tion Eigensolver processes in parallel computing. Tuning this part is thus important but not trivial because the input matrix is usu- ally small and has the upper triangular structure. In this ChASEisaniterativeeigensolverdesignedwitharather poster, we compare several implementations and investi- simple algorithmic structure which rendered easy to paral- gate their impact on the overall performance of the TSQR lelize its kernels over virtually all existing range of multi- algorithm. and many-cores. Such a feature makes ChASE virtually one of the most parallel efficient algorithm for sequences Takeshi Fukaya of correlated eigenvalue problems. In particular, ChASE Hokkaido University, RIKEN AICS scales almost optimally over a range of processors com- [email protected] mensurate to the problem complexity, and it is portable and adaptable to many flavors of parallel architectures. Toshiyuki Imamura RIKEN Advance Institute for Computational Science Edoardo A. Di Napoli [email protected] Juelich Supercomputing Centre [email protected] PP1 Mario Berljafa School of Mathematics Filters of the Improvement of Multiscale Data from The University of Manchester Atomistic Simulations [email protected] In multiscale models combining continuum approximations with first-principles based atomistic representations the Paul Springer dominant computational cost is typically the atomistic AICES component of the model. We demonstrate the effectiveness RWTH Aachen University and scalability of a spectral filter for improving the accu- [email protected] racy of noisy data obtained from atomistic simulations. Fil- tering enables running less expensive atomistic simulations Jan Winkelmann to reach the same overall level of accuracy thus lowering AICES the primary cost of the multiscale model and leading to RWTH Aachen faster simulations. [email protected] David J. Gardner Lawrence Livermore National Laboratory PP1 [email protected] Evaluation of the Intel Xeon Phi and Nvidia K80 As Accelerators for Two-Dimensional Panel Codes Daniel R. Reynolds Southern Methodist University To predict the properties of fluid flow over a solid geome- Mathematics try is an important engineering problem. In many applica- [email protected] tions so-called panel methods (or boundary element meth- ods) have become the standard approach to solve the corre- sponding partial differential equation. Since panel methods PP1 in two dimensions are computationally cheap, they are well Parallel Multigrid and FFT-Type Algorithms suited as the inner solver in an optimization algorithm. In for High Order Compact Discretization of the this paper we evaluate the performance of the Intel Xeon Helmholtz Equation. Phi 7120 and the NVIDIA K80 to accelerate such an op- timization algorithm. For that purpose, we have imple- In this paper, we analyze parallel implementation of multi- mented an optimized version of the algorithm on the CPU grid and FFT-type methods for the solution of a subsurface and Xeon Phi (based on OpenMP, vectorization, and the scattering problem. The 3D Helmholtz equation is dis- Intel MKL library) and on the GPU (based on CUDA and cretized by a sixth order compact scheme. The resulting the MAGMA library). We present timing results for all systems of finite-difference equations are solved by differ- codes and discuss the similarities and differences between ent preconditioned Krylov subspace-based methods. The the three implementations. Overall, we observe a speedup lower order multigrid and FFT-based preconditioners are of approximately 2.5 for adding a Intel Xeon Phi 7120 to a considered for the efficient implementation of the devel- dual socket workstation and a speedup between 3.4 and 3.8 oped iterative approach. The complexity and scalability of for adding a NVIDIA K80 to a dual socket workstation. two parallel preconditioning strategies are analyzed.

Lukas Einkemmer Yury A. Gryazin University of Innsbruck Idaho State University PP16 Abstracts 143

[email protected] [email protected]

PP1 PP1 A Reconstruction Algorithm for Dual Energy Com- Computational Modeling and Aerodynamic Anal- puted Tomography and Its Parallelization ysis of Simplified Automobile Body Using Open Source Cfd (OpenFoam) This paper provides a mathematical algorithm to recon- struct the Compton scatter and photo-electronic coeffi- Open source softwares have emerged as a cross-disciplinary cients using the dual-energy CT system. The proposed paradigm for a large spectrum of computing applications imaging method is based on the mean value theorem to including computer aided engineering (CAE). Open Field handle the non-linear integration coming from the poly- Operation and Manipulation (OpenFOAM) is an open chromatic energy based CT scan system. We show a nu- source computational software capable of solving variety merical simulation result for the validation of the proposed of complex fluid flows. Present research work is focused on algorithm. In order to reduce the computational time, we investigating the ability of OpenFOAM to simulate exter- use MPI4Py to adopt the parallel processing. We obtain nal flow around a simplified automobile body. The results 34 time faster result with 64 cpu cores system. are compared with experimental values of commercial CFD packages. These results allow us to gauge the robustness, Kiwan Jeon and adaptability of the OpenFOAM utility. National Institute for Mathematical Sciences [email protected] Javeria Hashmi National University of Science and technology (NUST) Sungwhan Kim National University of Sciences and Technology Islamabad Hanbat National University [email protected] [email protected]

Chi Young Ahn, Taeyoung Ha PP1 National Institute for Mathematical Sciences [email protected], [email protected] Performance Evaluation of a Numerical Quadra- ture Based Eigensolver Using a Real Pole Filter PP1 Numerical quadrature based parallel sparse eigensolvers An Interior-Point Stochastic Approximation such as the Sakurai-Sugiura method have been developed Method on Massively Parallel Architectures in the last decade. Recently, an approach using a numeri- cal quadrature which forms a real pole filter was proposed. The stochastic approximation method is behind the solu- This approach potentially reduces the computational and tion of many actively-studied problems in PDE-constrained memory cost for solving real symmetric problems. In this optimization. Despite its far-reaching applications, there is presentation, we evaluate the performance of the real pole almost no work on applying stochastic approximation and filter method by comparing it with the often-used complex interior-point optimization, although IPM are particularly pole filter method using practical large-scale problems. efficient in large-scale nonlinear optimization due to their attractive worst-case complexity. We present a massively Yuto Inoue parallel stochastic IPM method and apply it to stochastic Department of Computer Science PDE problems such as boundary control and optimization University of Tsukuba of complex electric power grid systems under uncertainty. [email protected] Juraj Kardo Yasunori Futamura, Tetsuya Sakurai Universit`a della Svizzera italiana University of Tsukuba Institute of Computational Science [email protected], [email protected] [email protected]

Drosos Kourounis PP1 USI - Universita della Svizzera italiana Institute of Computational Science Synthesis and Verification of BSP Programs [email protected]

Bulk synchronous parallel (BSP) is an established model Olaf Schenk for parallel programming which simplifies program concep- USI - Universit`a della Svizzera italiana tion by avoiding deadlocks and data-races. Nonetheless, Institute of Computational Science there are still a large class of programming errors which [email protected] plague BSP programs. Our research interest is twofold: correctness by construction where BSP code is automati- cally generated with a new BSP-automata theory, and au- PP1 tomatic verification of existing C programs written with Nlafet - Parallel Numerical Linear Algebra for Fu- BSPlib using formal methods such as abstract interpreta- ture Extreme-Scale Systems tion. NLAFET, funded by EU Horizon 2020, is a direct re- Arvid Jakobsson, Thibaut Tachon sponse to the demands for new mathematical and algo- Huawei Technologies France rithmic approaches for applications on extreme scale sys- [email protected], tems as identified in the H2020-FETHPC work programme. 144 PP16 Abstracts

The aim is to enable a radical improvement in the perfor- Gerard J. Gorman mance and scalability of a wide range of real-world appli- Imperial College London cations relying on linear algebra software, by developing [email protected] novel architecture-aware algorithms and software libraries, and the supporting runtime capabilities to achieve scalable performance and resilience on heterogeneous architectures. PP1 The focus is on a critical set of fundamental linear algebra High-Performance Tensor Algebra Package for operations including direct and iterative solvers for dense GPU and sparse linear systems of equations and eigenvalue prob- lems. Many applications, e.g., high-order FEM simulations, can be expressed through tensors. Contractions by the first in- Bo T. K˚agstr¨om dex can often be represented using a tensor index reorder- Ume˚aUniversity ing followed by a matrix-matrix product, which is a key Computing Science and HPC2N factor to achieve performance. We present ongoing work [email protected] of a high-performance package in the MAGMA library to organize tensor contractions, data storage, and batched ex- Lennart Edblom ecution of large number of small tensor contractions. We Ume˚aUniversity use automatic code generation techniques to provide an Computing Science architecture-aware, user-friendly interface. [email protected] Ian Masliah University Paris-Sud Lars Karlsson [email protected] Ume˚a University, Dept. of Computing Science [email protected] Ahmad Abdelfattah University of Tennessee, Knoxville Dongarra Jack [email protected] University of Manchester School of Mathematics [email protected] Marc Baboulin University of Paris-Sud [email protected] Iain Duff Rutherford Appleton Laboratory, Oxfordshire, UK OX11 0QX Jack J. Dongarra and CERFACS, Toulouse, France Department of Computer Science iain.duff@stfc.ac.uk The University of Tennessee [email protected] Jonathan Hogg STFC Rutherford Appleton Lab, UK Joel Falcou [email protected] University of Paris-Sud [email protected] Laura Grigori INRIA Paris - Rocquencourt Azzam Haidar Alpines Group Department of Electrical Engineering and Computer [email protected] Science University of Tennessee, Knoxville [email protected] PP1 Stanimire Tomov Seigen: Seismic Modelling Through Code Genera- University of Tennessee, Knoxville tion [email protected] In this poster we present initial performance results for Seigen, a new seismological modelling framework that PP1 utilises symbolic computation to model elastic wave prop- Multilevel Approaches for FSAI Preconditioning in agation using the discontinuous Galerkin finite element the Iterative Solution of Linear Systems of Equa- method (DG-FEM). Our study is conducted in the con- tions text of Firedrake, an automated code generation frame- work that drastically simplifies in-depth experimentation The solution of a sparse linear system of equations with with multiple problem discretisations, solution methods a large SPD matrix is a common and computationally in- and low-level performance optimisations. tensive task when considering numerical simulation tech- niques. This calculation is often done by iterative meth- Michael Lange, Christian Jacobs ods whose performance is largely influenced by the selected Department of Earth Science and Engineering preconditioner. In this work, two new preconditioners built Imperial College London upon the dynamic FSAI algorithm in combination to two [email protected], [email protected] graph reordering techniques are presented, namely BTF- SAI and DDFSAI. Fabio Luporini Department of Computing Victor A. Paludetto Magri, Andrea Franceschini, Carlo Imperial College London Janna, Massimiliano Ferronato [email protected] University of Padova PP16 Abstracts 145

[email protected], [email protected], [email protected] [email protected], [email protected] Michael Dumbser University of Trento PP1 [email protected] Large Scale Computation of Long Range Interac- tions in Molecular Dynamics Simulations Dominic E. Charrier, Tobias Weinzierl Durham University We consider a physical system of polyelectrolytes in ionic [email protected], liquids, and perform a molecular dynamics simulation to [email protected] study the dynamics of electrophoresis. A computational effort of including long range interactions among the par- ticles is made by using a large scale computer cluster. We Luciano Rezzolla use an efficient method of distributing computational loads Frankfurt Institute of Advanced Studies among the nodes, and obtain a new result that is in agree- [email protected] ment with experimental findings. Alice Gabriel, Heiner Igel Pyeong Jun Park Ludwig-Maximilians-Universit¨at M¨unchen School of Liberal Arts and Sciences [email protected], Korea National University of Transportation [email protected] [email protected]

PP1 PP1 Malleable Task-Graph Scheduling with a Practical Accelerated Subdomain Smoothing for Geophysi- Speed-Up Model cal Stokes Flow We focus on the problem of minimizing the makespan of Scalable approaches to solving linear systems arising from malleable task graphs arising from multifrontal factoriza- geophysical Stokes flow typically involve the use of multi- tions of sparse matrices. The speedup of each task is as- grid methods. Most of the computational effort is thus sumed to be perfect but bounded. We propose a new expended applying smoothing operators, and as such one heuristic which we prove to be a 2-approximation and might naturally ask if this process can be accelerated with which performs better than previous algorithms on simula- the use of increasingly-available per-node coprocessors on tions. The dataset consists of both synthetic and realistic modern large-scale computational clusters. Here we dis- graphs. cuss several alternate approaches to performing subdomain smoothing on a hybrid CPU-GPU supercomputer, in an Bertrand Simon attempt to asses the performance gains possible from the ENS Lyon use of this new technology. Chebyshev-Jacobi and modern [email protected] incomplete factorization approaches are considered. Loris Marchal Patrick Sanan CNRS USI - Universit`a della Svizzera italiana France ICS [email protected] [email protected] Oliver Sinnen University of Auckland PP1 [email protected] ExaHyPE - Towards an Exascale Hyperbolic PDE Engine Fr´ed´eric Vivien ExaHyPEs mission is to develop a novel simulation engine INRIA and ENS Lyon for hyperbolic PDE problems at exascale. With our ap- [email protected] proach - high-order discontinuous Galerkin discretization on dynamically adaptive Cartesian grids - we tackle grand- PP1 challenge simulations in Seismology and Astrophysics. We present a first petascale software release bringing together Overview of Intel Math Kernel Library (Intel some of ExaHyPE’s key ingredients: Grid generation based MKL) Solutions for Eigenvalue Problems upon spacetrees in combination with space-filling curves, a We outline both existing and prospective Intel Math Ker- novel ADER DG WENO scheme and compute kernels tai- nel Library (Intel MKL) solutions for standard and general- lored to particular hardware generations. ized Hermitian eigenvalue problems. Extended Eigensolver Angelika Schwarz functionality, based on the accelerated subspace iteration Technische Universit¨at M¨unchen FEAST algorithm, is a high-performance solution for ob- [email protected] taining all the eigenvalues and optionally all the associated eigenvectors, within a user-specified interval. To extend the available functionality, a new approach for finding the k Vasco Varduhn largest or smallest eigenvalues is proposed. We envision au- University of Minnesota tomatically producing a search interval by a preprocessing [email protected] step based on classical and recent methods that estimate eigenvalue counts. Computational results are presented. Michael Bader Technische Universit¨at M¨unchen Irina Sokolova 146 PP16 Abstracts

Novosibirsk State University grams (Batched BLAS). We focus on multiple independent Pirogova street 2, Novosibirsk, Russia, 630090 BLAS operations on small matrices that are grouped to- [email protected] gether as a single routine. We aim to provide a more ef- ficient and portable library for multi/manycore HPC sys- Peter Tang tems. We achieve 2x speedups and 3x better energy ef- Intel Corporation ficiency compared to vendor implementations. We also 2200 Mission College Blvd, Santa Clara, CA demonstrate the need for Batched BLAS and its potential [email protected] impact in multiple application domains.

Alexander Kalinkin Pedro Valero-Lara Intel Corporation University of Manchester [email protected] [email protected] Jack J. Dongarra PP1 Department of Computer Science The University of Tennessee Achieving Robustness in Domain Decomposition [email protected] Methods: Adpative Spectral Coarse Spaces

Domain decomposition methods are a popular way to solve Azzam Haidar large linear systems. For problems arising from practi- Department of Electrical Engineering and Computer cal applications it is likely that the equations will have Science highly heterogeneous coefficients. For example a tire is University of Tennessee, Knoxville made both of rubber and steel, which are two materials [email protected] with very different elastic behaviour laws. Many domain decomposition methods do not perform well in this case, Samuel Relton specially if the decomposition into subdomains does not University of Manchester, UK accommodate the coefficient variations. For three popular [email protected] domain decomposition methods (Additive Schwarz, BDD and FETI) we propose a remedy to this problem based on Stanimire Tomov local spectral decompositions. Numerical investigations for University of Tennessee, Knoxville the linear elasticity equations will confirm robustness with [email protected] respect to heterogeneous coefficients, automatic (non regu- lar) partitions into subdomains and nearly incompressible Mawussi Zounon behaviour. Univeristy of Manchester [email protected] Nicole Spillane Centre de Math´ematiques Appliqu´ees Ecole Polytechnique PP1 [email protected] Spectrum Slicing Solvers for Extreme-Scale Eigen- Victorita Dolean value Computations University of Strathclyde [email protected] We present progress on a recently developed new paral- lel eigensolver package named EVSL (Eigen-Value Slicing Library) for extracting a large number of eigenvalues and Patrice Hauret the associated eigenvectors of a symmetric matrix. EVSL Michelin is based on a divide-and-conquer approach known as spec- [email protected] trum slicing. It combines a thick-restart version of the Lanczos algorithm or a subspace iteration with deflation, Fr´ed´eric Nataf and uses a new type of polynomial filters constructed from CNRS and UPMC a least squares approximation of the Dirac delta function [email protected] to obtain interior eigenvalues. Several computational re- sults are presented emphasizing parallel performance for Clemens Pechstein matrices arising from the density functional theory based University of Linz electronic structure calculations. [email protected] Eugene Vecharynski Daniel J. Rixen Lawrence Berkeley National Laboratory TU Munich [email protected] [email protected] Ruipeng Li Robert Scheichl Lawrence Livermore National Laboratory University of Bath [email protected] [email protected] Yuanzhe Xi University of Minnesota PP1 [email protected] A Proposal for a Batched Set of Blas. Chao Yang We propose a Batched set of Basic Linear Algebra Subpro- Lawrence Berkeley National Lab PP16 Abstracts 147

[email protected] Toshiyuki Imamura RIKEN Advance Institute for Computational Science Yousef Saad [email protected] Department of Computer Science University of Minnesota Masahiko Machida [email protected] Japan Atomic Energy Agency [email protected]

PP1 Compressed Hierarchical Schur Linear System Solver for 3D Photonic Device Analysis with Gpu Accelerations

Finite-difference frequency-domain analysis of wave optics and photonics requires linear system solver for Helmholtz equation. The linear system can be ill-conditioned when computation domain is large or perfectly-matched layers are used. We propose a direct method named Compressed Hierarchical Schur for time and memory savings, with the GPU acceleration and hardware tuning. These new tech- niques can efficiently utilize modern high-performance en- vironment and greatly accelerate development of future de- velopment of photonic devices and circuits.

Cheng-Han Du Department of Mathematics, National Taiwan University [email protected]

Weichung Wang National Taiwan University Institute of Applied Mathematical Sciences [email protected]

PP1 Parallel Adaptive and Robust Multigrid

Numerical simulation has become one of the major topics in Computational Science. To promote modelling and simula- tion of complex problems new strategies are needed allow- ing for the solution of large, complex model systems. After discussing the needs of large-scale simulation we point out basic simulation strategies such as adaptivity, parallelism and multigrid solvers. These strategies are combined in the simulation system UG4 (Unstructured Grids) being pre- sented in the following. We show the performance and efficiency of this strategy in various applications.

Gabriel Wittum G-CSC University of Frankfurt, Germany [email protected]

PP1 High performance eigenvalue solver for Hubbard model on CPU-GPU hybrid platform

When we solve the smallest eigenvalue and the correspond- ing eigenvector of the Hamiltonian derived from the Hub- bard model, we can understand properties of some interest- ing phenomenon in the quantum physics. Since the Hamil- tonian is a large sparse matrix, we usually utilize an it- eration method. In this presentation, we propose tuning strategies for the LOBPCG method for the Hamiltonian on a CPU-GPU hybrid platform in consideration of the property of the model and show its performance.

Susumu Yamada Japan Atomic Energy Agency [email protected]