PP16 Abstracts
IP1
Scalability of Sparse Direct Codes

As part of the H2020 FET-HPC project NLAFET, we are studying the scalability of algorithms and software for using direct methods to solve large sparse systems of equations. We examine how techniques that work well for dense system solution, for example communication-avoiding strategies, can be adapted to the sparse case. We also study the benefits of using standard run-time systems to assist us in developing codes for extreme-scale computers. We will look at algorithms for solving both sparse symmetric indefinite systems and unsymmetric systems.

Iain Duff
Rutherford Appleton Laboratory, Oxfordshire, UK OX11 0QX
and CERFACS, Toulouse, France
iain.duff@stfc.ac.uk

IP2
Towards Solving Coupled Flow Problems at the Extreme Scale

Scalability is only a necessary prerequisite for extreme-scale computing. Beyond scalability, we propose here an architecture-aware co-design of the models, algorithms, and data structures. The benefit will be demonstrated for the Lattice Boltzmann method as an example of an explicit time-stepping scheme and for a finite element multigrid method as an implicit solver. In both cases, systems with more than ten trillion (10^13) degrees of freedom can already be solved on current peta-scale machines.

Ulrich J. Ruede
University of Erlangen-Nuremberg
Department of Computer Science (Simulation)
[email protected]

IP3
Why Future Systems should be Data Centric

We are immersed in data. Classical high performance computing often generates enormous structured data through simulation, but a vast amount of unstructured data is also being generated and is growing at exponential rates. To meet the data challenge, we expect computing to evolve in two fundamental ways: first through a focus on workflows, and second to explicitly accommodate the impact of big data and complex analytics. As an example, HPC systems should be optimized to perform well on modeling and simulation, but should also focus on other important elements of the overall workflow, which include data management and manipulation coupled with associated cognitive analytics. At a macro level, workflows will take advantage of different elements of the systems hierarchy, at dynamically varying times, in different ways and with different data distributions throughout the hierarchy. This leads us to a data-centric design point that has the flexibility to handle these data-driven demands. I will explain the attributes of a data-centric system and discuss challenges we see for future computing systems. It is highly desirable that the data-centric architecture and implementation lead to commercially viable Exascale-class systems on premise and in the cloud.

Paul Coteus
IBM Research, USA
[email protected]

IP4
Next-Generation AMR

Block-structured adaptive mesh refinement (AMR) is a powerful tool for improving the computational efficiency and reducing the memory footprint of structured-grid numerical simulations. AMR techniques have been used for over 25 years to solve increasingly complex problems. I will give an overview of what we are doing in Berkeley Lab's AMR framework, BoxLib, to address the challenges of next-generation multicore architectures and the complexity of multiscale, multiphysics problems, including new ways of thinking about multilevel algorithms and new approaches to data layout and load balancing, in situ and in transit visualization and analytics, and run-time performance modeling and control.

Ann S. Almgren
Lawrence Berkeley National Laboratory
[email protected]
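The block-structured AMR workflow described in IP4 starts from a simple idea: flag cells where the solution needs more resolution, cover the flagged region with rectangular patches, and carry finer grids only there. The sketch below illustrates just that tagging-and-refinement step in plain NumPy; it is not BoxLib code, and the gradient-based tagging criterion, grid size, threshold, and refinement ratio are illustrative choices, not values from the talk.

```python
import numpy as np

def tag_cells(u, dx, threshold):
    """Flag cells whose gradient magnitude exceeds a threshold
    (a simple refinement criterion; real codes use richer error estimators)."""
    gx, gy = np.gradient(u, dx)
    return np.hypot(gx, gy) > threshold

def refine_patch(u, tags, ratio=2):
    """Cover the tagged cells with a single rectangular patch and sample the
    coarse data onto a grid refined by `ratio` (piecewise-constant prolongation).
    A real framework such as BoxLib generates many boxes and load-balances them
    across ranks; this keeps only the core idea."""
    ii, jj = np.nonzero(tags)
    if ii.size == 0:
        return None, None
    i0, i1 = ii.min(), ii.max() + 1
    j0, j1 = jj.min(), jj.max() + 1
    coarse = u[i0:i1, j0:j1]
    fine = np.kron(coarse, np.ones((ratio, ratio)))
    return (i0, i1, j0, j1), fine

# Illustrative data: a smooth field with one steep front that triggers refinement.
n, dx = 64, 1.0 / 63
x, y = np.meshgrid(np.linspace(0.0, 1.0, n), np.linspace(0.0, 1.0, n), indexing="ij")
u = np.tanh((x - 0.5) * 40.0)
tags = tag_cells(u, dx, threshold=5.0)
box, fine = refine_patch(u, tags)
print("refined box:", box, "fine patch shape:", fine.shape)
```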
IP5
Beating Heart Simulation Based on a Stochastic Biomolecular Model

In my talk, I will introduce a new parallel computational approach that is indispensable for simulating a beating human heart driven by stochastic biomolecular dynamics. In this approach, we start from molecular-level modeling, and we directly simulate individual molecules in the sarcomere by the Monte Carlo (MC) method, thereby naturally handling the cooperative and stochastic molecular behaviors. Then we embed these sarcomere MC samples into contractile elements in a continuum material model discretized by the finite element method. The clinical applications of our heart simulator are also introduced.

Takumi Washio
University of Tokyo, Japan
[email protected]

IP6
Conquering Big Data with Spark

Today, big and small organizations alike collect huge amounts of data, and they do so with one goal in mind: extract "value" through sophisticated exploratory analysis, and use it as the basis to make decisions as varied as personalized treatment and ad targeting. To address this challenge, we have developed the Berkeley Data Analytics Stack (BDAS), an open source data analytics stack for big data processing.

Ion Stoica
University of California, Berkeley, USA
[email protected]
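IP6 centers on Spark, the core engine of BDAS. As a concrete reminder of what that programming model looks like, here is a minimal PySpark sketch of a distributed aggregation; the records, field layout, and spending-per-user question are made-up examples, and a real deployment would read from a distributed store rather than a small in-memory list.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; on a cluster the master URL and
# resources would come from the deployment, not from this script.
spark = SparkSession.builder.appName("bdas-sketch").getOrCreate()
sc = spark.sparkContext

# Illustrative records of the form "user_id,item,price".  In practice this
# would be a large file on HDFS/S3 rather than an in-memory list.
lines = sc.parallelize([
    "u1,book,12.5",
    "u2,music,3.0",
    "u1,book,7.5",
])

spend_per_user = (
    lines.map(lambda s: s.split(","))            # parse each record
         .map(lambda f: (f[0], float(f[2])))     # (user, price) pairs
         .reduceByKey(lambda a, b: a + b)        # parallel per-user sums
)

print(spend_per_user.collect())   # e.g. [('u1', 20.0), ('u2', 3.0)]
spark.stop()
```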
IP7
A Systems View of High Performance Computing

HPC systems comprise increasingly many fast components, all working together on the same problem concurrently. Yet, HPC efficient-programming dogma dictates that communicating is too expensive and should be avoided. Program ever more components to work together with ever less frequent communication?!? Seriously? I object! In other disciplines, deriving efficiency from large systems requires an approach qualitatively different from that used to derive efficiency from small systems. Should we not expect the same in HPC? Using Grappa, a latency-tolerant runtime for HPC systems, I will illustrate through examples that such a qualitatively different approach can indeed yield efficiency from many components without stifling communication.

Simon Kahan
Institute for Systems Biology and University of Washington, USA
[email protected]

IP8
…technology and reaching the capabilities of the human brain. In spite of good progress in high performance computing (HPC) and techniques such as machine learning, this gap is still very large. I will then explore two related topics through a discussion of recent examples: what can we learn from the brain and apply to HPC, e.g., through recent efforts in neuromorphic computing? And how much progress have we made in modeling brain function? The talk will be concluded with my perspective on the true dangers of superintelligence, and on our ability to ever build self-aware or sentient computers.

Horst D. Simon
Lawrence Berkeley Laboratory
[email protected]

SP0
SIAG/Supercomputing Best Paper Prize: Communication-Optimal Parallel and Sequential QR and LU Factorizations

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform and just as stable as Householder QR. We prove optimality by deriving new lower bounds for the number of multiplications done by non-Strassen-like QR, and using these in known communication lower bounds that are proportional to the number of multiplications. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We derive analogous communication lower bounds for LU factorization and point out recent LU algorithms in the literature that attain at least some of these lower bounds. The sequential and parallel QR algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example, up to 6.7 times over ScaLAPACK. A performance model for the parallel algorithm for general rectangular matrices predicts significant speedups over ScaLAPACK.

James W. Demmel
University of California
Division of Computer Science
[email protected]

Laura Grigori
INRIA
France
[email protected]

Mark Hoemmen
Sandia National Laboratories
[email protected]

Julien Langou
University of Colorado Denver
[email protected]

CP1
Scale-Bridging Modeling of Material Dynamics: Petascale Assessments on the Road to Exascale

Within the multi-institutional, multi-disciplinary Exascale Co-design Center for Materials in Extreme Environments (ExMatEx), we are developing adaptive physics refinement methods that exploit hierarchical, heterogeneous architectures to achieve more realistic large-scale simulations. To evaluate and exercise the task-based programming models, databases, and runtime systems required to perform many-task computation workflows that accumulate a distributed response database from fine-scale calculations, we are performing petascale demonstrations on the Trinity supercomputer, combining Haswell, Knights Landing, and burst buffer nodes.

Timothy C. Germann
Los Alamos National Laboratory
[email protected]

CP1
Parallel Recursive Density Matrix Expansion in Electronic Structure Calculations

We propose a parallel recursive polynomial expansion algorithm for construction of the density matrix in linear scaling electronic structure theory. The expansion contains operations on matrices with irregular sparsity patterns, in particular matrix-matrix multiplication. By using the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)], the beforehand unknown sparsity pattern is exploited while load imbalance is prevented. We evaluate the performance in electronic structure calculations, investigating linear scaling with system size and parallel strong and weak scaling.

Anastasia Kruchinina
Uppsala University
[email protected]

Emanuel H. Rubensson
Uppsala University
[email protected]

Elias Rudberg

SP1
SIAG/Supercomputing Career Prize: Supercom-
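To make the SP0 prize-paper result above more tangible: for tall-and-skinny matrices, the communication-optimal algorithm factors independent row blocks locally and then combines only the small R factors, which is where the savings over conventional ScaLAPACK-style QR come from. The NumPy sketch below is a serial illustration of that reduction idea, assuming a single combine step in place of the full reduction tree; it is not the authors' parallel implementation, and the block count and matrix sizes are arbitrary.

```python
import numpy as np

def tsqr_r(A, num_blocks=4):
    """Tall-skinny QR sketch: factor each row block locally, then factor the
    stacked n-by-n R factors once.  In a parallel setting each block lives on
    its own processor and only the small R factors are communicated (here a
    single combine step stands in for the reduction tree)."""
    blocks = np.array_split(A, num_blocks, axis=0)
    local_Rs = [np.linalg.qr(block, mode="r") for block in blocks]
    return np.linalg.qr(np.vstack(local_Rs), mode="r")

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))          # m >> n: tall and skinny

R_tsqr = tsqr_r(A)
R_direct = np.linalg.qr(A, mode="r")
# R is unique up to the signs of its rows, so compare absolute values.
print(np.allclose(np.abs(R_tsqr), np.abs(R_direct)))   # True
```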