Scientific Software Libraries for Scalable Architectures

Scientific Software Libraries for Scalable Architectures The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Johnsson, S. Lennart and Kapil K. Mathur. 1994. Scientific Software Libraries for Scalable Architectures. Harvard Computer Science Group Technical Report TR-19-94. Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:25811003 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Scientic Software Libraries for Scalable Architectures S Lennart Johnsson Kapil K Mathur TR August Parallel Computing Research Group Center for Research in Computing Technology Harvard University Cambridge Massachusetts To app ear in Parallel Scientic Computing SpringerVerlag Scientic Software Libraries for Scalable Architectures S Lennart Johnsson Kapil K Mathur Thinking Machines Corp and Thinking Machines Corp and Harvard University Abstract Massively parallel pro cessors introduce new demands on software systems with resp ect to p erfor mance scalability robustness and p ortability The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used p oses se rious challenges to software developers The Connection Machine Scientic Software Library CMSSL uses several novel techniques to meet these challenges The CMSSL contains routines for managing the data distribution and provides data distribution indep endent functionality High p erformance is achieved through careful scheduling of arithmetic op erations and data mo tion and through the automatic selection of algorithms at runtime We discuss some of the techniques used and provide evidence that CMSSL has reached the goals of p erformance and scalability for an imp ortant set of applications Introduction The main reason for large scale parallelism is p erformance In order for massively parallel architectures to deliver on the promise of extreme p erformance compared to conventional sup er computer architectures an eciency in resource use close to that of conventional sup ercomputers is necessary Achieving high eciency in using a computing system is mostly a question of e cient use of its memory system This is the premise on which the Connection Machine Scientic Software Library the CMSSL is based Another premise for the CMSSL is the notion of scalability Software systems must b e designed to op erate on systems and data sets that may vary in size by as much as four orders of mag nitude This level of scalability with resp ect to computing system size must b e accomplished transparently to the user ie the same program must execute not only correctly but also e ciently without change over this range in pro cessing capacity and corresp onding problem sizes Moreover programs should not have to b e recompiled for various system sizes This requirement will b e even more imp ortant in the future since over time the assignment of pro cessing no des to tasks is exp ected to b ecome much more dynamic than to day Robustness of software b oth with resp ect to p erformance and numerical prop erties are b ecoming increasingly imp ortant The memory system in each no de is b ecoming increasingly complex in order to match the sp eed of the memory system with that of an individual pro cessor The distributed nature of the total memory comp ounds the complexity of the memory system It is imp erative that software systems deliver a large fraction of the available p erformance over a wide range of problem sizes transparently to the user For instance small changes in array sizes should not impact p erformance in a signicant way Robustness with resp ect to p erformance in this sense will increase the demands on the software systems in particular on the runtime parts of the systems Robustness with resp ect to numerical prop erties is also b ecoming increasingly imp ortant The same software may b e used for problem sizes over a very wide range Condition numbers for the largest problems are exp ected to b e signicantly worse than for small problems As a minimum condition estimators must b e provided to allow users to assess the numerical quality of the results It will also b e increasingly necessary to furnish software for illconditione d problems and whenever p ossible automatically choose an appropriate numerical metho d Some parallel metho ds do not have as go o d a numerical b ehavior as sequential metho ds and this disadvantage is often increasing with the degree of parallelism The tradeo b etween p erformance and numerical stability and accuracy is very complex Much research is needed b efore the choice of algorithm with resp ect to numerical prop erties and p erformance can b e automated Portability of co des is clearly highly desirable in order to amortize the software investment over as large a usage as p ossible Portability is also critical in a rapid adoption of new technology thus allowing for early b enets from the increased memory sizes increased p erformance or decreased costp erformance oered by new technology But not all software is p ortable when p erformance is taken into account New architectures like MPPs require new software technology that often lags the hardware technology by several years Thus it is imp ortant to exploit the architecture of software systems such that architecture dep endent nonp ortable software is limited to as few functions as p ossible while maintaining p ortability of the vast amount of application software One of the purp oses of software libraries is to enable p ortability of application co des without loss of p erformance The Connection Machine Scientic Software Library to day has ab out user callable functions covering a wide range of frequent op erations in scientic and engineering computation In this pap er we illustrate how the goals of high p erformance and scalability have b een achieved The outline of the pap er is as follows In the next few sections we discuss memory systems for scalable architectures and their impact on the sequence to storage asso ciation used in mapping arrays to the memory system We then discuss data representations for dense and sparse arrays The memory system and the data representation denes the foundation for the CMSSL We then present the design goals for the CMSSL and how these goals have b een approached and achieved The multipleinstance capability of the CMSSL is an extension of the functionality of conventional libraries in the spirit of array op erations and critical to the p erformance in computations on b oth distributed and lo cal data sets The multipleinstance feature is discussed in Section Scalability and robustness with resp ect to p erformance b oth dep end heavily on the ability to automatically select appropriate schedules for arithmeticlogic op erations and data motion and prop er algorithms These issues are discussed by sp ecic examples A summary is given in Section Architectural mo del High p erformance computing has dep ended on elab orate memory systems since the early days of computing The Atlas introduced virtual memory as a means of making the main relatively slow memory app ear as fast as a small memory capable of delivering data to the pro cessor at its clo ck sp eed Since the emergence of electronic computers pro cessors have as a rule b een faster than memories regardless of the technology b eing used Today most computers conventional sup ercomputers excepted use MOS technology for b oth memories and pro cessors But the prop erties of the MOS technology is such that the sp eed of pro cessors is doubling ab out every months while the sp eed of memories is increasing at a steady rate of ab out yr Since the sp eed of individual memory units to day primarily built out of MOS memory chips is very limited high p erformance systems require a large number of memory banks units even when lo cality of reference can b e exploited High end systems have thousands to tens of MEMORY SYSTEM NETWORK M M M M M M P P P P P P Figure The memory system for distributed memory architectures thousands of memory banks The aggregate memory bandwidth of such systems far exceeds the bandwidth of a bus A network of some form is used to interconnect memory mo dules The no des in the network are typically of a low degree and for most networks indep endent of the size of the network A large variety of network top ologies can b e constructed out of no des with a limited xed degree Massively parallel architectures have employed two and threedimensional mesh top ologies buttery networks binary cub es complete binary trees and fattree top ologies The sp eed of the memory chips present the most severe restriction with resp ect to p erformance The second weakest technological comp onent is the communication system Constructing a communication system with the capacity of a full crossbar with a bandwidth equal to the aggre gate bandwidth of the memory system is not feasible for systems of extreme p erformance and would represent a considerable exp ense even for systems where it may b e technically feasible Hence with a constraint of a network with a lower bandwidth than that of the full memory system in MPPs pro cessors are placed close to the memory mo dules such that whenever lo cal ity of reference can b e exploited the p otential negative impact up on p erformance of the limited

Scientific Software Libraries for Scalable Architectures

Speeding up Spmv for Power-Law Graph Analytics by Enhancing Locality & Vectorization

Sparse-Matrix Representation of Spiking Neural P Systems for Gpus

A Framework for Efficient Execution of Matrix Computations

Automatic Selection of Sparse Matrix Representation on Gpus

Tridiagonalization of an Arbitrary Square Matrix William Lee Waltmann Iowa State University

REPRESENTATIONS of FINITE GROUPS Contents 1. Introduction 1

A Matrix-Based Approach to Global Locality Optimization

Data-Parallel Language for Correct and Efficient Sparse Matrix Codes

Complex Conjugation of Group Representations by Inner Automorphlsms

Optimization of GPU-Based Sparse Matrix Multiplication for Large Sparse Networks

Introduction to Programming with Arrays Using ELI

Sparse Matrix-Vector Multiplication on Gpgpus