VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures
Kenneth Moreland ■ Sandia National Laboratories
Hendrik Schroots ■ Intel
Christopher Sewell ■ Los Alamos National Laboratory
Kwan-Liu Ma ■ University of California, Davis
William Usher ■ University of Utah
Hank Childs ■ University of Oregon
Li-ta Lo ■ Los Alamos National Laboratory
Matthew Larsen ■ Lawrence Livermore National Laboratory
Jeremy Meredith and David Pugmire ■ Oak Ridge National Laboratory
Chun-Ming Chen ■ Ohio State University
James Kress ■ University of Oregon
Robert Maynard and Berk Geveci ■ Kitware

Traditional scientific visualization software approaches do not fare well in massively threaded environments. To address the needs of the high-performance computing community, the VTK-m framework fills the gaps in functionality by bringing together the most recent research.

Although the basic architecture for high-performance computing (HPC) platforms has remained homogeneous and consistent for more than a decade, revolutionary changes are appearing on leading-edge supercomputers, and plans for future supercomputers promise even larger changes. One troubling attribute of future HPC machines is the massive increase in concurrency required to sustain peak computation: most projections call for billions of threads to achieve an exaflop.1 This increase is attributed partly to the additional cores needed to achieve faster aggregate computing rates and partly to the additional threads per core used to hide memory latency. Because of cost and power limitations, system memory will not increase commensurately, which means algorithms will need strong scaling (that is, more parallelism per datum).

This trend toward massive threading can be seen in high-performance computing today. The current leadership-class computer at the Oak Ridge National Laboratory, Titan, requires between 70 and 500 million threads to run at peak, which is 300 times more than was required by its predecessor, JaguarPF. In contrast, the system memory grew only by a factor of 2.3.

The increasing reliance on concurrency to achieve faster execution rates invalidates the scalability of much of our scientific HPC code. New processor architectures are leading to new programming models and new algorithmic approaches, and the design of new algorithms and their practical implementation are a critical extreme-scale challenge.2,3

To address these needs, HPC scientific visualization researchers working for the United States Department of Energy are building a new library called VTK-m that provides a framework for simplifying the design of visualization algorithms on current and future architectures. VTK-m also provides a flexible data model that can adapt to many scientific data types and operate well on multithreaded devices. Finally, VTK-m serves as a container for algorithms designed in the framework, giving the visualization community a common point to collaborate, contribute, and leverage massively threaded algorithms.

The Challenges of Highly Threaded Visualization

The scientific visualization research community has been building scalable HPC algorithms for more than 15 years, and today there are multiple production tools that provide excellent scalability. However, our current visualization tools are based on a message-passing programming model. They expect a coarse decomposition of the data that works best when each processing element holds on the order of 100,000 to 1 million data cells.

For many years, HPC visualization applications such as ParaView, VisIt, EnSight, and FieldView have supported parallel processing on distributed-memory computer systems. The approach used by all these software products is a bulk synchronous parallel model, in which algorithms perform the majority of their computation as independent local operations.4

This parallel computation model has worked well for the last 15 years. Even recent multicore processors could be leveraged reasonably efficiently as independent message-passing processes on each core, allowing these tools to scale to petascale machines.5

However, processors designed for HPC installations are undergoing transformative design changes. With physical limitations preventing individual cores from executing instructions appreciably faster than their current rate, manufacturers are increasing total computational bandwidth by adding more cores to each processor.6 Some HPC processor designs go even further to increase the total possible execution throughput by removing latency-hiding features and incorporating vector processing. A consequence of all these features is that it is no longer sufficient to treat each core as an independent processor.

The upshot is that our parallel computing model is no longer symmetric: the relationship between two cores on the same processor differs significantly from the relationship between two nodes of a supercomputer. To address this asymmetry, a popular approach is to use a mixed-mode, or hybrid, parallel computing model that incorporates two levels of task organization. The first level comprises distributed-memory message-passing nodes in a cluster-like arrangement. Then, within each node of the distributed-memory arrangement is a processor or small set of processors capable of executing numerous threads that may require coordination.

From our point of view, the principal advantage of the mixed-mode parallel model is that we can leverage our existing software to manage the message-passing parallel units. The VTK-m framework focuses on the intranode parallelism that often requires massive threading and synchronized execution.
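To make the two levels of the mixed-mode model concrete, the following sketch pairs MPI for the coarse, distributed-memory level with C++ threads for the intranode level. This is a minimal illustration written for this discussion, not code from VTK-m or any of the tools named above; the even data partitioning and the process_cell function are hypothetical stand-ins for real visualization work.

    // Mixed-mode (hybrid) parallelism sketch: MPI ranks own coarse
    // data partitions; threads within each rank share one partition.
    // Illustrative only; process_cell is a hypothetical placeholder.
    #include <mpi.h>
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    static void process_cell(std::size_t /*cellId*/)
    {
      // Placeholder for per-cell visualization work.
    }

    int main(int argc, char* argv[])
    {
      MPI_Init(&argc, &argv);
      int rank = 0, numRanks = 1;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &numRanks);

      // Level 1: message-passing units, each holding a coarse partition.
      const std::size_t totalCells = 1000000;
      const std::size_t perRank = totalCells / numRanks;
      const std::size_t begin = rank * perRank;
      const std::size_t end =
          (rank == numRanks - 1) ? totalCells : begin + perRank;

      // Level 2: massive threading within the node on the local partition.
      const unsigned numThreads =
          std::max(1u, std::thread::hardware_concurrency());
      std::vector<std::thread> workers;
      for (unsigned t = 0; t < numThreads; ++t)
      {
        workers.emplace_back([=]() {
          // Interleave the partition's cells among the threads.
          for (std::size_t c = begin + t; c < end; c += numThreads)
          {
            process_cell(c);
          }
        });
      }
      for (std::thread& w : workers)
      {
        w.join();
      }

      MPI_Finalize();
      return 0;
    }

In practice, the message-passing level can remain in existing tools such as ParaView or VisIt, while a framework like VTK-m supplies the threaded inner level.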
Thus, new HPC systems require a much higher degree of parallelism, with threads operating on as few as one to 10 data cells. At this fine degree of parallelism, our conventional visualization algorithms break down in multiple ways.

Load Imbalance

Typically, data are partitioned under the assumption that the amount of work per datum is uniform. However, this is not true for all visualization algorithms, many of which generate data conditionally based on the input values. With only a few exceptions, current parallel visualization functions completely ignore this load imbalance, which is considered tolerable when amortized over larger partitions.

When the data are decomposed to the cell level, this amortization no longer occurs, which results in a much more severe load imbalance. Finely threaded visualization algorithms need to be cognizant of potential load imbalance and schedule work accordingly.

Dynamic Memory Allocation

When the amount of data a visualization algorithm generates depends on the values of the input data, the output's size and structure are not known at the outset of execution. In such a case, the algorithm must dynamically allocate memory as data are generated.

Because our conventional parallel visualization algorithms operate on coarse partitions in distributed memory spaces, processing elements can dynamically allocate memory completely independently from one another. In contrast, dynamic memory allocation from many threads within a shared-memory environment requires explicit synchronization that inhibits parallel execution.
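A pattern widely used in data-parallel visualization to avoid per-thread allocation is to split output generation into two passes: count how many items each cell will emit, convert the counts to write offsets with an exclusive prefix sum, allocate the output buffer once, and then let every cell write into its own disjoint range without locks. The sketch below is our illustration of that general pattern, not VTK-m source code; count_for_cell and generate_for_cell are hypothetical stand-ins for an operation, such as contouring, that emits a data-dependent number of outputs per cell.

    // Two-pass output generation (illustrative sketch, not VTK-m code).
    // Both loops are trivially data parallel; only the prefix sum needs
    // cross-thread cooperation, and parallel scans are well understood.
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Hypothetical stand-in: a data-dependent output count per cell,
    // such as a triangle count from a Marching Cubes case table.
    static std::size_t count_for_cell(std::size_t cellId)
    {
      return cellId % 3; // zero, one, or two outputs
    }

    // Hypothetical stand-in: emit this cell's outputs at "offset".
    static void generate_for_cell(std::size_t cellId, std::size_t offset,
                                  std::vector<float>& output)
    {
      const std::size_t n = count_for_cell(cellId);
      for (std::size_t i = 0; i < n; ++i)
      {
        output[offset + i] = static_cast<float>(cellId);
      }
    }

    std::vector<float> run(std::size_t numCells)
    {
      // Pass 1: count outputs per cell (parallelizable over cells).
      std::vector<std::size_t> counts(numCells);
      for (std::size_t c = 0; c < numCells; ++c)
      {
        counts[c] = count_for_cell(c);
      }

      // Exclusive prefix sum turns counts into write offsets.
      std::vector<std::size_t> offsets(numCells);
      std::exclusive_scan(counts.begin(), counts.end(),
                          offsets.begin(), std::size_t{0});

      // A single allocation sized for the whole output.
      const std::size_t total =
          numCells == 0 ? 0 : offsets.back() + counts.back();
      std::vector<float> output(total);

      // Pass 2: generate outputs; every cell writes a disjoint range,
      // so no locks or atomics are required.
      for (std::size_t c = 0; c < numCells; ++c)
      {
        generate_for_cell(c, offsets[c], output);
      }
      return output;
    }

The counts computed in the first pass also expose the load imbalance discussed above, so a scheduler can use them to distribute work more evenly.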
Topological Connections

Scientific visualization algorithms are dominated by operations on topological connections in meshes. Care must be taken when defining these connections across boundaries of data assigned to different processing elements: mutual data being read must be consistent, and mutual data being written must be coordinated.

Sidebar: Predecessors of VTK-m

Although the VTK-m software project itself started little more than a year ago, the software originated as an aggregation of three predecessor products: PISTON, Dax, and EAVL. The US Department of Energy (DoE) high-performance computing (HPC) community predicted early on that leadership-class facilities would be transitioning to heavily threaded processors and that we would need a significant change to our visualization algorithms and software.1 Consequently, researchers at the DoE national laboratories began considering the challenges of visualization on accelerator processors and created three separate toolkits, each focusing on a specific aspect of the problem.

The first toolkit, PISTON,2 considers the design of portable multithreaded visualization algorithms. Built on top of the Thrust library,3 algorithms in PISTON comprise a sequence of general parallel operations (see the sketch after this sidebar). Originally designed for CUDA, Thrust has a flexible device back…

…scale visualization problem, but the software packages did not integrate well. Recognizing the prospect of substantial duplication of effort, the developers of these software projects came together to work under a unified software product: VTK-m. Although VTK-m was born from a new code base, the PISTON, Dax, and EAVL developers contributed and evolved their respective technologies. Development of the predecessors has been phased out, and VTK-m is now a unified, well-integrated product of all three with continually evolving capabilities.

References

1. S. Ahern et al., "Scientific Discovery at the Exascale: Report from the DOE ASCR 2011 Workshop on Exascale Data Management, Analysis, and Visualization," Dept. of Energy…
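As a companion to the sidebar's description of PISTON, the sketch below shows the style of programming it refers to: an algorithm expressed as a sequence of general parallel operations using Thrust primitives. This is our own minimal example of the style, not code taken from PISTON.

    // Algorithm as a sequence of general parallel operations (Thrust).
    // A minimal example of the PISTON style described in the sidebar;
    // this is not PISTON source code.
    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/reduce.h>
    #include <thrust/transform.h>

    struct Square
    {
      __host__ __device__ float operator()(float x) const { return x * x; }
    };

    float sumOfSquares(const thrust::device_vector<float>& values)
    {
      thrust::device_vector<float> squared(values.size());
      // Parallel map: apply Square to every element on the device.
      thrust::transform(values.begin(), values.end(),
                        squared.begin(), Square{});
      // Parallel reduction: sum the mapped values.
      return thrust::reduce(squared.begin(), squared.end(), 0.0f,
                            thrust::plus<float>());
    }

Because Thrust provides multiple device backends, the same sequence of operations can run on CUDA GPUs or on multicore CPUs, which is the portability the sidebar's truncated sentence alludes to.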