Technology for the K

 Atsuji Ogasa  Hiroyuki Maesaka  Kiyotaka Sakamoto  Sadanori Otagiri

Visualization technology makes or videos from the results of numerical computations performed by . To visualize the results of parallel computation on a super-large scale of a few to a few tens of thousands of parallel processes, processing in which bulky computation result data are rendered at high speeds is required. The conventional method of transferring computation result data to a visualization server may give rise to many challenges that are difficult to meet such as reassembly and transfer of extensive amounts of computation result files. This paper presents technologies for solving these challenges to visualize super-large-scale computations performed by the K computer and the results of applying those technologies.

1. Introduction transferring the numerical data computed on The results of numerical computations numerical computation results obtained by performed by supercomputers are obtained a to a visualization server, in as a series of digital numerical data on scales which visualization is used, as shown that have recently come to exceed one billion. in Figure 1. The number of parallel processes Computer (CG) can be used to render performed in a server is often one to about a these numerical data as images or videos to help dozen. In the past, numerical computations people intuitively understand them, which is a were usually performed in a supercomputer on technology called visualization. scales of a few tens to a few hundreds of parallel This paper presents solutions to challenges processes. And, for visualization, the obtained for visualizing super-large-scale computation data on numerical computation results were results with the K computernote)i and the reassembled into one to about a dozen files technologies applied. depending on the number of parallel processes performed by the visualization server. In the 2. Challenges in visualization K computer, however, numerical computation of large-scale parallel takes place on scales of a few to a few tens of computation thousands of parallel processes. Reassembling Generally, visualization is achieved by the resultant data for visualization takes a tremendous amount of time and trouble, and the note)i “K computer” is the English name that RIKEN has been using for the extensive size of the reassembled files give rise to supercomputer of this project since July challenges: 2010. “K” comes from the Japanese word 1) Too much transfer time is required to “Kei,” which means ten peta or 10 to the 16th power. move the computation result data to the

348 FUJITSU Sci. Tech. J., Vol. 48, No. 3, pp. 348–356 (July 2012) A. Ogasa et al.: Visualization Technology for the K computer

visualization server. For example, the fi le 3. Visualization of the K size of double precision scalar data with computer computational grid points of the cube of The fundamental cause of the challenges in 40963 is about 550 Gbytes, which requires visualizing the results of computations with the at least one hour for transfer even with a K computer is that computation result data are 1000BASE-T network. diffi cult to transfer and reassemble. The cause of 2) Even if the computation result data are these challenges can be removed by the following successfully transferred, the visualization approaches. server has insuffi cient memory to store the 1) Eliminating the need to transfer computation result data. For example, the computation result data by visualizing on a memory size of PC , which are node of the K computer often used as visualization servers, is only 2) Eliminating the need to reassemble up to 192 Gbytes and the 550-Gbyte data computation result data by visualizing with mentioned above cannot be stored even if the same number of parallel processes as they are converted into single-precision the computation data. In general, visualization processing is 3) Thinning grids out from computation result often performed while interactively changing data at regular intervals so as to reduce the parameters as in the sweeping of a cross-section fi le size may cause important position or iso-surface value. On a node of the K to be lost. Appropriate thinning requires computer, however, even visualization processing the characteristics of each type of data to be must be performed as a batch job for operational taken into consideration, and this cannot be . For that , ,1) handled in a general manner. which is suitable for grasping at a glance the In this way, with large-scale parallel physical quantity distribution in the entire computations such as those performed by 3D computational without interactive the K computer, there is a problem that processing such as cross-section sweeping, is the conventional visualization technique of employed as a visualization technique for the reassembling computation result data for present development. transfer to the visualization server cannot be Based on these approaches, we have employed. developed visualization software that uses

Supercomputer

Visualization server

Data Visualization result transfer Results of numerical Visualization calculation software

Figure 1 Concept of visualization.

FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 349 A. Ogasa et al.: Visualization Technology for the K computer

volume rendering, which effi ciently runs on a computation result data. node of the K computer even with the number of 2) To call an API from the numerical parallel processes on a scale exceeding thousands computation program. and a visualization library that includes tools One good point of the former is that for assisting use of the visualization software. visualization can be achieved without modifying Figure 2 shows the concept of visualization the numerical computation program. An with the K computer. Data on the results of example of program coding based on 1) is shown numerical computations are visualized on a node in Figure 3. of the K computer and the result of visualization The latter is advantageous in that the data is output as an fi le in JPEG or some other in the memory can be directly visualized, which format. The size of the image fi le depends on eliminates the data transfer cost from reading the resolution of the image regardless of the fi les and makes it easier to visualize the process number of computational grid points, and is as of computation. small as a few hundred Kbytes to a few Mbytes in many cases, which means that the result of 3.2 File interface visualization can be easily transferred to the A fi le interface is intended for directly user’s terminal for viewing. reading the computation result data. File As interfaces for passing the computation interfaces are already prepared for some fi elds result data to the visualization library, we have of numerical computation performed by the K prepared program and fi le interfaces. computer. Simply by outputting computation A program interface calls an application result data in a format compatible with the programming interface (API) with C or fi le interface, the user can directly read the FORTRAN from the user program. 1 computation result data to the visualization shows the APIs and data structures of the library for visualization without having to use visualization library. A program interface can be APIs to create a program. used in two ways: An image of using the visualization library through program and fi le interfaces is shown in 3.1 Program interface Figure 4. 1) To call an API from the program to read the As tools to help people use visualization

K computer User terminal

Data transfer Visualization result image Visualization result (still image, video) Results of Visualization numerical calculation software

Figure 2 Concept of visualization with the K computer.

350 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) A. Ogasa et al.: Visualization Technology for the K computer

Table 1 APIs and data structures of visualization library. Structure for structured grid data: s_PBR_array_info Common API Description CPBR_Init Initializes PBVR Member name Type Description Specification of existence domain CPBR_Get_imageinfo Acquires composite image information extent float[2][3] of entire model CPBR_Run Runs PBVR datatype int Data type CPBR_Finalize Finalizes PBVR datasize int[3] Grid size Outputs visualization result as a JPEG CPBR_OutputJpeg file coordtype int Coordinate type Outputs visualization result as a PNG coord float * Coordinate data CPBR_OutputPng file data void * Visualization data

API for structured Structure for unstructured grid data: s_PBR_cell_info Description grid data definition Member name Type Description Initializes a structure that stores CPBR_Init_array_info Specification of existence domain structured grid data parameters extent float[2][3] of entire model Registers a structure that stores node.n int Number of nodes CPBR_Set_array_info structured grid data parameters with PBVR node.datatype int Data type node.data void * Visualization data (node)

API for unstructured node.coord float * Coordinate data Description grid data definition Data type (common to all cell.datatype int Initializes a structure that stores elements) CPBR_Init_ucd_info unstructured grid data parameters cell.n_tet int Number of tetrahedral elements Registers a structure that stores Tetrahedral element connection cell.c_tet int * CPBR_Set_ucd_info unstructured grid data parameters information with PBVR cell.d_tet void * Tetrahedral element data value cell.n_pyr int Number of pyramidal elements Viewer operation API Description Pyramidal element connection cell.c_pyr int * information Rotates an object around the CPBR_Set_view specified axis cell.d_pyr void * Pyramidal element data value Number of triangular prism cell.n_prism int elements Triangular prism element cell.c_prism int * software, we have prepared a command library connection information Triangular prism element data and file splitter tool. The command library is a cell.d_prism void * value tool for creating a visualization parameter file cell.n_hex int Number of hexahedral elements that stores the point of view information and Hexahedral element connection cell.c_hex int * transfer functions for volume rendering. It is information implemented as an by Python and cell.d_hex int Hexahedral element data value runs on the login node of the K computer. The file splitter tool, which is used when the computation result is output as a single file, is intended for implemented volume rendering with a target of splitting the file into an arbitrary number of files achieving a parallelization efficiency of 80% or for the purpose of efficient parallel visualization. higher on a scale of 1000 parallel processes. The existing methods of implementing parallelization 4. Visualization software for by volume rendering include parallelization for massively parallel processes each pixel to be rendered and parallel volume To provide visualization software that rendering for split computational domains in works with massively parallel processes, we have order to composite rendered sub-images and

FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 351 A. Ogasa et al.: Visualization Technology for the K computer

Note: In the original code, the part that follows the double slash “//” is written in Japanese. Figure 3 Example of coding with APIs.

Program interface File interface

K computer Login node Login node K computer Computation code

Computation result Command library for Visualization parameter generation parameters File read program

API Command library for Visualization PBVR File parameter generation parameters Computation splitter PBVR result tool

Image file Image file

Figure 4 Interfaces for using visualization library.

352 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) A. Ogasa et al.: Visualization Technology for the K computer

create an entire image. However, parallelization the computation domain is split, one of the split for each pixel requires inter-node domains may have another domain inserted between computation nodes to which the cells before or after along the direction of the line of to be visualized that are sorted along with sight. These things may make it hard for people the line of sight penetrating the pixel are to judge the order. In this case, the data on distributed, resulting in lower efficiency with numerical computation results must be further massively parallel processes. In the method split until it becomes possible to judge the order of parallelization for each split computation before visualization can be performed. domain, sub-images must be composited by each Accordingly, we have adopted particle-based pixel in order from the point of view toward the volume rendering (PBVR)2) developed by the far side so as to obtain the penetration effect. Koyamada Laboratory of Kyoto University as a This means that inter-node communication may general-purpose volume rendering technique for be necessary in some cases to judge the order of massively parallel processes. Figure 5 shows a images. It also means that, depending on how processing block of PBVR. For PBVR,

Rank 0 Rank n-1

Repeat by Repeat by number of cells number of cells

Data type conversion Data type conversion

Vertex coordinates Vertex coordinates geometric transformation geometric transformation

Data gradient Data gradien computation computation

Computation of number Computation of number of generated particles of generated particles

Particle generation Particle generation

Particle projection Particle projection

Sub-image Sub-image

Compositing Compositing

Final image

Figure 5 PBVR processing block.

FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 353 A. Ogasa et al.: Visualization Technology for the K computer

parallelization is performed for each cell enclosed 5. Estimation of visualization in a computation grid, rather than each pixel or library performance split domain. While the amount of computation We have estimated the performance of the required for cell rendering is larger than with visualization library with regard to large-scale the existing techniques, the rendering process data on parallel computation results. First, we for each cell can be carried out independently used scalar data on grid points of a Cartesian grid without inter-node communication, which makes with a computational space resolution of 1024 it suitable for massively parallel processes. In × 1024 × 900 (approximately 900 million cells) addition, compositing in PBVR is performed not as inputs and used the K computer to measure for each pixel but for each subpixel resulting the processing performance of generation of a from pixel splitting, which eliminates the need volume-rendered image with a resolution of 1024 to consider the order of rendering to obtain the × 1024 with 1000 and 2000 parallel processes. penetration effect. Consequently, no inter-node Based on the result of this measurement, we communication for judging the order takes place estimated the visualization time for large-scale and there is never a case in which the order data with a computational space resolution of cannot be judged. 4096 × 4096 × 3600 (approximately 60.4 billion Meanwhile, visualization in a massively pixels) with the number of parallel processes parallel environment is known to cause a large changed from 1000 up to 80 000. The result of amount of inter-node communication due to the estimation is shown in Figure 6. With data compositing, which decreases the processing of this scale, the visualization time is estimated efficiency, even if the cell rendering efficiency to be almost linearly reducible with up to is improved.3) As compositing techniques, the 20 000 parallel processes and, with over 20 000 direct send and binary swap methods have parallel processes, visualization is estimated to widely been used up to now. The former is be achievable in an almost constant time. We applicable to an arbitrary number of parallel expect to achieve the purpose of the development processes and easy to implement but inter- as shown below. node communication causes a bottleneck with • By visualizing large-scale data on the massively parallel processes, leading to a serious compute nodes it becomes unnecessary degradation in speed. The latter features a low to transfer a large amount of data to the communication cost but is only applicable to the visualization server. number of parallel processes that is a power of • By conducting visualization in a parallel two. Numerical computation is not necessarily environment it becomes unnecessary to performed by parallel processes in a number that rebuild the data. is a power of two and the binary swap method • To work efficiently in a massively parallel does not suit the purpose of visualizing without environment on a scale of several thousands reassembling the result data. For the present to several tens of thousands of parallel development, the 2-3 swap method,4) which processes in the K computer. is an improvement of binary swap to make it applicable to an arbitrary number of parallel 6. Future challenges processes, has been adopted and implemented as We to measure the visualization a compositing technique. performance with the visualization library on a scale of a few tens of thousands of parallel processes in line with the expansion of the system scale of the K computer to verify the estimation

354 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) A. Ogasa et al.: Visualization Technology for the K computer

200

180

160

140

120

100

80 Processing time (s) 60

40

20

0 1000 2000 5000 10 000 20 000 40 000 60 000 80 000 Number of parallel processes

Figure 6 Estimated visualization time for 60.4 billion cells.

result. We also intend to select some cases References of massively parallel large-scale computation 1) R. A. Drebin et al.: Volume Rendering. , Vol. 22, No. 4, SIGGRAPH ’88, actually performed by the K computer to test the pp. 65–74, 1988. application of the visualization library, thereby 2) N. Sakamoto et al.: Improvement of particle- based volume rendering for visualizing irregular developing an environment and tools for making volume data sets. & Graphics, the library even easier to use. Vol. 34, No. 1, pp. 34–42 (2010). 3) J. Nonaka et al.: Performance Evaluation of Sort-Last Image Compositing in a Massively 7. Conclusion Parallel System. Proceedings of the Symposium on Computational Fluid Dynamics (CDROM). We have identified challenges and (in Japanese) No. 22, Research Paper No. B4-1, developed a visualization library to solve them. 2008. 4) H. Yu et al.: Massively Parallel Volume The K computer has been used to measure the Rendering Using 2-3 Swap Image Compositing, visualization time with the visualization library, Proceedings of IEEE/ACM Supercomputing 2008 Conference, 1-11. and the visualization time in a massively parallel environment has been estimated. In this way, we have verified that the result of massively parallel large-scale computation can be visualized by using the visualization library. Lastly, we would like to extend our sincerest gratitude to Mr. Motoyoshi Kurokawa of RIKEN, who offered assistance and discussed implementing the visualization library in the K computer.

FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 355 A. Ogasa et al.: Visualization Technology for the K computer

Atsuji Ogasa Kiyotaka Sakamoto Fujitsu Ltd. Fujitsu Systems East Ltd. Mr. Ogasa is currently engaged Mr. Sakamoto is currently engaged in development of applications for in development of applications for visualization in the field of in the field of scientific computation. computation.

Hiroyuki Maesaka Sadanori Otagiri Fujitsu Systems East Ltd. Fujitsu Systems East Ltd. Mr. Maesaka is currently engaged Mr. Otagiri is currently engaged in development of applications for in development of applications for visualization in the field of scientific visualization in the field of scientific computation. computation.

356 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012)