
Super-Kamioka Computer System for Analysis

Akira Mantani, Yoshiaki Matsuzaki, Yasushi Yamaguchi, Kouki Kambayashi
(Manuscript received April 7, 2008)

The Institute for Cosmic Ray Research (ICRR) of the University of Tokyo newly established the Kamioka Observatory in 1996, and has continued to observe the elementary particles known as neutrinos by using the Super-Kamiokande Detection Equipment. This equipment contains a 50 000-ton ultrapure water tank measuring 39.3 meters in diameter and 41.4 meters in height, located 1000 meters underground. A total of 11 129 photomultiplier tubes (PMTs, 50 cm in diameter) are mounted on the inner wall of the tank. The Kamioka Observatory continues observation 24 hours a day, 365 days a year in order to detect neutrinos observable for only about 10 seconds from a supernova explosion, which may occur only once every few decades. The current size of total accumulated data is nearly 350 TB. In February 2007, Fujitsu installed a computer system (known as the "Super-Kamioka Computer System for Analysis") using mass-storage disk drives for saving and rapidly accessing the observed data. This paper describes the configuration of the Super-Kamioka Computer System for Analysis, explains how data is managed and rapidly accessed, and describes how throughput performance is improved.

1. Introduction
The Kamioka Observatory of the Institute for Cosmic Ray Research (ICRR) of the University of Tokyo1) has been conducting research on elementary particles through neutrino detection and nucleon decay searches. Super-Kamiokande2) (Figure 1) was constructed as neutrino observation equipment in 1996, and enabled the discovery of the phenomenon of neutrino oscillations in 1998. Super-Kamiokande subsequently achieved other impressive results such as proving that neutrinos have mass, and is now being used to solve such cosmic mysteries as the birth of black holes and stars. Neutrino observation entails the storage of about 50 GB of raw data a day, and the total size of currently accumulated data is almost 350 TB (including the size of processed data). In February 2007, we replaced the conventional, hierarchical tape-drive storage management system with a new storage system offering faster access based on Fujitsu's ETERNUS4000 model 500 magnetic disk drive system. The Kamioka Observatory must analyze past data using newly changed parameters. Given the need for faster data access in such analysis, Fujitsu's Parallelnavi Shared Rapid File System (SRFS) was installed.

This paper outlines the observation of data, describes the analysis system, and explains how high-speed data access is acquired.

2. Outline of data observation
The Super-Kamiokande detector consists of a cylindrical stainless-steel water tank (39.3 meters in diameter, 41.4 meters in height, and designed to hold about 50 000 tons of ultrapure water) and 11 129 photomultiplier tubes (used as photosensors) mounted on the inner wall of the tank.


Super-Kamiokande is located 1000 meters underground in the Kamioka Mine in Gifu Prefecture in order to prevent cosmic rays from influencing observation. Super-Kamiokande is used for such research purposes as observing neutrinos coming from space and searching for nucleon decay events. Neutrinos from space can be roughly classified into three types: neutrinos from the Sun (solar neutrinos), neutrinos generated by the reaction of cosmic rays with the Earth's atmosphere (atmospheric neutrinos), and neutrinos generated by a supernova explosion at the end of a star's life (supernova neutrinos). Neutrinos entering the tank of Super-Kamiokande may react with the pure water in the tank and generate a glimmering bluish-white light (known as "Cherenkov light"). The energy, reaction position, and travel direction of the incoming neutrinos are calculated by sensing the Cherenkov light with the photomultiplier tubes. This operation is called event reconstruction.

One particularly important point of neutrino observation is to remove background events. For example, the background events that may occur during the observation of solar neutrinos involve environmental gamma rays coming from outside the tank and minute residual amounts of radioactive materials in the water inside the tank. Such background events generate Cherenkov light in the same way as a neutrino reaction and thus are a source of confusion. However, the Super-Kamiokande system can distinguish neutrino events from background events by analyzing the observed particle generation position and travel direction. The data observed on background events clearly identified by using the reaction position and other factors is discarded immediately after being acquired.

After the data of background events is discarded, the remaining data is converted into a world-standard format (known as "ZEBRA"3)) established by the European Organization for Nuclear Research (CERN).4) The converted data is then sent to the Super-Kamioka Computer System for Analysis. This format conversion operation is called the reformat operation. One record consists of about 5 KB, and about 11 million events are recorded each day.
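As a quick consistency check (a back-of-the-envelope sketch added here for illustration, not part of the original paper), the approximate record size and daily event count quoted above imply the roughly 50 GB/day figure used throughout the rest of this paper:

```python
# Back-of-the-envelope check of the daily data volume quoted in the text.
# The record size and event count are the approximate figures given above.
RECORD_SIZE_BYTES = 5 * 1024        # about 5 KB per reformatted event record
EVENTS_PER_DAY = 11_000_000         # about 11 million events recorded each day

daily_bytes = RECORD_SIZE_BYTES * EVENTS_PER_DAY
daily_gib = daily_bytes / 1024**3   # convert to GiB

print(f"approximate daily volume: {daily_gib:.0f} GB")  # roughly 50 GB per day
```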

Figure 1
Super-Kamiokande Neutrino Detection Equipment. (Illustration copyright Kamioka Observatory, Institute for Cosmic Ray Research (ICRR), University of Tokyo.)


Therefore, about 50 GB are saved per day. The Super-Kamioka Computer System compensates the data of all remaining events by using the parameters for the respective characteristics of the photomultiplier tubes and the parameters for water quality (e.g., the transparency of the water), and reconstructs these events in real time. Since these parameters may be subject to seasonal variations and the applications used to reconstruct events may be constantly improved and advanced, the events are often reanalyzed by using all the accumulated raw data. Given the huge amount (about 110 TB) of raw data needed for reanalysis, a system capable of superfast data access must be used.

Figure 2 shows the configuration of the Super-Kamioka Computer System for Analysis. The next section describes the configuration of the system located at the experimental site in the mine, the system outside the mine (located in the computation and research buildings), and the technology necessary to acquire high-speed access.

3. System configuration of Super-Kamioka Computer System for Analysis
The Super-Kamioka Computer System for Analysis consists of the system inside the mine (for collecting data observed by Super-Kamiokande and converting the data format), the system outside the mine (for accumulating the format-converted data and analyzing the accumulated data), the terminals used daily, the data backup system, the monitoring system, and the Gigabit Ethernet for connecting these systems.

Figure 2
Super-Kamioka Computer System for Analysis. (Block diagram. Experimental site inside the mine: front-end processing system with front-end data acquisition servers, online data servers with preliminary data storage, and reformat servers, sending about 50 GB of observed data per day. Computation and research buildings: system outside the mine with file transfer servers, data servers, job control servers, data analysis servers with automatic load distribution of analysis jobs, and backup storage.)


The following describes the main constitutive systems inside and outside the mine.

3.1 System inside the mine
Super-Kamiokande observes neutrinos on a 24-hour basis in order to steadily detect important events involving cosmic rays. The front-end processing system for this detection is a real-time system that requires high availability.

The front-end processing system consists of front-end data acquisition servers (24 PRIMERGY RX200 S3 units), online data servers (two PRIMERGY RX300 S3 units and one ETERNUS4000 model 100), reformat servers (ten PRIMERGY RX200 S3 units), and an online network (four Catalyst4948 units and two Catalyst2960G units) for connecting these components.

The observation system is connected to the front-end data acquisition servers via a dedicated interface developed by ICRR of the University of Tokyo. Dedicated applications collect the observed data. The front-end data acquisition servers send the collected data to the preliminary data storage connected to the online data servers, where the sent data is accumulated.

The online data servers communicate with the data servers (used as the final storage of observed data) in the system outside the mine. Should this data-server communication be interrupted, the observed data sent to the online data servers in real time must not be lost. To reduce the risk of losing such real-time observed data, the storage capacity of the online data servers must be as large as possible. The ETERNUS4000 model 100 has a limited storage capacity of up to 2 TB in one RAID group. However, we constructed a 5-TB file system by connecting multiple RAID groups to one file system by using the Logical Volume Manager (LVM) of the Linux system. This enables the online data servers to accumulate observed data for up to about 100 days.

The front-end data acquisition servers execute the event reconstruction described above. Then, the reformat servers reformat the data and transfer the necessary data to the data servers at a rate of about 50 GB per day.
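The roughly 100-day retention mentioned above follows directly from the file-system size and the daily transfer rate. The sketch below (added here for illustration; the number of RAID groups is an assumption, since the paper only states that multiple 2-TB groups were joined with Linux LVM into a roughly 5-TB file system) makes that arithmetic explicit:

```python
# Sketch of the online-data-server capacity reasoning described above.
# Assumption: three 2-TB RAID groups are combined; the paper only states that
# multiple RAID groups were joined with Linux LVM into a roughly 5-TB file system.
RAID_GROUP_TB = 2.0
RAID_GROUPS = 3                    # hypothetical count chosen to reach >= 5 TB
FILESYSTEM_TB = 5.0                # file-system size quoted in the text
DAILY_DATA_GB = 50.0               # observed data transferred per day

retention_days = FILESYSTEM_TB * 1000 / DAILY_DATA_GB
print(f"combined raw capacity: {RAID_GROUP_TB * RAID_GROUPS:.0f} TB")
print(f"retention at {DAILY_DATA_GB:.0f} GB/day: about {retention_days:.0f} days")
```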


3.2 System outside the mine
The system outside the mine is used to accumulate and analyze the observed data sent from the front-end processing system. This system can simultaneously process up to 1080 jobs for high-speed parametric data analysis (using four CPUs/node × 270 nodes). Calibration for newly accumulated observation data and reanalysis jobs are always executed; typically about 500 jobs, and sometimes more than 1080 jobs (including jobs awaiting execution), are input. The disk drives and file systems must therefore be carefully designed to ensure efficient data access for these jobs. The next section describes the concrete points of such design.
1) System configuration and main functions
The system outside the mine consists of data servers (three PRIMEQUEST 520 units and six ETERNUS4000 model 500 units), job control servers (ten PRIMERGY BX620 S3 units), file transfer servers (two PRIMERGY BX620 S3 units), data analysis servers (270 PRIMERGY BX620 S3 units), and a backbone network (Catalyst6509E) for connecting these components. The servers are connected via the backbone network and multiple Gigabit Ethernet lines for high-speed network access.
The data servers have a total data storage capacity of 700 TB. SRFS provides the file-sharing environment for the job control servers and data analysis servers.
2) Program development environment and job control
Users can use the job control servers to develop programs and enter jobs into the data analysis servers. The job control servers provide a development environment that includes the Intel compilers and the VTune performance analyzer for CPU performance analysis. The Parallelnavi Network Queuing System (NQS) environment—batched-job operation support software—is also provided to immediately execute the developed and debugged programs. Thus, all operations required for analysis, from development through execution and evaluation, can be performed on one terminal.
The jobs input from a job control server are assigned by NQS to idle CPUs selected from among the data analysis servers, and are then executed by those CPUs. Therefore, users need not search for idle resources. In each cabinet of the blade servers that constitute the data analysis servers, two 1000BASE-T interface lines are shared by ten blades. If too many jobs are concentrated in one blade-server cabinet, network overhead may increase excessively. Therefore, the NQS environment was set so that jobs are distributed to the cabinets as evenly as possible.

4. High-speed data access
Merely accumulating the observed data would only require a function for storing about 50 GB in a 24-hour period. Since the data analysis servers must process all accumulated data, however, data must be input from and output to the servers as quickly as possible so that analysis processing can always continue.
We took the following measures for this purpose:
1) File system
The data analysis servers and file transfer servers constitute a file system mounted over a network so that data files can be used evenly from any data analysis server. The Network File System (NFS) is generally used as such a network-mounted file system. However, our experience shows that NFS does not offer high processing speed and has difficulty simultaneously processing the input/output requests concurrently issued from all the data analysis servers. To solve these problems, we employed SRFS instead of NFS.
We then created input/output model jobs for the analysis programs, and used SRFS to measure job performance. Based on the measurement results, we decided on an input/output data length of 8 MB and designed the conditions necessary to achieve maximum performance when using this input/output data length.
2) Storage
The file transfer servers are configured with sets of seven hard disk drives (HDDs) to enable concurrent use of the drives in parallel. A set of seven HDDs is called a physical volume. When this configuration is used without modification, the maximum physical-volume performance restricts the maximum input/output performance. To ease this restriction, we constructed a logical volume by combining multiple physical volumes so that several physical volumes could be used together when accessing the logical volume.
The important points for increasing the maximum logical-volume performance are determining how many physical volumes to use to constitute a logical volume (number of stripe columns), and how much data to input to and output from one physical volume at a time per input/output operation on the logical volume (stripe width). We determined these values according to the performance measurement results. Specifically, we determined that a stripe width of 128 or 256 KB with 16 or 8 stripe columns would achieve optimum performance. When using 16 stripe columns, however, we found that the required number of HDDs would exceed the limit of the system configuration. We therefore employed 8 stripe columns and a stripe width of 256 KB (see the sketch after item 3) below for the resulting layout).
3) Network
SRFS is a network file system in which the real data is accessed via Gigabit Ethernet. However, traffic other than input/output operations can be expected to reduce input/output performance, and the broadcast packets issued by SRFS itself may disturb other communications. We therefore constructed a dedicated input/output network.
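As a rough illustration of the layout chosen in items 1) and 2) above (a conceptual sketch added here, not the actual SRFS or LVM implementation), an 8-MB request striped over 8 physical volumes with a 256-KB stripe width splits into 32 stripe units, four per physical volume, so all columns can work in parallel:

```python
# Illustrative mapping of one 8-MB request onto a striped logical volume with
# 8 stripe columns and a 256-KB stripe width. Conceptual sketch only; this is
# not the actual SRFS or LVM implementation.
STRIPE_COLUMNS = 8                  # physical volumes per logical volume
STRIPE_WIDTH = 256 * 1024           # bytes placed on one column before moving on
IO_LENGTH = 8 * 1024 * 1024         # 8-MB requests issued by the I/O library

def stripe_units(offset: int, length: int):
    """Yield (column, offset_within_column, size) pieces of a logical request."""
    end = offset + length
    while offset < end:
        unit = offset // STRIPE_WIDTH              # global stripe-unit index
        column = unit % STRIPE_COLUMNS             # which physical volume
        within = offset % STRIPE_WIDTH
        size = min(STRIPE_WIDTH - within, end - offset)
        yield column, (unit // STRIPE_COLUMNS) * STRIPE_WIDTH + within, size
        offset += size

units = list(stripe_units(0, IO_LENGTH))
bytes_per_column = {c: sum(s for col, _, s in units if col == c)
                    for c in range(STRIPE_COLUMNS)}
print(len(units), "stripe units")                          # 32 units of 256 KB
print({c: b // 1024 for c, b in bytes_per_column.items()}) # 1024 KB on each column
```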


4) Input/output library
We developed a dedicated input/output library in which the input/output data length is 8 MB. This provides high-speed data input/output from user-developed analysis programs.

5. Throughput performance
Since up to four analysis programs can be executed on one data analysis server of this system, up to 1080 input/output requests may be issued simultaneously by all the data analysis servers. For smooth analysis, throughput performance high enough to process this many input/output requests without delay must be provided.
The following describes the measures taken to acquire high throughput and cites the throughput measurement results.
1) Suppression of network slowdown
Ideally, the network band should be fully used in order to use a network efficiently. The network band can be fully used by employing one of two methods: (1) dividing a request into multiple operations to be executed in parallel on one network, or (2) simultaneously executing multiple requests on one network. In this system, however, there are many more physical interfaces on the data analysis server side than on the data server side, so the band at the data servers is comparatively narrow and the network may slow down. To prevent such a network slowdown, we restricted the number of data analysis servers assigned to each physical interface of each data server.
More specifically, each data server has seven physical interfaces connected to the input/output network, but up to 270 data analysis servers must be served. Therefore, we assigned each physical interface to 38 or 39 data analysis servers.
With this setting, an error in any physical interface of a data server may render the 38 or 39 connected data analysis servers unusable. However, we accepted this design in consideration of high throughput performance.
2) Suppression of communication time-outs
The time-out value and retry count in network communications are important design points. An excessively large time-out value delays error detection and recovery. Conversely, an excessively small time-out value increases the number of retried communications and reduces communication performance.
Since throughput performance was considered more important than error-detection performance, we set a time-out value that allows processing to continue even under a high load. Given the difficulty of calculating the appropriate time-out value, however, the time-out value was eventually tuned according to actual measurement results.
In the actual measurement, the 1080 model jobs described in the previous section were executed concurrently to simultaneously issue input/output requests to one file system. The measurement was repeated while incrementing and decrementing the time-out value. Moreover, the occurrence of a time-out invokes retry processing and results in varying job execution times. We therefore determined the optimal value as one for which the job execution time was roughly the same as when no time-out occurred.
3) Measurement results of throughput performance
In measuring throughput performance, the 1080 jobs used for the time-out value tuning described above were executed concurrently on the 270 data analysis servers in order to measure the input/output speed of input/output operations on the three data servers. In this case, it takes a certain amount of time from the start of the first job until all 1080 jobs are operating concurrently, and the number of jobs operating in parallel gradually decreases at the end because shorter jobs terminate earlier. We therefore executed each job several times continuously and then discarded the first and last execution results, so as to exclude measurements taken with fewer than 1080 concurrent jobs.


We executed jobs on the data analysis servers to read and write data on the data servers, and obtained a throughput of 960 MB/s as the average read/write value.

6. Conclusion
This paper outlined the observation of data at the Kamioka Observatory of the Institute for Cosmic Ray Research, University of Tokyo, introduced the computer system used for on-site data analysis, and then described the means and background of acquiring high-speed data access. Because every actual data observation system has its own data characteristics and system configuration, our solution cannot be considered optimal in all cases. However, we hope that our computer system design concepts and problem-solving approaches prove helpful to you.

Acknowledgement
The authors wish to thank Research Associate Yusuke Koshio of the Kamioka Observatory, Institute for Cosmic Ray Research, The University of Tokyo, for his kind guidance and cooperation.

References
1) The Institute for Cosmic Ray Research (ICRR) of the University of Tokyo, Kamioka Observatory.
http://www-sk.icrr.u-tokyo.ac.jp/index_e.html
2) Super-Kamiokande.
http://www-sk.icrr.u-tokyo.ac.jp/sk/index-e.html
3) CERN: The ZEBRA System.
http://wwwasdoc.web.cern.ch/wwwasdoc/zebra_html3/zebramain.html
4) CERN.
http://public.web.cern.ch/Public/Welcome.html

Akira Mantani, Fujitsu Ltd.
Mr. Mantani joined Fujitsu Ltd., Japan in 1985 and has been engaged in the installation support of computer systems for national science institutions. He worked in London from 1996 to 2003 and was engaged in the installation and support of major European supercomputer systems.

Yasushi Yamaguchi, Fujitsu Ltd.
Mr. Yamaguchi joined Fujitsu Ltd., Japan in 1980 and was engaged in the installation support of supercomputer systems up to 2001. Then he was engaged in the development of applications and the construction of social computer systems until 2006. Since 2007, he has mainly supported critical computer systems in various projects.

Yoshiaki Matsuzaki, Fujitsu Ltd.
Mr. Matsuzaki joined Fujitsu Ltd., Japan in 1988 and has been engaged in the development of applications and the support of computer systems for research and development institutes in the Tsukuba area. Since 2007, he has also been engaged in developing the scientific computer business.

Kouki Kambayashi, Fujitsu Ltd.
Mr. Kambayashi joined Fujitsu Ltd., Japan in 1989 and has been engaged in the installation support of computers and network systems for research and development institutes. He has been supporting the Super-Kamioka Computer System ever since its installation in 2007.
