Introduction of Supercomputer System for DNA Data Bank of Japan
Introduction of Supercomputer System for DNA Data Bank of Japan
V Hideaki Sugawara V Hiroaki Yamada V Hisahito Yamada (Manuscript received March 19, 2008)
The Center for Information Biology and DNA Data Bank of Japan (CIB-DDBJ) at the National Institute of Genetics (NIG) has joined forces with GenBank (USA) and EMBL (Europe) to form the International Nucleotide Sequence Database Collaboration (INSDC) and make their databases accessible to researchers over the Internet. In parallel with advances made in the life sciences, the INSDC databases are increas- ing rapidly in size by 30–50% annually. Moreover, a large-scale computer system for constructing and releasing them and providing search and analysis services has become necessary. Important requirements of such a system will include ongoing enhancements to system performance and capacity and support for software evolu- tion. This paper describes a new supercomputer system that went into operation in March 2007. It meets a variety of difficult requirements including support for a rap- idly growing database, provision of high-speed and stable search/analysis services, smooth migration from the old system, compatibility with existing user assets, high reliability, and support for security. This paper also introduces the contributions that this new system is making to life sciences research both inside and outside Japan.
1. Introduction issues and how it is contributing to Japanese and In February 2007, the National Institute of international research in the life sciences. Genetics (NIG) of the Research Organization of Information and Systems introduced a super- 2. DDBJ overview computer system consisting of large-scale CIB-DDBJ is working with EBI/EMBL4) servers, clusters of personal computers (PCs), in Europe and NCBI/GenBank5) in the USA as large-capacity storage, keyword search systems, part of the DDBJ/EMBL/GenBank International and other components. It has been in opera- Nucleotide Sequence Database Collaboration tion since March 2007. This supercomputer (INSDC).6) The INSDC databases consist of system aims to provide data collection, database deoxyribonucleic acid (DNA) (or ribonucleic acid construction, data release, and search and [RNA]) nucleotide sequences experimentally analysis services for the DNA Data Bank of determined by researchers around the world Japan (DDBJ)1)-3) operated by the Center for and collected and edited by three inter- Information Biology and DNA Data Bank of national databanks—DDBJ, EMBL, and Japan (CIB-DDBJ) at NIG. It is provided for GenBank—according to a data construction both NIG and research institutions inside and scheme established by these three parties. To the outside Japan for research purposes. user, this data is provided in the form of a unified In this paper, we describe how this super- computer file. The databases are prepared by computer system has solved a number of difficult editing DNA nucleotide sequence data received
442 FUJITSU Sci. Tech. J., Vol.44, No.4, pp.442-448 (October 2008) H. Sugawara et al.: Introduction of Supercomputer System for DNA Data Bank of Japan
directly from researchers, but they also include in data from here on. DNA data processed by the Japan Patent Office, Each of the INSDC databanks (DDBJ, European Patent Office, and United States Patent EMBL, and GenBank) makes its data accessible and Trademark Office. over the Internet and provides keyword search The nucleotide sequence database is a set and homology search services (discussed in more of data units called “entries”. In addition to the detail later). These databanks have, as a result, nucleotide sequence itself, each entry includes come to be used by researchers throughout the information about the researcher who determined world. The database provided by each databank the sequence plus related references, organism, may be copied and used by anyone in the world. and gene function and features. Data collect- However, the ongoing increase in the amount of ed by each databank is exchanged daily among data makes it difficult for a user to keep a copy the three databanks so that the same informa- up to date. For this reason, these search services, tion will be provided by them. DDBJ collects and which enable a user to use a personal computer edits about 20% of the data released by these to search for data in a database that is always three international databases. maintained with current data from around the When DDBJ first released its nucleotide world, are increasing in value, and they have sequence database in July 1987, it consisted of been in great demand by researchers. only 66 entries and 108 970 base pairs. In recent years, however, the INSDC databases have been 3. New computer system increasing at an annual rate of 130–150%. In supporting DDBJ operations September 2007, about 20 years after the initial To provide a dependable database that release, they had reached 76 273 345 entries and continues to grow at an explosive rate, as 79 706 204 461 base pairs. The growth in data described in the previous section, it is impor- registered in INSDC is shown in Figure 1. tant that the performance and capacity of the In addition, genome sequencers that can computer system be continuously upgraded and operate several tens to a hundred times as fast that software be allowed to evolve. Six years as conventional sequencers are now being intro- after the launch of the old system, it became duced, and this should only accelerate the growth difficult to support the rapid increase in data and the demand for new services. Plans for upgrading to a new system were drawn up in 2005–2006, culminating in the delivery of a .UCLEOTIDES new supercomputer system from Fujitsu. This %NTRIES
section summarizes the requirements of this new supercomputer system and explains how those
requirements were met. %NTRIES §
.UCLEOTIDES § 3.1 Requirements of new supercomputer system 1) Support explosive growth in data • Support a massive amount of nucleotide sequence data that increases by 1.5 times 3OURCE $$"* $$"*%-",'EN"ANK DATABASE GROWTH annually, making for about 400 million
Figure 1 entries in five years time INSDC database growth. • Shorten the update interval for all data
FUJITSU Sci. Tech. J., Vol.44, No.4, (October 2008) 443 H. Sugawara et al.: Introduction of Supercomputer System for DNA Data Bank of Japan
(from 3 to 2 months) database server, achieving high-speed and 2) Provide high-speed and stable stable response. search/analysis services 2) Provision of high-speed and stable • Achieve high-speed response and high search/analysis services throughput in homology search services • For keyword searches, it achieves a stable • Achieve high-speed and stable response in response within ten seconds for 30 million keyword search services regardless of search entries of complex data, even for complicat- conditions for data that is continuously ed search conditions. growing • For homology searches, it achieves improved 3) Achieve smooth migration from the old standalone performance five to eight times system and provide user-asset inheritance the old level and high-throughput perfor- and compatibility mance by parallel processing. • Achieve smooth and rapid migration of 3) Smooth migration from the old system and the large number of existing application user-asset inheritance and compatibility programs used for diverse services • For core data processing, it uses the Solaris 4) Achieve high reliability and provide security operating system (OS), the same as the old measures system. • Achieve a safe and secure system that • For homology search processing, it uses supports data exchange among the three Linux as the de facto standard OS. international databases and the provision • For analysis processing, it uses both Solaris of services to researchers inside and outside and Linux to provide optimal processing Japan according to the characteristics of the analy- sis program. 3.2 Features of the new supercomputer 4) High reliability and security measures system • It supports active maintenance and a redun- The new supercomputer system was devel- dant configuration and detects faults in oped to satisfy the above requirements. Its advance through a diagnostics monitoring features are summarized below. function. 1) Resource allocation and system design • It uses Symfoware and HiRDB, which are supporting explosive growth of data domestically developed RDBMS software. • For core data processing, the system combines large-scale symmetric multi- 3.3 Configuration and performance of new processing (SMP) servers, large- supercomputer system capacity storage, backup equipment, The main items of equipment making up and the latest relational database the newly introduced supercomputer system are management system (RDBMS) software listed below and shown in Figure 2. to achieve high-speed processing of large 1) Database construction/release servers volumes of data. Large-scale SMP server: PRIMEPOWER2500 • For homology searches and analysis process- (28 CPUs) ing, it uses a new PC-cluster system, Large-capacity storage: ETERNUS8000 achieving high-speed response and high (125 TB) throughput through distributed processing. Blade server: BladeSymphony • For keyword searches, it uses a large-scale 2) Homology search server extensible markup language (XML) PC cluster: PRIMERGY RX200 S3 (66 nodes)
444 FUJITSU Sci. Tech. J., Vol.44, No.4, (October 2008) H. Sugawara et al.: Introduction of Supercomputer System for DNA Data Bank of Japan
3) Keyword search server es. This system consists of ETERNUS8000 Large-scale XML database server: 125 TB high-capacity storage equipment and ShunsakuEngine (2520 CPUs) a BladeSymphony blade server centered on a 4) Analysis servers PRIMEPOWER2500 (featuring 28 CPUs and Large-scale SMP server: PRIMEPOWER2500 88 GB of memory) large-scale SMP server running (32 CPUs) the Solaris OS, which has a proven record of PC cluster: PRIMERGY RX200 S3 (66 nodes) stable operation of large-scale database systems. 5) Backup equipment Two types of database management software are Tape library: VD800 (750 TB) used for different applications: HiRDB is used for 6) Other servers/network equipment the database construction system and Symfoware Network server: PRIMERGY is used for the database release system. Both RX200 S3/RX300 S3 the database and release data are stored in Switches/routers: the ETERNUS8000 high-capacity storage equip- Catalyst6506/IPCOM S2400, etc. ment, and in addition to performing periodic The performances of the new supercomputer backups to VD800 backup equipment, the system system and the old system are compared in terms can perform remote backups of especially impor- of major components in Table 1. tant data to disk drives connected to servers belonging to the National Institute of Informatics. 4. Details of new genome The database construction system stores sequence database system nucleotide sequence analysis data submitted 4.1 Database construction/release servers mainly by researchers in Japan using either The database construction/access servers the SAKURA8) nucleotide sequence submis- constitute a system for uniformly managing sion system or a mass submission system. This DDBJ data and creating data for release purpos- data is checked and edited by specialists called
.ETWORK SERVERS 02)-%2'9 #ATALYST)0#/- )NTERNET 28 328 3 3
+EYWORD SEARCH SERVER (OMOLOGY SEARCH SERVER
3HUNSAKU%NGINE 02)-%2'9 8 3 #05S NODES
$ATABASE CONSTRUCTIONRELEASE SERVERS !NALYSIS SERVERS
02)-%0/7%2 #05S #05S
02)-%2'9 8 3 NODES "LADE3YMPHONY "ACKUP EQUIPMENT
%4%2.53 6$ 4" 4"
Figure 2 New supercomputer system.
FUJITSU Sci. Tech. J., Vol.44, No.4, (October 2008) 445 H. Sugawara et al.: Introduction of Supercomputer System for DNA Data Bank of Japan
“annotators” at NIG. Checked data is edited into making extensive use of multiple-alignment units of entries and passed on to the database analysis based on biological similarities among release system on a daily basis. nucleotide sequences and amino acid sequences. In addition to data received from the The DDBJ provides Web-based services11) database construction system, the database for conducting homology searches by the above release system edits updated data received software or multiple-alignment analysis (by from EMBL/GenBank also on a daily basis and CLUSTALW) of the INSDC database and converts it to DDBJ format. The system releases other databases having more than 70 million all daily edited data of the three internation- entries (as of November 2007), including a al databases. The entire database is released key-protein database. Search requests can be in entry (flat-file) format, DDBJ-XML format, made by E-mail or from programs using a Web INSDC-XML format, or FASTA format. Users application programming interface. Search may download data from a DDBJ anonymous results may be viewed on a Web browser or FTP site9) and may retrieve data using the received by E-mail. getentry10) search system (in units of entries) The homology search server uses a PC or by the homology/keyword search systems cluster consisting of 66 nodes (2 of which are described below. management nodes) of PRIMERGY RX200 S3 servers. Each node has two 3.0-GHz dual-core 4.2 Homology search server CPUs, 8 GB of memory, and 140 GB of hard disk A homology search searches for biological storage (RAID1) and runs 64-bit-mode Linux for similarity among DNA, RNA, and amino-acid the OS. sequences. In life sciences research, it is widely In response to a search request from a user, used for estimating genetic function, evolutionary the system automatically distributes the request analysis, etc. There are various types of software to all the nodes using a network queuing system. for conducting homology searches depending on This minimizes the waiting time and achieves the type of data targeted for retrieval and the a high-speed response and high throughput for algorithm used to conduct the search. At present, the user. The HMMPFAM service, in particu- researchers are making much use of BLAST, lar, achieves ultrahigh-speed processing by using PSI-BLAST, FASTA, SSEARCH, and HMMPFAM, parallelization software developed by Fujitsu for example. Life sciences researchers are also Laboratories. As a result of these performance improve- ments, the usage rate of the popular Table 1 BLAST/FASTA/CLUSTALW services increased Performance comparison with old system. by 10–30% in October 2007 over the same month 0ERFORMANCE 3ERVICEAPPLICATION %QUIPMENT COMPARISON a year earlier. 5.)8 SERVER $ATABASE "LADE SERVER !BOUT TIMES 4.3 Keyword search server CONSTRUCTIONRELEASE 3TORAGE The DDBJ provides the ARSA12),13) integrat- +EYWORD SEARCH 3HUNSAKU%NGINE !BOUT TIMES ed keyword search service targeting INSDC (OMOLOGY SEARCH 0# CLUSTER !BOUT TIMES and 24 other important databases in the life 5.)8 SERVER sciences. In past keyword search systems such !NALYSIS !BOUT TIMES 0# CLUSTER as SRS,14) the deterioration in response time 3TORAGE 3!. STORAGE !BOUT TIMES accelerated as search conditions became more "ACKUP EQUIPMENT 4APE ARCHIVE