
SWPepNovo: An Efficient De Novo Peptide Sequencing Tool for Large-scale MS/MS Spectra Analysis

Chuang Li, Kenli Li, Keqin Li, Xianghui Xie, Feng Lin

Abstract—Tandem mass spectrometry (MS/MS)-based de novo peptide sequencing is a powerful method for high-throughput protein analysis. However, the explosively increasing size of MS/MS spectra datasets inevitably and exponentially raises the computational demand of existing de novo peptide sequencing methods, an issue that urgently needs to be solved in computational biology. This paper introduces SWPepNovo, an efficient tool based on the SW26010 many-core processor, to process large-scale peptide MS/MS spectra using a parallel peptide-spectrum matches (PSMs) algorithm. Our design employs a two-level parallelization mechanism: (1) task-level parallelism between MPEs using MPI, based on a data transformation method and a dynamic feedback task scheduling algorithm, and (2) thread-level parallelism across CPEs using asynchronous task transfer and multithreading. Moreover, three optimization strategies, including vectorization, double buffering and memory access optimization, have been employed to overcome both the compute-bound and the memory-bound bottlenecks in the parallel PSMs algorithm. The results of experiments conducted on multiple spectra datasets demonstrate the performance of SWPepNovo against three state-of-the-art tools for peptide sequencing, including PepNovo+, PEAKS and DeepNovo-DIA. SWPepNovo also shows high scalability in experiments on extremely large datasets sized up to 11.22 GB. The software and the parameter settings are available at https://github.com/ChuangLi99/SWPepNovo.

Index Terms—Large-scale MS/MS spectra analysis, de novo peptide sequencing, high performance computing, SW26010

Chuang Li, Kenli Li and Keqin Li are with the College of Computer Science and Electronic Engineering, Hunan University, and the National Supercomputing Center in Changsha, Changsha, 410082, China (e-mail: [email protected], [email protected]).
Keqin Li is also with the Department of Computer Science, State University of New York, New Paltz, NY 12561, USA (e-mail: [email protected]).
Xianghui Xie is with the State Key Laboratory of Mathematical Engineering and Advanced Computing, Jiangnan Institute of Computing Technology, Wuxi, Jiangsu, China (e-mail: [email protected]).
Feng Lin is with the School of Computer Science and Engineering, Nanyang Technological University, 639798, Singapore (e-mail: asfl[email protected]).

I. INTRODUCTION

In the post-genomic era, proteomics has become one of the most active research fields, and mass spectrometry has developed into a leading technology for large-scale analysis of proteins, including high-throughput analysis of proteins and determination of their primary structures [1]. There are two basic methods for protein analysis using MS/MS spectra: database-search based peptide sequencing and de novo peptide sequencing [2].

Database-search based peptide sequencing, which aims at retrieving all candidate sequences from a specified protein sequence database for each MS/MS spectrum [3], is a widely used method for protein analysis [3], [4], [5], [6]. In database searching, it is generally assumed that the genomes are precisely sequenced and that the protein-coding genes and RNA genes are completely annotated. The latter assumption is not satisfactory, because many alternatively spliced genes do not exist in currently available databases [7]. The major limitation of this method is its heavy dependence on the protein database. In addition, because relatively simple scoring modules are used, identifications are easily missed by database searching. Thus, database searching methods cannot provide a complete solution for protein analysis.

Many efforts have been directed to the development of de novo sequencing methods for protein analysis. Various de novo sequencing methods, such as PEAKS [8], PepNovo+ [7], pNovo [9] and UniNovo [10], have been developed in recent years. De novo sequencing can directly extract a peptide sequence from an MS/MS spectrum without knowledge of the organism or even its genomic sequences [11], and it can handle post-translational modifications (PTMs), sequence variations and mass spectra with low signal-to-noise ratios that cannot be effectively processed by database searching methods [12]. As a consequence, de novo peptide sequencing, as an irreplaceable tool for discovering new proteins and PTMs, is now widely acknowledged in proteomics research.

However, the amount of MS/MS spectra data has been increasing sharply in recent years, benefiting from technological breakthroughs in modern mass spectrometry [13]. Moreover, protein and peptide analysis criteria have become more demanding, e.g., when chemical and post-translational modifications are considered and/or when semi-unconstrained enzyme searches are performed [14]. Accordingly, analyzing this huge amount of MS/MS data using de novo peptide sequencing becomes a significant challenge for proteome researchers. Without more powerful and efficient de novo peptide sequencing algorithms, the expected computational bottleneck will limit the scope of discoveries to small-scale MS/MS spectra data. A breakthrough in efficient de novo sequencing is therefore crucial for large-scale protein analysis in computational biology [15].

Fortunately, various high performance computing systems (HPCS), such as the Intel Many Integrated Core architecture (MIC) [16] and the Graphics Processing Unit (GPU) [17], have recently been developed to improve computational efficiency. GPUs support parallel programming languages such as the Open Computing Language (OpenCL) [18] and the Compute Unified Device Architecture (CUDA) [19], which are widely used in computational biology research. The MIC architecture, which contains 60+ cores and 512-bit-wide vector units, is a coprocessor designed for highly parallel multithreaded applications with high memory requirements. A similar architecture, the SW26010 many-core processor, has recently been developed at the National Research Center of Parallel Computer Engineering & Technology and can be applied to protein and peptide analysis. This paper proposes an efficient parallel PSMs algorithm for large-scale MS/MS spectra data analysis on the SW26010. The main contributions and innovations of this study can be summarized as follows:

1) We design and implement the parallel PSMs algorithm using a two-level parallelization mechanism. To the best of our knowledge, our algorithm is the first attempt to improve the efficiency of large-scale MS/MS spectra data analysis and processing.
2) We present a highly effective, structurally optimized MS/MS data organization to overcome the memory access bandwidth bottleneck, and we propose a highly scalable intra-MPE communication scheme that achieves a parallelization efficiency of over 85%.
3) We adopt the SW26010 processor for large-scale protein analysis using the parallel PSMs algorithm. In the design and realization, we also employ asynchronous task transfer and propose a series of effective optimization strategies to decrease the communication costs between the management processing elements (MPEs) and the computing processing elements (CPEs) and to balance the workload on each CPE, which results in a 10x speedup compared with the un-optimized version.
4) We also demonstrate the scalability of SWPepNovo by scaling the size of the datasets and the number of SW26010 nodes. We obtain an ideal speedup on a multi-node cluster that contains three SW26010 processors with a total of 4096 CPEs. Experimental results show that our method achieves excellent scalability without sacrificing accuracy or correctness in the de novo peptide sequencing results.

We believe that the techniques we use can guide the design of similar work on the neo-heterogeneous SW26010 many-core architecture. The software and the parameter settings are available from https://github.com/ChuangLi99/SWPepNovo. For users without access to TaihuLight, SWPepNovo can be run as a multi-threaded (OpenMP) application on an MPI cluster.

The rest of this paper is organized as follows. Section II gives the background on MS/MS-based de novo peptide sequencing, the Sunway TaihuLight and the related work. Section III provides details of the computational design, and Section IV presents the optimization strategies. The experimental performance is evaluated in Section V, where we also validate our results against a previous study. Finally, Section VI concludes the paper.

II. BACKGROUND

In this section, we start with an introduction to the background knowledge of MS/MS-based de novo peptide sequencing and the SW26010 many-core processor (the main processor of the Sunway TaihuLight supercomputer), and then present the existing parallel works in protein and peptide mass spectra analysis.

A. De Novo Peptide Sequencing

De novo peptide sequencing aims to deduce an amino acid sequence from an MS/MS spectrum without the use of a protein sequence database. Figure 1 shows the processing flow of MS/MS spectra analysis using de novo sequencing methods, which mainly includes three key parts:

Fig. 1: Workflow of the de novo peptide sequencing. The experimental stage takes a peptide mixture through high performance liquid chromatography to MS/MS spectra; the data analysis stage builds a spectrum graph, scores candidates and outputs a sequence (e.g., SGFLEEDELK).

1) Experimental spectra generation: First, the mixed proteins are digested into mixed peptides by enzymes. Then the peptides are fragmented and ionized (e.g., by higher-energy collisional dissociation (HCD) [20] or collision-induced dissociation (CID) [21]) in liquid chromatography tandem mass spectrometry (LC-MS/MS). Finally, the MS/MS spectra are output. Figure 2 shows an MS/MS spectrum, which contains the measured m/z and intensity of the fragments, represented by the peaks [10]. Different ionization methods have a dramatic impact on the propensities for producing particular fragment ion types. For example, in CID there are six series of fragment ions, denoted by the C-terminal x, y and z types and the N-terminal a, b and c types [12], as shown in Figure 3.

Fig. 2: An example of an MS/MS spectrum, annotated with the b- and y-ion series of the peptide SGFLEEDELK.
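To make the fragment-ion notation above concrete, the short C sketch below computes singly charged b- and y-ion m/z values for a peptide such as SGFLEEDELK from approximate monoisotopic residue masses. It is an illustrative reconstruction, not part of SWPepNovo; the residue mass table and the proton/water constants are standard values assumed here only for demonstration.

#include <stdio.h>
#include <string.h>

/* Approximate monoisotopic residue masses (Da); illustrative values only. */
static double residue_mass(char aa) {
    switch (aa) {
    case 'G': return 57.02146;  case 'A': return 71.03711;
    case 'S': return 87.03203;  case 'P': return 97.05276;
    case 'V': return 99.06841;  case 'T': return 101.04768;
    case 'L': case 'I': return 113.08406;
    case 'D': return 115.02694; case 'E': return 129.04259;
    case 'K': return 128.09496; case 'F': return 147.06841;
    default:  return 0.0;       /* extend for the remaining residues */
    }
}

int main(void) {
    const char *pep = "SGFLEEDELK";
    const double PROTON = 1.00728, WATER = 18.01056;
    size_t n = strlen(pep);
    double prefix = 0.0, total = 0.0;

    for (size_t i = 0; i < n; ++i) total += residue_mass(pep[i]);

    /* b_i = prefix mass + proton; y_i = suffix mass + water + proton. */
    for (size_t i = 0; i < n - 1; ++i) {
        prefix += residue_mass(pep[i]);
        double b = prefix + PROTON;
        double y = (total - prefix) + WATER + PROTON;
        printf("b%zu = %8.3f   y%zu = %8.3f\n", i + 1, b, n - 1 - i, y);
    }
    return 0;
}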

Fig. 3: The types of fragment ions: the N-terminal a, b, c and C-terminal x, y, z cleavage positions along the peptide backbone.

2) Spectrum graph generation: The spectrum graph is constructed by transforming an effective peak set into a graph in which each node is an m/z value. Two nodes are connected if the difference between their m/z values equals an amino acid mass.

3) Match scoring: First, the candidate peptides are reconstructed from the spectrum graph. A candidate peptide is a path whose accumulated weight equals the parent mass. Then, the similarity between the experimental MS/MS spectrum and the candidates is calculated using a scoring algorithm, and the top-scoring candidate(s) form the result of the calculation. Finally, the identified MS/MS spectra are merged into the final peptide sequence.
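As an illustration of the spectrum-graph step, the following C sketch connects peaks whose m/z difference matches an amino acid residue mass within a tolerance. It is a simplified, assumption-laden sketch (a single-charge interpretation only, with a hypothetical AA_MASS table and tolerance), not the exact graph construction used by SWPepNovo or PepNovo+.

#include <math.h>
#include <stdlib.h>

#define NUM_AA 20

/* Hypothetical table of the 20 residue masses (Da), filled in elsewhere. */
extern const double AA_MASS[NUM_AA];

typedef struct {
    int     n_nodes;
    double *mz;       /* node i corresponds to peak i (sorted by m/z)  */
    char   *adj;      /* adj[i * n_nodes + j] = 1 if there is an edge i -> j */
} SpectrumGraph;

/* Connect node i to node j when mz[j] - mz[i] matches a residue mass. */
void build_spectrum_graph(SpectrumGraph *g, const double *mz, int n, double tol)
{
    g->n_nodes = n;
    g->mz  = (double *)malloc(sizeof(double) * n);
    g->adj = (char *)calloc((size_t)n * n, 1);

    for (int i = 0; i < n; ++i) g->mz[i] = mz[i];

    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            double diff = mz[j] - mz[i];
            for (int a = 0; a < NUM_AA; ++a) {
                if (fabs(diff - AA_MASS[a]) <= tol) {
                    g->adj[i * n + j] = 1;   /* edge labelled with residue a */
                    break;
                }
            }
        }
    }
}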
B. Sunway TaihuLight Supercomputer

The Sunway TaihuLight supercomputer, developed by the NRCPC, is ranked third in the latest TOP500 list of October 2018 and is installed at the National Supercomputing Center in Wuxi. The peak performance of Sunway TaihuLight is 125.436 Pflops, and the sustained LINPACK performance is 93.015 Pflops, leading to a performance-per-watt of 6051 MFLOPS/W [22]. The Sunway TaihuLight architecture consists of four levels: the entire computing system, a super node, a cabinet, and a computing node. All computing nodes in the system are connected by the Sunway Network, a customized network. In total, Sunway TaihuLight provides 1024 TB of memory and 20 PB of storage. The system software is the 64-bit Sunway RaiseOS [23].

The SW26010 is the core of the Sunway TaihuLight system; its general architecture is illustrated in Figure 4 [24]. The processor includes four core groups (CGs), each equipped with a single management processing element (MPE), an 8x8 array of computing processing elements (CPEs), one memory controller (MC) and 8 GB of physical memory [22].

Fig. 4: The architecture of the SW26010 processor: four core groups, each containing an MPE, an 8x8 CPE cluster with LDM and register communication, a memory controller and DDR3 memory, connected by a network-on-chip (NoC).

The SW26010 processor supports two user programming modes: (1) the chip-sharing mode and (2) the CG-private mode. The chip-sharing mode is offered specifically for applications with high memory requirements. In the CG-private mode, the SW26010 processor behaves as a NUMA (non-uniform memory access) architecture.

To some degree, programming on the CPE cluster is similar to GPU programming. The MPE plays the role of a normal CPU and is mainly responsible for task management, input/output and communication. The CPE cluster plays the role of the accelerator card, which determines the computational power [25]. The peptide-spectrum match scoring in de novo peptide sequencing requires most of the computing power, so to run de novo peptide sequencing with favorable performance, CPE-side optimization mechanisms are indispensable.

C. Related Works

Previous research has shown that many efforts to accelerate protein identification have focused on parallel database-search based peptide sequencing. Lee adopted a graph-based in-memory distributed system to develop a novel sequence alignment algorithm [26]. Qi You developed a fast tool that can efficiently analyze genome-editing datasets [27]. In [28], Li considered the redundant candidate peptides in PSMs and adopted an inverted-index strategy to speed up tandem mass spectrometry analysis. In addition, some of the prevalent peptide sequencing methods have adopted high performance computing (HPC) technology and cloud computing [29]. Notably, Zhu presented an efficient OpenGL-based multiple peptide sequence alignment implementation on GPU hardware [30]. In [31], a GPU-based feature detection algorithm was presented by Hussong to reduce the running time of PSMs.

As the other most powerful method for protein analysis, de novo peptide sequencing has drawn limited attention in proteomics. A pioneering work on speeding up de novo peptide sequencing was done by Frank [32], who presented a discriminative, boosted ranking-based match scoring algorithm that uses machine-learning ranking algorithms and produces identical speedup results while maintaining the same identification results. Another efficient real-time de novo sequencing algorithm, namely Novor, was recently presented by Ma [33]; compared with other de novo peptide sequencing methods, Novor shows a very fast sequencing speed. PEAKS [8], also developed by Ma, does the best job of accelerating de novo peptide sequencing. Although PEAKS achieves great performance in both the speed and the accuracy of de novo peptide sequencing analyses, it is commercial software that is not freely available to academic users.

Recently, the Sunway TaihuLight supercomputer has provided tremendous compute power to researchers, and there are a few early development experiences on it. Chen et al. [34] designed and implemented a parallel AES algorithm, and the results show that the parallel AES algorithm achieves a good speedup. Fang et al. [35] implemented and optimized a library, namely swDNN, that supports efficient deep neural network (DNN) implementations on the Sunway TaihuLight supercomputer. In [36], an SW26010-based programming framework was presented for a Sea Ice Model (SIM) algorithm; according to the experimental results, the programming framework for the SIM algorithm offers up to a 40% performance increase.

III. COMPUTATIONAL DESIGN

In this section, we present the efficient de novo peptide sequencing method for large-scale MS/MS spectra analysis on the SW26010. Our algorithm design benefits both from the inherent parallelism of the match scoring process and from the fact that each CPE in a CG can perform eight simultaneous single-precision or double-precision floating-point operations, and it combines: (1) task-level parallelism between MPEs using a data transformation method and a dynamic feedback task scheduling algorithm, and (2) thread-level parallelism across CPEs using asynchronous task transfer and multithreading. The diagram in Figure 5 illustrates the algorithm framework of our implementation.

A. The Task-level Parallelism between MPEs

In the task-level parallel part, when handling a large-scale MS/MS spectra dataset, the dataset is divided into properly sized chunks depending on the peptide length and on an integral multiple of the number of CGs. Then, every MPE in a CG processes these chunks in a coarse-grained parallel fashion. Since each CG in the SW26010 has its own dedicated random access memory (RAM), and the communication mode is limited to register communication, the assignment and scheduling of tasks is critical to improving the parallel performance. To support de novo peptide sequencing tasks for large-scale datasets effectively and efficiently, we use an MS/MS data transformation method to optimize the original sequence reading format and implement a dynamic distribution framework to assign MS/MS data chunks to the MPEs. The details of the data transformation and the dynamic distribution framework are given in the following subsections.

1) Data Transformation: In order to better match the SW26010's capabilities, our parallel implementation does not load the MS/MS spectra dataset directly, but first transforms it into a structured format. As Figure 6 shows, the transformation process consists of two steps: sorting by spectrum sequence length, and concatenation.

Fig. 6: The MS/MS data transformation: (a) original MS/MS spectra, (b) sorted MS/MS spectra, (c) sets of concatenated spectrum groups separated by spectrum terminators.

Sorting: In the SW26010 processor architecture, each MPE manages 64 CPEs for parallel processing. This means that the threads on the CPEs have to wait for each other to finish their workload instead of continuing independently. Therefore, for the purpose of shortening the waiting time, all the MS/MS spectra are sorted by spectrum length to minimize the deviation between neighboring threads. The time complexity of this sorting process is O(N²), where N is the number of amino acid sequences.

Concatenation: Even though sorting by length somewhat balances the workload contributed by each MS/MS spectrum, different MS/MS spectra still have different parent masses. To overcome this issue, the spectra within a chunk are concatenated with one another to form spectrum groups, whose lengths within a spectra chunk are nearly equal to the size of the biggest spectrum in that set. Each thread achieves workload balancing by inserting spectrum terminators between the concatenated spectra. The time complexity of this step is O(N³), where N is the number of amino acid sequences.

2) Dynamic Distribution Framework: As shown in Figure 7, a random set of chunks is first executed to explore the availability of the CGs, which yields information about computing time and load balancing. Then, based on the feedback from the previous step, we define a feedback regulatory factor for each CG. Finally, we adjust the number of chunks assigned to each CG according to the feedback regulatory factor of its MPE.

Fig. 7: A flowchart of the task distribution framework: a task distributor assigns chunks to the MPEs of CG1-CG4 and adjusts the assignment according to their feedback.

In the implementation, choosing appropriate load parameters as the feedback regulatory factor is very important for eliminating system bottlenecks and balancing the load dynamically.
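To make the two-step transformation described above (sorting and concatenation) concrete, the C sketch below sorts spectra by length and then packs them into fixed-size groups separated by a terminator marker. The Spectrum type, the terminator value and the simple grouping rule are simplifying assumptions for illustration; they do not reproduce SWPepNovo's exact on-disk format.

#include <stdlib.h>
#include <string.h>

typedef struct {
    int     length;     /* number of peaks in the spectrum */
    double *mz;         /* peak m/z values                 */
} Spectrum;

#define SPEC_TERMINATOR (-1.0)   /* assumed sentinel between spectra */

static int by_length(const void *a, const void *b)
{
    return ((const Spectrum *)a)->length - ((const Spectrum *)b)->length;
}

/* Sort spectra by length, then concatenate them into one flat buffer,
 * group by group, inserting a terminator after each spectrum so that a
 * CPE thread can detect spectrum boundaries.                           */
double *pack_spectrum_groups(Spectrum *s, int n, int group_size, size_t *out_len)
{
    qsort(s, n, sizeof(Spectrum), by_length);

    size_t total = 0;
    for (int i = 0; i < n; ++i) total += (size_t)s[i].length + 1;

    double *buf = malloc(total * sizeof(double));
    size_t  pos = 0;
    for (int g = 0; g < n; g += group_size) {
        int end = (g + group_size < n) ? g + group_size : n;
        for (int i = g; i < end; ++i) {
            memcpy(buf + pos, s[i].mz, s[i].length * sizeof(double));
            pos += s[i].length;
            buf[pos++] = SPEC_TERMINATOR;
        }
    }
    *out_len = pos;
    return buf;
}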

Fig. 5: The algorithm framework of our implementation: a two-level parallelization scheme on the SW26010 many-core architecture. A sequential phase performs Stage 1 (MS/MS spectrum data transformation) and Stage 2 (preprocessing of the spectrum files and parameters). A parallel phase performs Stage 3 (spectrum graph construction on the MPEs, coarse-grained parallel processing), Stage 4 (candidate peptide reconstruction and scoring on the CPEs, fine-grained parallel processing) and Stage 5 (write-back of the best matching result to the MPE), with a dynamic task distributor feeding the MPEs of CG1-CG4 from a task pool and a result collector gathering the outputs. The scheme combines (1) task-level parallelism between MPEs using a dynamic feedback task distribution method (based on MPI), and (2) thread-level parallelism across CPEs (based on aThread).

Our first priority is the CPE utilization of the SW26010 processor. To calculate the CPE utilization, we extract the real-time information parameters from the file /proc/stat. In addition, the length of the task queue determines whether the scheduler can keep up with the system requirements: if the computing time of a task queue is too long, the many-core system will be in an overloaded state. Thus, the average task queue length is also an important input to the feedback regulatory factor. In our experiments, we obtained the average queue length of a single CPE from the related parameters in the file /proc/loadavg. The detailed steps of the dynamic feedback task scheduling process are described in Figure 8. The experimental results show that the dynamic feedback task distribution keeps the system load imbalance below 9 percent in most cases.

Fig. 8: Feedback-and-adjustment-based dynamic task scheduling process. (1) Users choose an MPE in the computing environment service as the task scheduling host. (2) The scheduling host uses the configuration requirements to filter the static resource information and accesses the real-time information of the MPE computing resources through the NoC network. (3) The scheduling MPE distributes chunks to the other MPEs, monitors the execution status of the sequencing tasks and collects the computing results. (4) According to the changes of the feedback regulatory factor, the geometric weighted coefficient is adjusted, and the process returns to step 2. The procedure completes when the load on each CG is balanced.
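A minimal sketch of this dynamic feedback distribution loop is shown below, assuming MPI (which the paper uses at the task level). The feedback regulatory factor here is a simple weighted combination of the reported CPE utilization and queue length; the exact weighting, message tags and chunk encoding used by SWPepNovo are not specified in the text, so they are placeholders.

#include <mpi.h>

#define TAG_FEEDBACK 1
#define TAG_CHUNK    2

typedef struct {
    double cpe_util;     /* from /proc/stat on the worker MPE        */
    double queue_len;    /* average task queue length, /proc/loadavg */
} Feedback;

/* Rank 0 acts as the scheduling host: it assigns more chunks to the
 * core groups whose feedback regulatory factor indicates spare capacity. */
void distribute_chunks(int n_chunks, int n_workers)
{
    int next_chunk = 0;
    while (next_chunk < n_chunks) {
        Feedback fb;
        MPI_Status st;
        MPI_Recv(&fb, sizeof fb, MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_FEEDBACK, MPI_COMM_WORLD, &st);

        /* Feedback regulatory factor: low utilization and a short queue
         * mean the worker can take a larger batch (placeholder weights). */
        double factor = 0.7 * (1.0 - fb.cpe_util)
                      + 0.3 * (1.0 / (1.0 + fb.queue_len));
        int batch = 1 + (int)(factor * 4.0);
        if (batch > n_chunks - next_chunk) batch = n_chunks - next_chunk;

        int msg[2] = { next_chunk, batch };   /* first chunk id, count */
        MPI_Send(msg, 2, MPI_INT, st.MPI_SOURCE, TAG_CHUNK, MPI_COMM_WORLD);
        next_chunk += batch;
    }
    /* Tell every worker that no chunks remain. */
    for (int w = 1; w <= n_workers; ++w) {
        int done[2] = { -1, 0 };
        MPI_Send(done, 2, MPI_INT, w, TAG_CHUNK, MPI_COMM_WORLD);
    }
}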

3) The intra-MPE Communication Scheme: Because the CPEs do not have a cache and the latency of accessing DDR3 memory is fairly high, explicitly managing the use of the local device memory (LDM) in the SW26010 architecture is critical. Direct memory access (DMA) is widely acknowledged as an efficient method to transfer data between the LDM and main memory. Thus, we adopt the asynchronous DMA-fetching strategy provided by Sunway TaihuLight to overlap data transfers from shared memory to local device memory with computation. Figure 9 illustrates this strategy.

Fig. 9: Asynchronous data transfer strategy. While the chunk in one buffer is being scored, the subsequent chunk is fetched into the other buffer using DMA-fetching intrinsics.

B. The Thread-level Parallelism across CPEs

In the neo-heterogeneous SW26010 many-core architecture, the CPE cluster plays the role of a coprocessor and dominates the computing power. In order to fully utilize the parallel computing power of the CPE cluster, it is vital to coordinate the relationship between the CPEs and the MPE. The SW26010 supports four kinds of programming models for combining the CPEs and the MPE, which can be used to implement and optimize parallel applications. When employing the dynamic parallel mode, the task distribution between the CPEs has to be handled carefully. In our implementation, we adopt the dynamic parallel programming model shown in Figure 10. The MPE and the CPEs in the SW26010 serve different functions during the computation: the MPE serves as a job manager responsible for task allocation, while each CPE is used as a computing node that is primarily responsible for receiving and executing tasks and returning the results to the corresponding MPE.

Fig. 10: The dynamic parallel programming model: the CPEs ask the MPE for their initial data, compute, and the MPE collects the results.

In the thread-level parallel part, the chunks are consecutively loaded into a separate CG for de novo sequencing. The peptide sequencing executed on a CG consists of four phases. In the first phase, each MS/MS spectrum is converted to a spectrum graph on the MPE; the spectrum graph is constructed by transforming an effective peak set into a graph in which each node is an m/z value and two nodes are connected if the difference of their m/z values equals an amino acid mass. In the second phase, the candidate peptide dataset is reconstructed based on the spectrum graph; a candidate peptide is a path whose sum of weights equals the parent mass. In the third phase, each MPE distributes each spectrum graph and the corresponding candidate peptides to the CPE cluster. In the last phase, the candidate peptides are scored on the CPE cluster. The pseudo-code of the parallel PSMs algorithm is shown in Algorithm 1. The parallelization at this level is implemented through a special accelerated threading library, called aThread.

Algorithm 1 Parallel PSMs algorithm
Require: the dataset of candidate peptides and the experimental spectra. spectra: experimental spectra; graph: the spectrum graph; paths: the dataset of candidate peptides; path_i: the i-th candidate peptide.
Ensure: the top-one score of each path
1:  athread_init();                        // initialization
2:  athread_set_num_threads(NUM_THREADS);
3:  for each spectrum in spectra do
4:      search graph, get paths
5:      for each path_i in paths do
6:          para <- parameters of match scoring
7:          athread_spawn(path_Slave, &para);
8:          athread_join();
9:      end for
10: end for
11: athread_halt();
12: return

We further employ an optional asynchronous task-loading strategy, as shown in Figure 11. First, we designate a list of processes that only load reads. While the other processes in the active run-queue are computing scores, these processes are responsible for loading new spectra chunks. Once the loading and the computation are completed, all processes in the run-queue receive the loaded data. When using 8 processes in our experiments, this strategy reduces the computational idle time between two read blocks by more than 85%.
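Since the paper notes that SWPepNovo can also run as a multi-threaded (OpenMP) application on an MPI cluster, the sketch below shows how the scoring loop of Algorithm 1 might look in such a fallback mode. The score_candidate function, the data types and the best-score reduction are assumptions made for illustration; on the real SW26010 the same loop body would run inside the aThread slave function instead.

#include <omp.h>
#include <float.h>

typedef struct { const double *mz; int n_peaks; double parent_mass; } SpectrumData;
typedef struct { const char *seq; int len; } Candidate;

/* Hypothetical scoring routine: similarity between a candidate peptide
 * and the experimental spectrum (higher is better).                     */
extern double score_candidate(const Candidate *c, const SpectrumData *s);

/* Score all candidate peptides of one spectrum in parallel and return
 * the index of the best-scoring candidate (the "top one" of Algorithm 1). */
int score_all_candidates(const Candidate *cands, int n_cands,
                         const SpectrumData *spec, double *best_score)
{
    int    best_idx = -1;
    double best     = -DBL_MAX;

    #pragma omp parallel
    {
        int    local_idx  = -1;
        double local_best = -DBL_MAX;

        #pragma omp for schedule(dynamic, 16)
        for (int i = 0; i < n_cands; ++i) {
            double s = score_candidate(&cands[i], spec);
            if (s > local_best) { local_best = s; local_idx = i; }
        }

        #pragma omp critical
        if (local_best > best) { best = local_best; best_idx = local_idx; }
    }
    *best_score = best;
    return best_idx;
}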

IV. OPTIMIZATION STRATEGIES

In our experiments, the dataset size reaches up to 10 GB and excessive data replication exists in the scoring process. Thus, a naive parallel implementation is not a perfect solution. To get better performance for large-scale de novo peptide sequencing, we employ three optimization strategies, namely vectorization, memory access optimization and double buffering. The performance increase obtained by these optimization strategies is shown in Table I.

Fig. 11: The asynchronous task-loading strategy: (a) the normal case, in which every process loads the parameters, builds the index and loads the reads before each broadcast; (b) with asynchronous task-loading, dedicated reading processes load the next spectra chunk while the mapping processes compute, and the reads are then broadcast to all processes.
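The asynchronous task-loading idea in panel (b) can be sketched with MPI non-blocking broadcasts: a dedicated reader rank loads the next spectra chunk while the other ranks score the current one. MPI_Ibcast is a standard MPI-3 call; the chunk layout and the reader-rank convention are assumptions made for this illustration, not SWPepNovo's exact protocol.

#include <mpi.h>

#define READER_RANK 0            /* assumed dedicated reading process */
#define CHUNK_DOUBLES (1 << 20)

extern int  load_next_chunk(double *buf);            /* reader-side I/O   */
extern void score_chunk(const double *buf, int len); /* compute-side work */

void pipeline_chunks(int my_rank, int n_chunks)
{
    static double buf[2][CHUNK_DOUBLES];
    MPI_Request req[2];

    /* Reader loads chunk 0, then everybody starts its broadcast. */
    if (my_rank == READER_RANK) load_next_chunk(buf[0]);
    MPI_Ibcast(buf[0], CHUNK_DOUBLES, MPI_DOUBLE, READER_RANK,
               MPI_COMM_WORLD, &req[0]);

    for (int c = 0; c < n_chunks; ++c) {
        int cur = c & 1, nxt = cur ^ 1;

        /* Reader overlaps loading of chunk c+1 with the others' scoring. */
        if (c + 1 < n_chunks) {
            if (my_rank == READER_RANK) load_next_chunk(buf[nxt]);
            MPI_Ibcast(buf[nxt], CHUNK_DOUBLES, MPI_DOUBLE, READER_RANK,
                       MPI_COMM_WORLD, &req[nxt]);
        }

        MPI_Wait(&req[cur], MPI_STATUS_IGNORE);       /* chunk c is ready */
        if (my_rank != READER_RANK) score_chunk(buf[cur], CHUNK_DOUBLES);
    }
}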

TABLE I: Running time before and after optimization

Optimization technique          Running time (seconds)   Benefit
Before optimization             418 s                    0%
Vectorization                   385 s                    7.8%
Memory access optimization      379 s                    9.3%
Double buffering                364 s                    12.9%
With all optimizations          352 s                    19.5%

A. Vectorization

Vectorization is critical for running the code efficiently on neo-heterogeneous systems. In the Sunway TaihuLight architecture, each CPE in a CG can process eight floating-point operations within an instruction cycle, and the SW26010 many-core processor offers a dedicated SIMD processing unit and corresponding instructions. Moreover, the automatic vectorization of the original code does not generate an efficient binary. Therefore, vectorization is a key point of the optimization process for an efficient implementation of de novo peptide sequencing.

In our implementation, since data dependences exist in the innermost loop, the first task is to modify the dependent statements to eliminate those dependences. Then, to achieve an efficient utilization of all available CPE computing resources through vectorization, we adopt an inter-loop vectorization operation to manually expand the inner loop. When variable mapping operations occur in a function, a SIMD store or a SIMD load, the data must be aligned on a 32-byte boundary; in the implementation we therefore apply data padding to make each memory access naturally aligned. As the key step of the entire optimization process, the vectorization technique achieved a performance of 57 Gflops.
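The fragment below sketches how an innermost scoring loop can be manually expanded for the SW26010 SIMD unit (256-bit wide, i.e. four doubles per operation). The doublev4 type and the simd_load/simd_store intrinsics follow the usage shown in public Sunway programming materials, but the exact header contents and the scoring arithmetic here (a simple weighted sum over 32-byte-aligned, padded arrays) are assumptions rather than SWPepNovo's actual kernel.

/* Compiled for the CPE with the Sunway slave compiler; simd.h is assumed
 * to provide the doublev4 vector type and simd_load/simd_store intrinsics. */
#include <simd.h>

/* Weighted sum of matched peak intensities; n is padded to a multiple
 * of 4 and both arrays are assumed to be 32-byte aligned.               */
double vector_score(const double *match, const double *weight, int n)
{
    double   zero[4] __attribute__((aligned(32))) = {0.0, 0.0, 0.0, 0.0};
    doublev4 vsum;
    simd_load(vsum, zero);

    for (int i = 0; i < n; i += 4) {
        doublev4 vm, vw;
        simd_load(vm, &match[i]);     /* four matched intensities  */
        simd_load(vw, &weight[i]);    /* four ion-type weights     */
        vsum += vm * vw;              /* element-wise multiply-add */
    }

    double lanes[4] __attribute__((aligned(32)));
    simd_store(vsum, lanes);          /* horizontal reduction      */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}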
B. Memory Access Optimization

As described in Figure 4, each CPE contains a user-controlled scratch pad memory (SPM). Replacing caches with a controlled scratch pad memory gives programmers more initiative and efficiency. Meanwhile, the compiling system supports a collective communication interface and DMA intrinsics, which provide an asynchronous transmission mode between the scratch pad memory of a CPE and the main memory of the MPE. If the amount of data transmitted is a multiple of 128 bytes, the maximum peak performance of the DMA intrinsics can be achieved. Typically, DMA supports three modes of data transmission. In the single-CPE mode, each scratch pad memory exchanges data with the main memory individually. In the broadcast mode, data in the main memory are scheduled to all CPEs. In the single-row mode, each row of SPMs transfers data with the main memory. In our work, we use the single-CPE mode to design and implement the parallel PSMs algorithm, which makes full use of the compute resources.

During the optimization process, each scratch pad memory sequentially receives a dataset of candidate peptides from the main memory. Each CPE completes the scoring and sends its top-one result back to the main memory. Note that the amount of data transferred must be a multiple of 128 bytes. The DMA intrinsics reduce the number of memory accesses during each round of scoring, and we obtain optimal performance and a higher speedup on the SW26010 by using them.

C. Double Buffering

Although the DMA intrinsics can reduce the cost of memory access, the parallel efficiency and scalability still have a huge margin for improvement, especially in the candidate peptide scoring part. For the best performance and minimal communication consumption, we use a double-buffering mechanism to overlap the communication cost with the computation cost. Our design is based on the following insights:
• In the scratch pad memory, the CPEs allocate a double memory space to accommodate the data of two groups when the multi-cycle DMA performs read/write operations.
• The typical approach to hiding the memory access cost is to make the two buffers alternate with each other: while one buffer serves as the message buffer, the other one is used for calculation.
• The DMA bandwidth between the LDM and the main memory is affected by the initial dimension of the loops, and a memory space in the scratch pad memory is required for buffered data storage when several rounds of DMA perform a high number of read and write operations.

Based on the above design philosophy, we implemented the double-buffering mechanism in the parallel PSMs algorithm to accelerate de novo peptide sequencing. In our implementation, the memory access overhead of the double-buffering mechanism is divided into two parts: the unsheltered part P, which includes the cost of transmitting the data in the first and the last round, and the overlapped cost part P × (N − 1). Eq. (1) shows the speedup of the optimization obtained by using the double-buffering mechanism:

    Speedup = T / ( max[ T / N_core + P × N, P × (N − 1) ] + P )        (1)

where T denotes the total scoring time, N the number of DMA rounds and N_core the number of cores. In the candidate peptide scoring process, the communication cost is much less than the computation cost. Eq. (1) indicates that the speedup of a CG is nearly the number of cores when P is negligible, because P is then barely measurable: when the per-round transfer cost P is orders of magnitude smaller than T / N_core, the denominator reduces to approximately T / N_core and the speedup approaches the core count. On the basis of the theoretical analysis and the experimental results in the subsequent sections, we conclude that the parallel PSMs algorithm gains better optimization performance with the double-buffering mechanism by adequately exploiting the advantages of the CG. Table I shows the performance of peptide sequencing with the parallel PSMs algorithm before and after the optimization.
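A minimal sketch of the double-buffered scoring loop on a CPE is given below. The dma_get_async and dma_wait helpers stand in for the platform's asynchronous DMA intrinsics (athread_get-style transfers with a reply counter); their exact signatures differ on the real system, so treat this as a structural illustration of the overlap captured by Eq. (1) rather than working SW26010 code.

#include <stddef.h>

#define CHUNK_ELEMS 1024   /* multiple of 128 bytes for peak DMA bandwidth */

/* Placeholder wrappers around the platform's asynchronous DMA intrinsics. */
extern void dma_get_async(double *ldm_dst, const double *mem_src,
                          size_t bytes, volatile int *reply);
extern void dma_wait(volatile int *reply, int expected);

extern double score_chunk(const double *candidates, int n); /* compute part */

double double_buffered_scoring(const double *mem, int n_chunks)
{
    static double buf[2][CHUNK_ELEMS];     /* two LDM buffers              */
    volatile int reply[2] = {0, 0};
    double total = 0.0;

    /* Prefetch chunk 0 into buffer 0 (the "unsheltered" first round P).   */
    dma_get_async(buf[0], mem, sizeof buf[0], &reply[0]);

    for (int c = 0; c < n_chunks; ++c) {
        int cur = c & 1, nxt = cur ^ 1;

        /* Start fetching the next chunk while the current one is scored.  */
        if (c + 1 < n_chunks)
            dma_get_async(buf[nxt], mem + (size_t)(c + 1) * CHUNK_ELEMS,
                          sizeof buf[nxt], &reply[nxt]);

        dma_wait(&reply[cur], 1);                  /* current chunk ready  */
        total += score_chunk(buf[cur], CHUNK_ELEMS);
        reply[cur] = 0;                            /* reset reply counter  */
    }
    return total;
}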
V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A series of experiments were performed to evaluate the performance and scalability of our proposed SWPepNovo implementation. In this section, we first introduce the experimental environments and datasets, then compare the performance of SWPepNovo with several state-of-the-art de novo peptide sequencing tools, and finally evaluate the scalability and accuracy of SWPepNovo.

Experimental setup

In the experiments, we implemented and evaluated the parallel PSMs algorithm on the Sunway SW26010 many-core architecture. The configuration of the Sunway TaihuLight system is listed in Table II. The experimental spectra data used in the experiments were obtained from https://www.iprox.org/ and were generated by a tandem mass spectrometry experiment that analyzed a liver cancer mixture [37]. In order to measure the speedup accurately, three different datasets and parameter settings were used in the experiments, as shown in Table III. We mainly considered the scale of the MS/MS dataset to test the sequencing speed; Dataset.1, Dataset.2 and Dataset.3 act as the small, medium and large computing scales, respectively.

TABLE II: The Sunway TaihuLight system configuration

Core                             SW26010 processor
Processor node                   4 CGs (4 MPEs and 256 CPEs)
OS                               Sunway Raise OS 2.0.5 (based on Linux)
Instruction set                  Sunway-64
Compiled languages               Fortran, C, C++
Parallel programming interfaces  OpenACC 2.0, OpenMP 3.1, MPI 3.0

Performance on a single SW26010 node

First, we compared the single-node SW26010 performance of the proposed SWPepNovo implementation to PepNovo+. Three different datasets (see Table III) were used in the experiments to enhance the accuracy of the experimental results. Note that the entire de novo peptide sequencing process runs on a single Sunway TaihuLight node with one SW26010 processor.

In order to show that SWPepNovo excels in speed, we also ran PepNovo+, DeepNovo-DIA and PEAKS on an Intel E5-2640 CPU running Linux CentOS 6.5. PepNovo+, operated via a command-line interface, is freely available to researchers. DeepNovo-DIA and PEAKS, state-of-the-art implementations of de novo peptide sequencing using exhaustive listing of sequences, achieve the best performance among the existing de novo sequencing methods. Table IV shows the running time comparison between SWPepNovo, PepNovo+, DeepNovo-DIA and PEAKS. As we can see in Table IV, SWPepNovo spends considerably less time than PepNovo+ and PEAKS. Notably, on Dataset.3, SWPepNovo spent 385 seconds in total, remarkably lower than PepNovo+ (8967 s) and PEAKS (4521 s).

TABLE IV: Execution time (seconds) of four de novo sequencing methods.

Software       Dataset.1   Dataset.2   Dataset.3
PEAKS          759 s       2060 s      4521 s
PepNovo+       1582 s      4381 s      8967 s
DeepNovo-DIA   152 s       381 s       825 s
SWPepNovo      91 s        187 s       385 s

Figure 12 illustrates the average sequencing speed of SWPepNovo on the three datasets. The average parent precursor mass of Dataset.3 is 1097 Da, and the corresponding average peptide length is 10. SWPepNovo can easily de novo sequence more than 291 MS/MS spectra per second, whereas PepNovo+ achieves only 13 spectra per second and PEAKS only 25 spectra per second.

TABLE III: De novo peptide sequencing parameters

Dataset     Instrument            Tolerance                            PTMs          Tag length   Size
Dataset.1   LTQ Orbitrap Fusion   Precursor: 2 Da; Fragment: 0.75 Da   C+57, M+16    6            66 MB (MGF), 21,185 spectra
Dataset.2   LTQ Orbitrap Fusion   Precursor: 2 Da; Fragment: 0.75 Da   C+57, M+16    6            256 MB (MGF), 49,225 spectra
Dataset.3   LTQ Orbitrap Lumos    Precursor: 2 Da; Fragment: 0.75 Da   C+57, M+16    6            509 MB (MGF), 120,212 spectra

Fig. 12: De novo sequencing speeds (spectra/second) of SWPepNovo, PepNovo+, DeepNovo-DIA and PEAKS on the three experimental datasets.

As Figure 13 shows, SWPepNovo achieves up to a 28x speedup on a single SW26010 node compared with PepNovo+. This validates that the parallel PSMs algorithm obtains a high parallel efficiency and speedup ratio using a single SW26010 many-core processor.

Fig. 13: Performance of SWPepNovo: speedup over PepNovo+ on Dataset.1-Dataset.3 (single SW26010 node).

Performance on the SW26010 cluster

In order to evaluate the performance of multi-node acceleration, we implemented SWPepNovo on an SW26010 cluster. The impact of the number of nodes in the SW26010 cluster on the performance of SWPepNovo is illustrated in Figure 14, where the X axis represents the number of SW26010 processors in the cluster and the Y axis represents the speedup. In the experiment with three nodes, we obtained a 47x speedup on Dataset.1, a 51x speedup on Dataset.2 and a 52x speedup on Dataset.3. The performance of SWPepNovo increases with the size of the cluster, and the advantage of SWPepNovo over PepNovo+ becomes more significant as the cluster size increases.

Fig. 14: Performance of SWPepNovo on multiple nodes: speedup for one, two and three SW26010 nodes on the three datasets.

Performance for processing large-scale datasets

To illustrate the large-scale data-processing capacity of our parallel PSMs algorithm, we also evaluated SWPepNovo on extremely large datasets, formed by merging Dataset.1, Dataset.2 and Dataset.3. Figure 15 shows the execution time of SWPepNovo as the dataset size increases from 0.51 GB (120,212 spectra) up to 11.22 GB (2,644,664 spectra). For the 11.22 GB dataset, SWPepNovo took only 78.5 minutes. From Figure 15 we can also see that SWPepNovo de novo sequences extremely large spectra datasets with a linear increase in execution time with respect to the dataset size. Meanwhile, the validity is demonstrated by comparing the SWPepNovo results with PepNovo+. This confirms that the parallel PSMs algorithm achieves much higher execution performance than the original serial PSMs algorithm without sacrificing the accuracy and correctness of the de novo peptide sequencing results.

Fig. 15: Performance of SWPepNovo on datasets sized from 0.51 GB to 11.22 GB: runtime in minutes versus dataset size in GB (the 11.22 GB dataset takes 78.5 minutes).

Accuracy analysis

In this subsection, we verify the accuracy of SWPepNovo by comparing its resultant cosine values with those of PepNovo+. The results are presented in Table V. They confirm that SWPepNovo achieves much higher execution performance than PepNovo+ without sacrificing the accuracy and correctness of the results.

TABLE V: Accuracy analysis of SWPepNovo

Dataset     Cosine of SWPepNovo   Cosine of PepNovo+
Dataset 1   0.98546               0.98546
Dataset 2   0.97852               0.97852
Dataset 3   0.98365               0.98365

VI. CONCLUSIONS

As the size of MS/MS spectra datasets increases rapidly, the excessive computation time taken by de novo peptide sequencing has become a critical concern in computational biology. This study presents SWPepNovo, a parallel PSMs algorithm that accelerates large-scale de novo peptide sequencing on the Sunway TaihuLight supercomputer. The experimental results demonstrate that the parallel PSMs algorithm can significantly reduce the execution time of large-scale MS/MS spectra analysis without sacrificing the accuracy of the results.

ACKNOWLEDGMENT

The research was partially funded by the National Key R&D Program of China (Grant No. 2018YFB0203800), the National Outstanding Youth Science Program of the National Natural Science Foundation of China (Grant No. 61625202), the Key Program of the National Natural Science Foundation of China (Grant No. 61432005), the International (Regional) Cooperation and Exchange Program of the National Natural Science Foundation of China (Grant No. 61860206011), and the Hunan Provincial Innovation Foundation for Postgraduates (Grant No. 541109080026).

REFERENCES

[1] J. H. Gross, "Tandem mass spectrometry," in Mass Spectrometry. Springer, 2017, pp. 539–612.
[2] J. Allmer, "Algorithms for the de novo sequencing of peptides from tandem mass spectra," Expert Review of Proteomics, vol. 8, no. 5, pp. 645–657, 2011.
[3] J. K. Eng, A. L. McCormack, and J. R. Yates, "An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database," Journal of the American Society for Mass Spectrometry, vol. 5, no. 11, pp. 976–989, 1994.
[4] R. Craig and R. C. Beavis, "Tandem: matching proteins with tandem mass spectra," Bioinformatics, vol. 20, no. 9, pp. 1466–1467, 2004.
[5] M. Hirosawa, M. Hoshida, M. Ishikawa, and T. Toya, "Mascot: multiple alignment system for protein sequences based on three-way dynamic programming," Bioinformatics, vol. 9, no. 2, pp. 161–167, 1993.
[6] D. Li, Y. Fu, R. Sun, C. X. Ling, Y. Wei, H. Zhou, R. Zeng, Q. Yang, S. He, and W. Gao, "pFind: a novel database-searching software system for automated peptide and protein identification via tandem mass spectrometry," Bioinformatics, vol. 21, no. 13, pp. 3049–3050, 2005.
[7] A. Frank and P. Pevzner, "PepNovo: de novo peptide sequencing via probabilistic network modeling," Analytical Chemistry, vol. 77, no. 4, pp. 964–973, 2005.
[8] B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, and G. Lajoie, "PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry," Rapid Communications in Mass Spectrometry, vol. 17, no. 20, pp. 2337–2342, 2003.
[9] H. Chi, R.-X. Sun, B. Yang, C.-Q. Song, L.-H. Wang, C. Liu, Y. Fu, Z.-F. Yuan, H.-P. Wang, S.-M. He et al., "pNovo: de novo peptide sequencing and identification using HCD spectra," Journal of Proteome Research, vol. 9, no. 5, pp. 2713–2724, 2010.
[10] K. Jeong, S. Kim, and P. A. Pevzner, "UniNovo: a universal tool for de novo peptide sequencing," Bioinformatics, vol. 29, no. 16, pp. 1953–1962, 2013.
[11] K. G. Standing, "Peptide and protein de novo sequencing by mass spectrometry," Current Opinion in Structural Biology, vol. 13, no. 5, pp. 595–601, 2003.
[12] V. Dančík, T. A. Addona, K. R. Clauser, J. E. Vath, and P. A. Pevzner, "De novo peptide sequencing via tandem mass spectrometry," Journal of Computational Biology, vol. 6, no. 3-4, pp. 327–342, 1999.
[13] C. Li, T. Chen, Q. He, Y. Zhu, and K. Li, "MRUniNovo: an efficient tool for de novo peptide sequencing utilizing the Hadoop distributed computing framework," Bioinformatics, vol. 33, no. 6, pp. 944–946, 2016.
[14] W. H. Tang, B. R. Halpern, I. V. Shilov, S. L. Seymour, S. P. Keating, A. Loboda, A. A. Patel, D. A. Schaeffer, and L. M. Nuwaysir, "Discovering known and unanticipated protein modifications using MS/MS database searching," Analytical Chemistry, vol. 77, no. 13, pp. 3931–3946, 2005.
[15] H. Lin, S. Peng, and J. Huang, "Special issue on computational resources and methods in biological sciences," International Journal of Biological Sciences, vol. 14, no. 8, pp. 807–810, 2018.
[16] A. Didticas, "The Intel Many Integrated Core architecture," in International Conference on High Performance Computing & Simulation, 2012.
[17] D. Luebke, M. Harris, J. Krger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley, and A. Lefohn, "GPGPU: general purpose computation on graphics hardware," in ACM SIGGRAPH 2004 Course Notes, 2006, p. 208.
[18] K. Perelygin, L. Shui, and X. Wu, "Graphics processing units and open computing language for parallel computing," Computers & Electrical Engineering, vol. 40, no. 1, pp. 241–251, 2014.
[19] W. N. Chen and H. M. Hang, "H.264/AVC motion estimation implementation on compute unified device architecture (CUDA)," in IEEE International Conference on Multimedia & Expo, 2008.
[20] J. V. Olsen, B. Macek, O. Lange, A. Makarov, S. Horning, and M. Mann, "Higher-energy C-trap dissociation for peptide modification analysis," Nature Methods, vol. 4, no. 9, pp. 709–712, 2007.
[21] J. M. Wells and S. A. McLuckey, "Collision-induced dissociation (CID) of peptides and proteins," Methods in Enzymology, vol. 402, pp. 148–185, 2005.
[22] H. Fu, J. Liao, J. Yang, L. Wang, Z. Song, X. Huang, C. Yang, W. Xue, F. Liu, and F. Qiao, "The Sunway TaihuLight supercomputer: system and applications," Science China Information Sciences, vol. 59, no. 7, p. 072001, 2016.
[23] W. Dong, K. Li, L. Kang, Z. Quan, and K. Li, "Implementing molecular dynamics simulation on the Sunway TaihuLight system with heterogeneous many-core processors," Concurrency and Computation: Practice and Experience, vol. 30, no. 16, p. e4468, 2018.
[24] Y. Chen, K. Li, W. Yang, G. Xiao, X. Xie, and T. Li, "Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer," IEEE Transactions on Parallel and Distributed Systems, 2018.
[25] G. Xiao, K. Li, K. Li, and Z. Xu, "Efficient top-(k,l) range query processing for uncertain data based on multicore architectures," Distributed & Parallel Databases, vol. 33, no. 3, pp. 381–413, 2015.
[26] J. Lee, Y. Yeu, H. Roh, Y. Yoon, and S. Park, "BulkAligner: a novel sequence alignment algorithm based on graph theory and Trinity," Information Sciences, vol. 303, pp. 120–133, 2015.
[27] Q. Ren, "CRISPRMatch: an automatic calculation and visualization tool for high-throughput CRISPR genome-editing data analysis," International Journal of Biological Sciences, vol. 14, no. 8, pp. 858–862, 2018.
[28] Y. Li, H. Chi, L.-H. Wang, H.-P. Wang, Y. Fu, Z.-F. Yuan, S.-J. Li, Y.-S. Liu, R.-X. Sun, R. Zeng et al., "Speeding up tandem mass spectrometry based database searching by peptide and spectrum indexing," Rapid Communications in Mass Spectrometry, vol. 24, no. 6, pp. 807–814, 2010.
[29] J. Chen, K. Li, H. Rong, K. Bilal, N. Yang, and K. Li, "A disease diagnosis and treatment recommendation system based on big data mining and cloud computing," Information Sciences, vol. 435, 2018.
[30] X. Zhu, K. Li, A. Salah, L. Shi, and K. Li, "Parallel implementation of MAFFT on CUDA-enabled graphics hardware," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 1, pp. 205–218, 2015.
[31] R. Hussong, B. Gregorius, A. Tholey, and A. Hildebrandt, "Highly accelerated feature detection in proteomics data sets using modern graphics processing units," Bioinformatics, vol. 25, no. 15, pp. 1937–1943, 2009.
[32] A. M. Frank, "A ranking-based scoring function for peptide-spectrum matches," Journal of Proteome Research, vol. 8, no. 5, pp. 2241–2252, 2009.
[33] B. Ma, "Novor: real-time peptide de novo sequencing software," Journal of the American Society for Mass Spectrometry, vol. 26, no. 11, pp. 1885–1894, 2015.
[34] Y. Chen, K. Li, X. Fei, Z. Quan, and K. Li, "Implementation and optimization of a data protecting model on the Sunway TaihuLight supercomputer with heterogeneous many-core processors," Concurrency and Computation: Practice and Experience, p. e4758.
[35] J. Fang, H. Fu, W. Zhao, B. Chen, W. Zheng, and G. Yang, "swDNN: a library for accelerating deep learning applications on Sunway TaihuLight," in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2017, pp. 615–624.
[36] B. Li, B. Li, and D. Qian, "PFSI.sw: a programming framework for sea ice model algorithms based on Sunway many-core processor," in 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2017, pp. 119–126.
[37] J. Ma, T. Chen, S. Wu, C. Yang, M. Bai, K. Shu, K. Li, G. Zhang, Z. Jin, F. He et al., "iProX: an integrated proteome resource," Nucleic Acids Research, vol. 47, no. D1, pp. D1211–D1217, 2018.

Chuang Li is currently pursuing the Ph.D. degree in the College of Computer Science and Electronic Engineering, Hunan University, China. His research interests include parallel computing, cloud computing, machine learning, and big data. He has published research articles in Bioinformatics.

Kenli Li received the Ph.D. degree in computer science from Huazhong University of Science and Technology, China, in 2003. He is currently a full professor of computer science and technology at Hunan University and director of the National Supercomputing Center in Changsha. His major research areas include parallel computing, high-performance computing, and grid and cloud computing. He has published more than 130 research papers in international conferences and journals, such as IEEE-TC, IEEE-TPDS, IEEE-TSP, JPDC, ICPP, and CCGrid. He is an outstanding member of CCF, a senior member of the IEEE, and serves on the editorial board of IEEE Transactions on Computers.

Keqin Li is a SUNY Distinguished Professor of computer science. His current research interests include parallel computing and high-performance computing, distributed computing, energy-efficient computing and communication, heterogeneous computing systems, cloud computing, big data computing, CPU-GPU hybrid and cooperative computing, multicore computing, storage and file systems, wireless communication networks, sensor networks, peer-to-peer file sharing systems, mobile computing, service computing, Internet of things and cyber-physical systems. He has published over 550 journal articles, book chapters, and refereed conference papers, and has received several best paper awards. He is currently serving or has served on the editorial boards of IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Computers, IEEE Transactions on Cloud Computing, IEEE Transactions on Services Computing, and IEEE Transactions on Sustainable Computing. He is an IEEE Fellow.

Xianghui Xie received his Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is a professor at the State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi. His research interests include computer architecture and parallel computing.

Feng Lin is an Associate Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, and the Director of the Biomedical Informatics Lab. His research interests include biomedical informatics, bioimaging, computer graphics and visualization, and high performance computing. He has published 240 research papers and monographs, including those in IEEE TPAMI and TMI. He is a Senior Member of IEEE.