
SWPepNovo: An Efficient De Novo Peptide Sequencing Tool for Large-scale MS/MS Spectra Analysis

Chuang Li, Kenli Li, Keqin Li, Xianghui Xie, Feng Lin

Abstract—Tandem mass spectrometry (MS/MS)-based de novo peptide sequencing is a powerful method for high-throughput protein analysis. However, the explosively increasing size of MS/MS spectra datasets inevitably and exponentially raises the computational demand of existing de novo peptide sequencing methods, an issue that urgently needs to be solved in computational biology. This paper introduces SWPepNovo, an efficient tool based on the SW26010 many-core processor, to process large-scale peptide MS/MS spectra using a parallel peptide-spectrum matches (PSMs) algorithm. Our design employs a two-level parallelization mechanism: (1) task-level parallelism between MPEs using MPI, based on a data transformation method and a dynamic feedback task scheduling algorithm, and (2) thread-level parallelism across CPEs using asynchronous task transfer and multithreading. Moreover, three optimization strategies, including vectorization, double buffering and memory access optimization, have been employed to overcome both the compute-bound and the memory-bound bottlenecks in the parallel PSMs algorithm. The results of experiments conducted on multiple spectra datasets demonstrate the performance of SWPepNovo against three state-of-the-art tools for peptide sequencing, including PepNovo+, PEAKS and DeepNovo-DIA. SWPepNovo also shows high scalability in experiments on extremely large datasets sized up to 11.22 GB. The software and the parameter settings are available at https://github.com/ChuangLi99/SWPepNovo.

Index Terms—Large-scale MS/MS spectra analysis, de novo peptide sequencing, high performance computing, SW26010

Chuang Li, Kenli Li and Keqin Li are with the College of Computer Science and Electronic Engineering, Hunan University, and the National Supercomputing Center in Changsha, Changsha, 410082, China (e-mail: [email protected], [email protected]).
Keqin Li is also with the Department of Computer Science, State University of New York, New Paltz, NY 12561, USA (e-mail: [email protected]).
Xianghui Xie is with the State Key Laboratory of Mathematical Engineering and Advanced Computing, Jiangnan Institute of Computing Technology, Wuxi, Jiangsu, China (e-mail: [email protected]).
Feng Lin is with the School of Computer Science and Engineering, Nanyang Technological University, 639798, Singapore (e-mail: asfl[email protected]).

I. INTRODUCTION

In the post-genomic era, proteomics has become one of the most active research fields, and mass spectrometry has developed into a leading technology for large-scale analysis of proteins, including high-throughput analysis of proteins and determination of their primary structures [1]. There are two basic methods for protein analysis using MS/MS spectra: database-search based peptide sequencing and de novo peptide sequencing [2].

Database-search based peptide sequencing, which aims at retrieving all candidate sequences from a specified protein sequence database for each MS/MS spectrum [3], is a widely used method for protein analysis [3], [4], [5], [6]. In database searching, it is generally assumed that the genomes are precisely sequenced and that the protein-coding genes and RNA genes are completely annotated. The latter assumption is not satisfactory, because many alternatively spliced genes do not exist in currently available databases [7]. The major limitation of this method is its heavy dependence on the protein database. In addition, because relatively simple scoring modules are used, identifications are easily missed by database searching. Thus, database searching methods cannot provide a complete solution for protein analysis.

Many efforts have been directed to the development of de novo sequencing methods for protein analysis. Various de novo sequencing methods, such as PEAKS [8], PepNovo+ [7], pNovo [9] and UniNovo [10], have been developed in recent years. De novo sequencing can directly extract a peptide sequence from an MS/MS spectrum without knowledge of the organism or even its genomic sequences [11], and it can handle post-translational modifications (PTMs), sequence variations and mass spectra with low signal-to-noise ratios that cannot be effectively processed by database searching methods [12]. As a consequence, de novo peptide sequencing, as an irreplaceable tool for discovering new proteins and PTMs, is now widely acknowledged in proteomics research.

However, the amount of MS/MS spectra data has been increasing sharply in recent years, benefiting from technological breakthroughs in modern mass spectrometry [13]. Moreover, protein and peptide analysis criteria have become more demanding, e.g., when chemical and post-translational modifications are considered and/or when semi-unconstrained enzyme searches are performed [14]. Accordingly, analyzing this huge amount of MS/MS data using de novo peptide sequencing becomes a significant challenge for proteome researchers. Without more powerful and efficient de novo peptide sequencing algorithms, the expected computational bottleneck will limit the scope of discoveries to small-scale MS/MS spectra data. A breakthrough in efficient de novo sequencing is therefore crucial for large-scale protein analysis in computational biology [15].

Fortunately, various high performance computing systems (HPCS), such as the Intel Many Integrated Core architecture (MIC) [16] and the Graphics Processing Unit (GPU) [17], have recently been developed to improve computational efficiency. GPUs support parallel programming languages such as the Open Computing Language (OpenCL) [18] and the Compute Unified Device Architecture (CUDA) [19], which are widely used in computational biology research. The MIC architecture, which contains 60+ cores and 512-bit-wide vector units, is a coprocessor designed for highly parallel multithreaded applications with high memory requirements. A similar architecture, the SW26010 many-core processor, has recently been developed at the National Research Center of Parallel Computer Engineering & Technology and can be applied to protein and peptide analysis. This paper proposes an efficient parallel PSMs algorithm for large-scale MS/MS spectra data analysis on the SW26010. The main contributions and innovations of this study can be summarized as follows:

1) We design and implement the parallel PSMs algorithm using a two-level parallelization mechanism. To the best of our knowledge, our algorithm is the first attempt to improve the efficiency of large-scale MS/MS spectra data analysis and processing.
2) We present a highly effective, structurally optimized MS/MS data organization to overcome the memory access bandwidth bottleneck, and we propose a highly scalable intra-MPE communication scheme that achieves a parallelization efficiency of over 85%.
3) We adopt the SW26010 processor for large-scale protein analysis using the parallel PSMs algorithm. In the design and realization, we also employ asynchronous task transfer and propose a series of effective optimization strategies to decrease the communication costs between the management processing elements (MPEs) and the computing processing elements (CPEs) and to balance the workload on each CPE, which results in a 10x speedup compared with the un-optimized version.
4) We also demonstrate the scalability of SWPepNovo by scaling the size of the datasets and the number of SW26010 nodes. We obtain an ideal speedup on a multi-node cluster that contains three SW26010 processors with a total of 4096 CPEs. Experimental results show that our method achieves excellent scalability without sacrificing accuracy or correctness in the de novo peptide sequencing results.

We believe that the techniques we use can guide the design of similar work on the neo-heterogeneous SW26010 many-core architecture. The software and the parameter settings are available from https://github.com/ChuangLi99/SWPepNovo. For users without access to TaihuLight, SWPepNovo can be run as a multi-threaded (OpenMP) application on an MPI cluster.

The rest of this paper is organized as follows. Section II gives the background on MS/MS-based de novo peptide sequencing, the Sunway TaihuLight and the related work. Section III provides details of the computational design, and Section IV presents the optimization strategies. The experimental performance is evaluated in Section V, where we also validate our results against a previous study. Finally, Section VI concludes the paper.

II. BACKGROUND

In this section, we start with an introduction to the background knowledge of MS/MS-based de novo peptide sequencing and the SW26010 many-core processor (the main processor of the Sunway TaihuLight supercomputer), and then present the existing parallel works in protein and peptide mass spectra analysis.

A. De Novo Peptide Sequencing

De novo peptide sequencing aims to deduce an amino acid sequence from an MS/MS spectrum without the use of a protein sequence database. Figure 1 shows the processing flow of MS/MS spectra analysis using de novo sequencing methods, which mainly includes three key parts:

Fig. 1: Workflow of the de novo peptide sequencing. The experimental stage takes a peptide mixture through high performance liquid chromatography to MS/MS spectra; the data analysis stage builds a spectrum graph, scores candidates and outputs a sequence (e.g., SGFLEEDELK).

1) Experimental spectra generation: First, the mixed proteins are digested into mixed peptides by enzymes. Then the peptides are fragmented and ionized (e.g., by higher-energy collisional dissociation (HCD) [20] or collision-induced dissociation (CID) [21]) in liquid chromatography tandem mass spectrometry (LC-MS/MS). Finally, the MS/MS spectra are output. Figure 2 shows an MS/MS spectrum, which contains the measured m/z and intensity of the fragments, represented by the peaks [10]. Different ionization methods have a dramatic impact on the propensities for producing particular fragment ion types. For example, in CID there are six series of fragment ions, denoted by the C-terminal x, y and z types and the N-terminal a, b and c types [12], as shown in Figure 3.

Fig. 2: An example of an MS/MS spectrum, annotated with the b- and y-ion series of the peptide SGFLEEDELK.
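To make the fragment-ion notation above concrete, the short C sketch below computes singly charged b- and y-ion m/z values for a peptide such as SGFLEEDELK from approximate monoisotopic residue masses. It is an illustrative reconstruction, not part of SWPepNovo; the residue mass table and the proton/water constants are standard values assumed here only for demonstration.

#include <stdio.h>
#include <string.h>

/* Approximate monoisotopic residue masses (Da); illustrative values only. */
static double residue_mass(char aa) {
    switch (aa) {
    case 'G': return 57.02146;  case 'A': return 71.03711;
    case 'S': return 87.03203;  case 'P': return 97.05276;
    case 'V': return 99.06841;  case 'T': return 101.04768;
    case 'L': case 'I': return 113.08406;
    case 'D': return 115.02694; case 'E': return 129.04259;
    case 'K': return 128.09496; case 'F': return 147.06841;
    default:  return 0.0;       /* extend for the remaining residues */
    }
}

int main(void) {
    const char *pep = "SGFLEEDELK";
    const double PROTON = 1.00728, WATER = 18.01056;
    size_t n = strlen(pep);
    double prefix = 0.0, total = 0.0;

    for (size_t i = 0; i < n; ++i) total += residue_mass(pep[i]);

    /* b_i = prefix mass + proton; y_i = suffix mass + water + proton. */
    for (size_t i = 0; i < n - 1; ++i) {
        prefix += residue_mass(pep[i]);
        double b = prefix + PROTON;
        double y = (total - prefix) + WATER + PROTON;
        printf("b%zu = %8.3f   y%zu = %8.3f\n", i + 1, b, n - 1 - i, y);
    }
    return 0;
}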

Fig. 3: The types of fragment ions: the N-terminal a, b, c and C-terminal x, y, z cleavage positions along the peptide backbone.

2) Spectrum graph generation: The spectrum graph is constructed by transforming an effective peak set into a graph in which each node is an m/z value. Two nodes are connected if the difference between their m/z values equals an amino acid mass.

3) Match scoring: First, the candidate peptides are reconstructed from the spectrum graph. A candidate peptide is a path whose accumulated weight equals the parent mass. Then, the similarity between the experimental MS/MS spectrum and the candidates is calculated using a scoring algorithm, and the top-scoring candidate(s) form the result of the calculation. Finally, the identified MS/MS spectra are merged into the final peptide sequence.
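As an illustration of the spectrum-graph step, the following C sketch connects peaks whose m/z difference matches an amino acid residue mass within a tolerance. It is a simplified, assumption-laden sketch (a single-charge interpretation only, with a hypothetical AA_MASS table and tolerance), not the exact graph construction used by SWPepNovo or PepNovo+.

#include <math.h>
#include <stdlib.h>

#define NUM_AA 20

/* Hypothetical table of the 20 residue masses (Da), filled in elsewhere. */
extern const double AA_MASS[NUM_AA];

typedef struct {
    int     n_nodes;
    double *mz;       /* node i corresponds to peak i (sorted by m/z)  */
    char   *adj;      /* adj[i * n_nodes + j] = 1 if there is an edge i -> j */
} SpectrumGraph;

/* Connect node i to node j when mz[j] - mz[i] matches a residue mass. */
void build_spectrum_graph(SpectrumGraph *g, const double *mz, int n, double tol)
{
    g->n_nodes = n;
    g->mz  = (double *)malloc(sizeof(double) * n);
    g->adj = (char *)calloc((size_t)n * n, 1);

    for (int i = 0; i < n; ++i) g->mz[i] = mz[i];

    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            double diff = mz[j] - mz[i];
            for (int a = 0; a < NUM_AA; ++a) {
                if (fabs(diff - AA_MASS[a]) <= tol) {
                    g->adj[i * n + j] = 1;   /* edge labelled with residue a */
                    break;
                }
            }
        }
    }
}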
B. Sunway TaihuLight Supercomputer

The Sunway TaihuLight supercomputer, developed by the NRCPC, is ranked third in the latest TOP500 list of October 2018 and is installed at the National Supercomputing Center in Wuxi. The peak performance of Sunway TaihuLight is 125.436 Pflops, and the sustained LINPACK performance is 93.015 Pflops, leading to a performance-per-watt of 6051 MFLOPS/W [22]. The Sunway TaihuLight architecture consists of four levels: the entire computing system, a super node, a cabinet, and a computing node. All computing nodes in the system are connected by the Sunway Network, a customized network. In total, Sunway TaihuLight provides 1024 TB of memory and 20 PB of storage. The system software is the 64-bit Sunway RaiseOS [23].

The SW26010 is the core of the Sunway TaihuLight system; its general architecture is illustrated in Figure 4 [24]. The processor includes four core groups (CGs), each equipped with a single management processing element (MPE), an 8x8 array of computing processing elements (CPEs), one memory controller (MC) and 8 GB of physical memory [22].

Fig. 4: The architecture of the SW26010 processor: four core groups, each containing an MPE, an 8x8 CPE cluster with LDM and register communication, a memory controller and DDR3 memory, connected by a network-on-chip (NoC).

The SW26010 processor supports two user programming modes: (1) the chip-sharing mode and (2) the CG-private mode. The chip-sharing mode is offered specifically for applications with high memory requirements. In the CG-private mode, the SW26010 processor behaves as a NUMA (non-uniform memory access) architecture.

To some degree, programming on the CPE cluster is similar to GPU programming. The MPE plays the role of a normal CPU and is mainly responsible for task management, input/output and communication. The CPE cluster plays the role of the accelerator card, which determines the computational power [25]. The peptide-spectrum match scoring in de novo peptide sequencing requires most of the computing power, so to run de novo peptide sequencing with favorable performance, CPE-side optimization mechanisms are indispensable.

C. Related Works

Previous research has shown that many efforts to accelerate protein identification have focused on parallel database-search based peptide sequencing. Lee adopted a graph-based in-memory distributed system to develop a novel sequence alignment algorithm [26]. Qi You developed a fast tool that can efficiently analyze genome-editing datasets [27]. In [28], Li considered the redundant candidate peptides in PSMs and adopted an inverted-index strategy to speed up tandem mass spectrometry analysis. In addition, some of the prevalent peptide sequencing methods have adopted high performance computing (HPC) technology and cloud computing [29]. Notably, Zhu presented an efficient OpenGL-based multiple peptide sequence alignment implementation on GPU hardware [30]. In [31], a GPU-based feature detection algorithm was presented by Hussong to reduce the running time of PSMs.

As the other most powerful method for protein analysis, de novo peptide sequencing has drawn limited attention in proteomics. A pioneering work on speeding up de novo peptide sequencing was done by Frank [32], who presented a discriminative, boosted ranking-based match scoring algorithm that uses machine-learning ranking algorithms and produces identical speedup results while maintaining the same identification results. Another efficient real-time de novo sequencing algorithm, namely Novor, was recently presented by Ma [33]; compared with other de novo peptide sequencing methods, Novor shows a very fast sequencing speed. PEAKS [8], also developed by Ma, does the best job of accelerating de novo peptide sequencing. Although PEAKS achieves great performance in both the speed and the accuracy of de novo peptide sequencing analyses, it is commercial software that is not freely available to academic users.

Recently, the Sunway TaihuLight supercomputer has provided tremendous compute power to researchers, and there are a few early development experiences on it. Chen et al. [34] designed and implemented a parallel AES algorithm, and the results show that the parallel AES algorithm achieves a good speedup. Fang et al. [35] implemented and optimized a library, namely swDNN, that supports efficient deep neural network (DNN) implementations on the Sunway TaihuLight supercomputer. In [36], an SW26010-based programming framework was presented for a Sea Ice Model (SIM) algorithm; according to the experimental results, the programming framework for the SIM algorithm offers up to a 40% performance increase.

III. COMPUTATIONAL DESIGN

In this section, we present the efficient de novo peptide sequencing method for large-scale MS/MS spectra analysis on the SW26010. Our algorithm design benefits both from the inherent parallelism of the match scoring process and from the fact that each CPE in a CG can perform eight simultaneous single-precision or double-precision floating-point operations, and it combines: (1) task-level parallelism between MPEs using a data transformation method and a dynamic feedback task scheduling algorithm, and (2) thread-level parallelism across CPEs using asynchronous task transfer and multithreading. The diagram in Figure 5 illustrates the algorithm framework of our implementation.

A. The Task-level Parallelism between MPEs

In the task-level parallel part, when handling a large-scale MS/MS spectra dataset, the dataset is divided into properly sized chunks depending on the peptide length and on an integral multiple of the number of CGs. Then, every MPE in a CG processes these chunks in a coarse-grained parallel fashion. Since each CG in the SW26010 has its own dedicated random access memory (RAM), and the communication mode is limited to register communication, the assignment and scheduling of tasks is critical to improving the parallel performance. To support de novo peptide sequencing tasks for large-scale datasets effectively and efficiently, we use an MS/MS data transformation method to optimize the original sequence reading format and implement a dynamic distribution framework to assign MS/MS data chunks to the MPEs. The details of the data transformation and the dynamic distribution framework are given in the following subsections.

1) Data Transformation: In order to better match the SW26010's capabilities, our parallel implementation does not load the MS/MS spectra dataset directly, but first transforms it into a structured format. As Figure 6 shows, the transformation process consists of two steps: sorting by spectrum sequence length, and concatenation.

Fig. 6: The MS/MS data transformation: (a) original MS/MS spectra, (b) sorted MS/MS spectra, (c) sets of concatenated spectrum groups separated by spectrum terminators.

Sorting: In the SW26010 processor architecture, each MPE manages 64 CPEs for parallel processing. This means that the threads on the CPEs have to wait for each other to finish their workload instead of continuing independently. Therefore, for the purpose of shortening the waiting time, all the MS/MS spectra are sorted by spectrum length to minimize the deviation between neighboring threads. The time complexity of this sorting process is O(N²), where N is the number of amino acid sequences.

Concatenation: Even though sorting by length somewhat balances the workload contributed by each MS/MS spectrum, different MS/MS spectra still have different parent masses. To overcome this issue, the spectra within a chunk are concatenated with one another to form spectrum groups, whose lengths within a spectra chunk are nearly equal to the size of the biggest spectrum in that set. Each thread achieves workload balancing by inserting spectrum terminators between the concatenated spectra. The time complexity of this step is O(N³), where N is the number of amino acid sequences.

2) Dynamic Distribution Framework: As shown in Figure 7, a random set of chunks is first executed to explore the availability of the CGs, which yields information about computing time and load balancing. Then, based on the feedback from the previous step, we define a feedback regulatory factor for each CG. Finally, we adjust the number of chunks assigned to each CG according to the feedback regulatory factor of its MPE.

Fig. 7: A flowchart of the task distribution framework: a task distributor assigns chunks to the MPEs of CG1-CG4 and adjusts the assignment according to their feedback.

In the implementation, choosing appropriate load parameters as the feedback regulatory factor is very important for eliminating system bottlenecks and balancing the load dynamically.
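To make the two-step transformation described above (sorting and concatenation) concrete, the C sketch below sorts spectra by length and then packs them into fixed-size groups separated by a terminator marker. The Spectrum type, the terminator value and the simple grouping rule are simplifying assumptions for illustration; they do not reproduce SWPepNovo's exact on-disk format.

#include <stdlib.h>
#include <string.h>

typedef struct {
    int     length;     /* number of peaks in the spectrum */
    double *mz;         /* peak m/z values                 */
} Spectrum;

#define SPEC_TERMINATOR (-1.0)   /* assumed sentinel between spectra */

static int by_length(const void *a, const void *b)
{
    return ((const Spectrum *)a)->length - ((const Spectrum *)b)->length;
}

/* Sort spectra by length, then concatenate them into one flat buffer,
 * group by group, inserting a terminator after each spectrum so that a
 * CPE thread can detect spectrum boundaries.                           */
double *pack_spectrum_groups(Spectrum *s, int n, int group_size, size_t *out_len)
{
    qsort(s, n, sizeof(Spectrum), by_length);

    size_t total = 0;
    for (int i = 0; i < n; ++i) total += (size_t)s[i].length + 1;

    double *buf = malloc(total * sizeof(double));
    size_t  pos = 0;
    for (int g = 0; g < n; g += group_size) {
        int end = (g + group_size < n) ? g + group_size : n;
        for (int i = g; i < end; ++i) {
            memcpy(buf + pos, s[i].mz, s[i].length * sizeof(double));
            pos += s[i].length;
            buf[pos++] = SPEC_TERMINATOR;
        }
    }
    *out_len = pos;
    return buf;
}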

Fig. 5: The algorithm framework of our implementation: a two-level parallelization scheme on the SW26010 many-core architecture. A sequential phase performs Stage 1 (MS/MS spectrum data transformation) and Stage 2 (preprocessing of the spectrum files and parameters). A parallel phase performs Stage 3 (spectrum graph construction on the MPEs, coarse-grained parallel processing), Stage 4 (candidate peptide reconstruction and scoring on the CPEs, fine-grained parallel processing) and Stage 5 (write-back of the best matching result to the MPE), with a dynamic task distributor feeding the MPEs of CG1-CG4 from a task pool and a result collector gathering the outputs. The scheme combines (1) task-level parallelism between MPEs using a dynamic feedback task distribution method (based on MPI), and (2) thread-level parallelism across CPEs (based on aThread).

Our first priority is the CPE utilization of the SW26010 processor. To calculate the CPE utilization, we extract the real-time information parameters from the file /proc/stat. In addition, the length of the task queue determines whether the scheduler can keep up with the system requirements: if the computing time of a task queue is too long, the many-core system will be in an overloaded state. Thus, the average task queue length is also an important input to the feedback regulatory factor. In our experiments, we obtained the average queue length of a single CPE from the related parameters in the file /proc/loadavg. The detailed steps of the dynamic feedback task scheduling process are described in Figure 8. The experimental results show that the dynamic feedback task distribution keeps the system load imbalance below 9 percent in most cases.

Fig. 8: Feedback-and-adjustment-based dynamic task scheduling process. (1) Users choose an MPE in the computing environment service as the task scheduling host. (2) The scheduling host uses the configuration requirements to filter the static resource information and accesses the real-time information of the MPE computing resources through the NoC network. (3) The scheduling MPE distributes chunks to the other MPEs, monitors the execution status of the sequencing tasks and collects the computing results. (4) According to the changes of the feedback regulatory factor, the geometric weighted coefficient is adjusted, and the process returns to step 2. The procedure completes when the load on each CG is balanced.
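A minimal sketch of this dynamic feedback distribution loop is shown below, assuming MPI (which the paper uses at the task level). The feedback regulatory factor here is a simple weighted combination of the reported CPE utilization and queue length; the exact weighting, message tags and chunk encoding used by SWPepNovo are not specified in the text, so they are placeholders.

#include <mpi.h>

#define TAG_FEEDBACK 1
#define TAG_CHUNK    2

typedef struct {
    double cpe_util;     /* from /proc/stat on the worker MPE        */
    double queue_len;    /* average task queue length, /proc/loadavg */
} Feedback;

/* Rank 0 acts as the scheduling host: it assigns more chunks to the
 * core groups whose feedback regulatory factor indicates spare capacity. */
void distribute_chunks(int n_chunks, int n_workers)
{
    int next_chunk = 0;
    while (next_chunk < n_chunks) {
        Feedback fb;
        MPI_Status st;
        MPI_Recv(&fb, sizeof fb, MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_FEEDBACK, MPI_COMM_WORLD, &st);

        /* Feedback regulatory factor: low utilization and a short queue
         * mean the worker can take a larger batch (placeholder weights). */
        double factor = 0.7 * (1.0 - fb.cpe_util)
                      + 0.3 * (1.0 / (1.0 + fb.queue_len));
        int batch = 1 + (int)(factor * 4.0);
        if (batch > n_chunks - next_chunk) batch = n_chunks - next_chunk;

        int msg[2] = { next_chunk, batch };   /* first chunk id, count */
        MPI_Send(msg, 2, MPI_INT, st.MPI_SOURCE, TAG_CHUNK, MPI_COMM_WORLD);
        next_chunk += batch;
    }
    /* Tell every worker that no chunks remain. */
    for (int w = 1; w <= n_workers; ++w) {
        int done[2] = { -1, 0 };
        MPI_Send(done, 2, MPI_INT, w, TAG_CHUNK, MPI_COMM_WORLD);
    }
}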

3) The intra-MPE Communication Scheme: Because the CPEs do not have a cache and the latency of accessing DDR3 memory is fairly high, explicitly managing the use of the local device memory (LDM) in the SW26010 architecture is critical. Direct memory access (DMA) is widely acknowledged as an efficient method to transfer data between the LDM and main memory. Thus, we adopt the asynchronous DMA-fetching strategy provided by Sunway TaihuLight to overlap data transfers from shared memory to local device memory with computation. Figure 9 illustrates this strategy.

Fig. 9: Asynchronous data transfer strategy. While the chunk in one buffer is being scored, the subsequent chunk is fetched into the other buffer using DMA-fetching intrinsics.

B. The Thread-level Parallelism across CPEs

In the neo-heterogeneous SW26010 many-core architecture, the CPE cluster plays the role of a coprocessor and dominates the computing power. In order to fully utilize the parallel computing power of the CPE cluster, it is vital to coordinate the relationship between the CPEs and the MPE. The SW26010 supports four kinds of programming models for combining the CPEs and the MPE, which can be used to implement and optimize parallel applications. When employing the dynamic parallel mode, the task distribution between the CPEs has to be handled carefully. In our implementation, we adopt the dynamic parallel programming model shown in Figure 10. The MPE and the CPEs in the SW26010 serve different functions during the computation: the MPE serves as a job manager responsible for task allocation, while each CPE is used as a computing node that is primarily responsible for receiving and executing tasks and returning the results to the corresponding MPE.

Fig. 10: The dynamic parallel programming model: the CPEs ask the MPE for their initial data, compute, and the MPE collects the results.

In the thread-level parallel part, the chunks are consecutively loaded into a separate CG for de novo sequencing. The peptide sequencing executed on a CG consists of four phases. In the first phase, each MS/MS spectrum is converted to a spectrum graph on the MPE; the spectrum graph is constructed by transforming an effective peak set into a graph in which each node is an m/z value and two nodes are connected if the difference of their m/z values equals an amino acid mass. In the second phase, the candidate peptide dataset is reconstructed based on the spectrum graph; a candidate peptide is a path whose sum of weights equals the parent mass. In the third phase, each MPE distributes each spectrum graph and the corresponding candidate peptides to the CPE cluster. In the last phase, the candidate peptides are scored on the CPE cluster. The pseudo-code of the parallel PSMs algorithm is shown in Algorithm 1. The parallelization at this level is implemented through a special accelerated threading library, called aThread.

Algorithm 1 Parallel PSMs algorithm
Require: the dataset of candidate peptides and the experimental spectra. spectra: experimental spectra; graph: the spectrum graph; paths: the dataset of candidate peptides; path_i: the i-th candidate peptide.
Ensure: the top-one score of each path
1:  athread_init();                        // initialization
2:  athread_set_num_threads(NUM_THREADS);
3:  for each spectrum in spectra do
4:      search graph, get paths
5:      for each path_i in paths do
6:          para <- parameters of match scoring
7:          athread_spawn(path_Slave, &para);
8:          athread_join();
9:      end for
10: end for
11: athread_halt();
12: return

We further employ an optional asynchronous task-loading strategy, as shown in Figure 11. First, we designate a list of processes that only load reads. While the other processes in the active run-queue are computing scores, these processes are responsible for loading new spectra chunks. Once the loading and the computation are completed, all processes in the run-queue receive the loaded data. When using 8 processes in our experiments, this strategy reduces the computational idle time between two read blocks by more than 85%.
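Since the paper notes that SWPepNovo can also run as a multi-threaded (OpenMP) application on an MPI cluster, the sketch below shows how the scoring loop of Algorithm 1 might look in such a fallback mode. The score_candidate function, the data types and the best-score reduction are assumptions made for illustration; on the real SW26010 the same loop body would run inside the aThread slave function instead.

#include <omp.h>
#include <float.h>

typedef struct { const double *mz; int n_peaks; double parent_mass; } SpectrumData;
typedef struct { const char *seq; int len; } Candidate;

/* Hypothetical scoring routine: similarity between a candidate peptide
 * and the experimental spectrum (higher is better).                     */
extern double score_candidate(const Candidate *c, const SpectrumData *s);

/* Score all candidate peptides of one spectrum in parallel and return
 * the index of the best-scoring candidate (the "top one" of Algorithm 1). */
int score_all_candidates(const Candidate *cands, int n_cands,
                         const SpectrumData *spec, double *best_score)
{
    int    best_idx = -1;
    double best     = -DBL_MAX;

    #pragma omp parallel
    {
        int    local_idx  = -1;
        double local_best = -DBL_MAX;

        #pragma omp for schedule(dynamic, 16)
        for (int i = 0; i < n_cands; ++i) {
            double s = score_candidate(&cands[i], spec);
            if (s > local_best) { local_best = s; local_idx = i; }
        }

        #pragma omp critical
        if (local_best > best) { best = local_best; best_idx = local_idx; }
    }
    *best_score = best;
    return best_idx;
}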

IV. OPTIMIZATION STRATEGIES

In our experiments, the dataset size reaches up to 10 GB and excessive data replication exists in the scoring process. Thus, a naive parallel implementation is not a perfect solution. To get better performance for large-scale de novo peptide sequencing, we employ three optimization strategies, namely vectorization, memory access optimization and double buffering. The performance increase obtained by these optimization strategies is shown in Table I.

Fig. 11: The asynchronous task-loading strategy: (a) the normal case, in which every process loads the parameters, builds the index and loads the reads before each broadcast; (b) with asynchronous task-loading, dedicated reading processes load the next spectra chunk while the mapping processes compute, and the reads are then broadcast to all processes.
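The asynchronous task-loading idea in panel (b) can be sketched with MPI non-blocking broadcasts: a dedicated reader rank loads the next spectra chunk while the other ranks score the current one. MPI_Ibcast is a standard MPI-3 call; the chunk layout and the reader-rank convention are assumptions made for this illustration, not SWPepNovo's exact protocol.

#include <mpi.h>

#define READER_RANK 0            /* assumed dedicated reading process */
#define CHUNK_DOUBLES (1 << 20)

extern int  load_next_chunk(double *buf);            /* reader-side I/O   */
extern void score_chunk(const double *buf, int len); /* compute-side work */

void pipeline_chunks(int my_rank, int n_chunks)
{
    static double buf[2][CHUNK_DOUBLES];
    MPI_Request req[2];

    /* Reader loads chunk 0, then everybody starts its broadcast. */
    if (my_rank == READER_RANK) load_next_chunk(buf[0]);
    MPI_Ibcast(buf[0], CHUNK_DOUBLES, MPI_DOUBLE, READER_RANK,
               MPI_COMM_WORLD, &req[0]);

    for (int c = 0; c < n_chunks; ++c) {
        int cur = c & 1, nxt = cur ^ 1;

        /* Reader overlaps loading of chunk c+1 with the others' scoring. */
        if (c + 1 < n_chunks) {
            if (my_rank == READER_RANK) load_next_chunk(buf[nxt]);
            MPI_Ibcast(buf[nxt], CHUNK_DOUBLES, MPI_DOUBLE, READER_RANK,
                       MPI_COMM_WORLD, &req[nxt]);
        }

        MPI_Wait(&req[cur], MPI_STATUS_IGNORE);       /* chunk c is ready */
        if (my_rank != READER_RANK) score_chunk(buf[cur], CHUNK_DOUBLES);
    }
}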

TABLE I: Running time before and after optimization

Optimization technique          Running time (seconds)   Benefit
Before optimization             418 s                    0%
Vectorization                   385 s                    7.8%
Memory access optimization      379 s                    9.3%
Double buffering                364 s                    12.9%
With all optimizations          352 s                    19.5%

A. Vectorization

Vectorization is critical for running the code efficiently on neo-heterogeneous systems. In the Sunway TaihuLight architecture, each CPE in a CG can process eight floating-point operations within an instruction cycle, and the SW26010 many-core processor offers a dedicated SIMD processing unit and corresponding instructions. Moreover, the automatic vectorization of the original code does not generate an efficient binary. Therefore, vectorization is a key point of the optimization process for an efficient implementation of de novo peptide sequencing.

In our implementation, since data dependences exist in the innermost loop, the first task is to modify the dependent statements to eliminate those dependences. Then, to achieve an efficient utilization of all available CPE computing resources through vectorization, we adopt an inter-loop vectorization operation to manually expand the inner loop. When variable mapping operations occur in a function, a SIMD store or a SIMD load, the data must be aligned on a 32-byte boundary; in the implementation we therefore apply data padding to make each memory access naturally aligned. As the key step of the entire optimization process, the vectorization technique achieved a performance of 57 Gflops.
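The fragment below sketches how an innermost scoring loop can be manually expanded for the SW26010 SIMD unit (256-bit wide, i.e. four doubles per operation). The doublev4 type and the simd_load/simd_store intrinsics follow the usage shown in public Sunway programming materials, but the exact header contents and the scoring arithmetic here (a simple weighted sum over 32-byte-aligned, padded arrays) are assumptions rather than SWPepNovo's actual kernel.

/* Compiled for the CPE with the Sunway slave compiler; simd.h is assumed
 * to provide the doublev4 vector type and simd_load/simd_store intrinsics. */
#include <simd.h>

/* Weighted sum of matched peak intensities; n is padded to a multiple
 * of 4 and both arrays are assumed to be 32-byte aligned.               */
double vector_score(const double *match, const double *weight, int n)
{
    double   zero[4] __attribute__((aligned(32))) = {0.0, 0.0, 0.0, 0.0};
    doublev4 vsum;
    simd_load(vsum, zero);

    for (int i = 0; i < n; i += 4) {
        doublev4 vm, vw;
        simd_load(vm, &match[i]);     /* four matched intensities  */
        simd_load(vw, &weight[i]);    /* four ion-type weights     */
        vsum += vm * vw;              /* element-wise multiply-add */
    }

    double lanes[4] __attribute__((aligned(32)));
    simd_store(vsum, lanes);          /* horizontal reduction      */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}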
B. Memory Access Optimization

As described in Figure 4, each CPE contains a user-controlled scratch pad memory (SPM). Replacing caches with a controlled scratch pad memory gives programmers more initiative and efficiency. Meanwhile, the compiling system supports a collective communication interface and DMA intrinsics, which provide an asynchronous transmission mode between the scratch pad memory of a CPE and the main memory of the MPE. If the amount of data transmitted is a multiple of 128 bytes, the maximum peak performance of the DMA intrinsics can be achieved. Typically, DMA supports three modes of data transmission. In the single-CPE mode, each scratch pad memory exchanges data with the main memory individually. In the broadcast mode, data in the main memory are scheduled to all CPEs. In the single-row mode, each row of SPMs transfers data with the main memory. In our work, we use the single-CPE mode to design and implement the parallel PSMs algorithm, which makes full use of the compute resources.

During the optimization process, each scratch pad memory sequentially receives a dataset of candidate peptides from the main memory. Each CPE completes the scoring and sends its top-one result back to the main memory. Note that the amount of data transferred must be a multiple of 128 bytes. The DMA intrinsics reduce the number of memory accesses during each round of scoring, and we obtain optimal performance and a higher speedup on the SW26010 by using them.

C. Double Buffering

Although the DMA intrinsics can reduce the cost of memory access, the parallel efficiency and scalability still have a huge margin for improvement, especially in the candidate peptide scoring part. For the best performance and minimal communication consumption, we use a double-buffering mechanism to overlap the communication cost with the computation cost. Our design is based on the following insights:
• In the scratch pad memory, the CPEs allocate a double memory space to accommodate the data of two groups when the multi-cycle DMA performs read/write operations.
• The typical approach to hiding the memory access cost is to make the two buffers alternate with each other: while one buffer serves as the message buffer, the other one is used for calculation.
• The DMA bandwidth between the LDM and the main memory is affected by the initial dimension of the loops, and a memory space in the scratch pad memory is required for buffered data storage when several rounds of DMA perform a high number of read and write operations.

Based on the above design philosophy, we implemented the double-buffering mechanism in the parallel PSMs algorithm to accelerate de novo peptide sequencing. In our implementation, the memory access overhead of the double-buffering mechanism is divided into two parts: the unsheltered part P, which includes the cost of transmitting the data in the first and the last round, and the overlapped cost part P × (N − 1). Eq. (1) shows the speedup of the optimization obtained by using the double-buffering mechanism:

    Speedup = T / ( max[ T / N_core + P × N, P × (N − 1) ] + P )        (1)

where T denotes the total scoring time, N the number of DMA rounds and N_core the number of cores. In the candidate peptide scoring process, the communication cost is much less than the computation cost. Eq. (1) indicates that the speedup of a CG is nearly the number of cores when P is negligible, because P is then barely measurable: when the per-round transfer cost P is orders of magnitude smaller than T / N_core, the denominator reduces to approximately T / N_core and the speedup approaches the core count. On the basis of the theoretical analysis and the experimental results in the subsequent sections, we conclude that the parallel PSMs algorithm gains better optimization performance with the double-buffering mechanism by adequately exploiting the advantages of the CG. Table I shows the performance of peptide sequencing with the parallel PSMs algorithm before and after the optimization.
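A minimal sketch of the double-buffered scoring loop on a CPE is given below. The dma_get_async and dma_wait helpers stand in for the platform's asynchronous DMA intrinsics (athread_get-style transfers with a reply counter); their exact signatures differ on the real system, so treat this as a structural illustration of the overlap captured by Eq. (1) rather than working SW26010 code.

#include <stddef.h>

#define CHUNK_ELEMS 1024   /* multiple of 128 bytes for peak DMA bandwidth */

/* Placeholder wrappers around the platform's asynchronous DMA intrinsics. */
extern void dma_get_async(double *ldm_dst, const double *mem_src,
                          size_t bytes, volatile int *reply);
extern void dma_wait(volatile int *reply, int expected);

extern double score_chunk(const double *candidates, int n); /* compute part */

double double_buffered_scoring(const double *mem, int n_chunks)
{
    static double buf[2][CHUNK_ELEMS];     /* two LDM buffers              */
    volatile int reply[2] = {0, 0};
    double total = 0.0;

    /* Prefetch chunk 0 into buffer 0 (the "unsheltered" first round P).   */
    dma_get_async(buf[0], mem, sizeof buf[0], &reply[0]);

    for (int c = 0; c < n_chunks; ++c) {
        int cur = c & 1, nxt = cur ^ 1;

        /* Start fetching the next chunk while the current one is scored.  */
        if (c + 1 < n_chunks)
            dma_get_async(buf[nxt], mem + (size_t)(c + 1) * CHUNK_ELEMS,
                          sizeof buf[nxt], &reply[nxt]);

        dma_wait(&reply[cur], 1);                  /* current chunk ready  */
        total += score_chunk(buf[cur], CHUNK_ELEMS);
        reply[cur] = 0;                            /* reset reply counter  */
    }
    return total;
}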
V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A series of experiments were performed to evaluate the performance and scalability of our proposed SWPepNovo implementation. In this section, we first introduce the experimental environments and datasets, then compare the performance of SWPepNovo with several state-of-the-art de novo peptide sequencing tools, and finally evaluate the scalability and accuracy of SWPepNovo.

Experimental setup

In the experiments, we implemented and evaluated the parallel PSMs algorithm on the Sunway SW26010 many-core architecture. The configuration of the Sunway TaihuLight system is listed in Table II. The experimental spectra data used in the experiments were obtained from https://www.iprox.org/ and were generated by a tandem mass spectrometry experiment that analyzed a liver cancer mixture [37]. In order to measure the speedup accurately, three different datasets and parameter settings were used in the experiments, as shown in Table III. We mainly considered the scale of the MS/MS dataset to test the sequencing speed; Dataset.1, Dataset.2 and Dataset.3 act as the small, medium and large computing scales, respectively.

TABLE II: The Sunway TaihuLight system configuration

Core                             SW26010 processor
Processor node                   4 CGs (4 MPEs and 256 CPEs)
OS                               Sunway Raise OS 2.0.5 (based on Linux)
Instruction set                  Sunway-64
Compiled languages               Fortran, C, C++
Parallel programming interfaces  OpenACC 2.0, OpenMP 3.1, MPI 3.0

Performance on a single SW26010 node

First, we compared the single-node SW26010 performance of the proposed SWPepNovo implementation to PepNovo+. Three different datasets (see Table III) were used in the experiments to enhance the accuracy of the experimental results. Note that the entire de novo peptide sequencing process runs on a single Sunway TaihuLight node with one SW26010 processor.

In order to show that SWPepNovo excels in speed, we also ran PepNovo+, DeepNovo-DIA and PEAKS on an Intel E5-2640 CPU running Linux CentOS 6.5. PepNovo+, operated via a command-line interface, is freely available to researchers. DeepNovo-DIA and PEAKS, state-of-the-art implementations of de novo peptide sequencing using exhaustive listing of sequences, achieve the best performance among the existing de novo sequencing methods. Table IV shows the running time comparison between SWPepNovo, PepNovo+, DeepNovo-DIA and PEAKS. As we can see in Table IV, SWPepNovo spends considerably less time than PepNovo+ and PEAKS. Notably, on Dataset.3, SWPepNovo spent 385 seconds in total, remarkably lower than PepNovo+ (8967 s) and PEAKS (4521 s).

TABLE IV: Execution time (seconds) of four de novo sequencing methods.

Software       Dataset.1   Dataset.2   Dataset.3
PEAKS          759 s       2060 s      4521 s
PepNovo+       1582 s      4381 s      8967 s
DeepNovo-DIA   152 s       381 s       825 s
SWPepNovo      91 s        187 s       385 s

Figure 12 illustrates the average sequencing speed of SWPepNovo on the three datasets. The average parent precursor mass of Dataset.3 is 1097 Da, and the corresponding average peptide length is 10. SWPepNovo can easily de novo sequence more than 291 MS/MS spectra per second, whereas PepNovo+ achieves only 13 spectra per second and PEAKS only 25 spectra per second.

TABLE III: De novo peptide sequencing parameters

Dataset     Instrument            Tolerance                            PTMs          Tag length   Size
Dataset.1   LTQ Orbitrap Fusion   Precursor: 2 Da; Fragment: 0.75 Da   C+57, M+16    6            66 MB (MGF), 21,185 spectra
Dataset.2   LTQ Orbitrap Fusion   Precursor: 2 Da; Fragment: 0.75 Da   C+57, M+16    6            256 MB (MGF), 49,225 spectra
Dataset.3   LTQ Orbitrap Lumos    Precursor: 2 Da; Fragment: 0.75 Da   C+57, M+16    6            509 MB (MGF), 120,212 spectra

Fig. 12: De novo sequencing speeds (spectra/second) of SWPepNovo, PepNovo+, DeepNovo-DIA and PEAKS on the three experimental datasets.

As Figure 13 shows, SWPepNovo achieves up to a 28x speedup on a single SW26010 node compared with PepNovo+. This validates that the parallel PSMs algorithm obtains a high parallel efficiency and speedup ratio using a single SW26010 many-core processor.

Fig. 13: Performance of SWPepNovo: speedup over PepNovo+ on Dataset.1-Dataset.3 (single SW26010 node).

Performance on the SW26010 cluster

In order to evaluate the performance of multi-node acceleration, we implemented SWPepNovo on an SW26010 cluster. The impact of the number of nodes in the SW26010 cluster on the performance of SWPepNovo is illustrated in Figure 14, where the X axis represents the number of SW26010 processors in the cluster and the Y axis represents the speedup. In the experiment with three nodes, we obtained a 47x speedup on Dataset.1, a 51x speedup on Dataset.2 and a 52x speedup on Dataset.3. The performance of SWPepNovo increases with the size of the cluster, and the advantage of SWPepNovo over PepNovo+ becomes more significant as the cluster size increases.

Fig. 14: Performance of SWPepNovo on multiple nodes: speedup for one, two and three SW26010 nodes on the three datasets.

Performance for processing large-scale datasets

To illustrate the large-scale data-processing capacity of our parallel PSMs algorithm, we also evaluated SWPepNovo on extremely large datasets, formed by merging Dataset.1, Dataset.2 and Dataset.3. Figure 15 shows the execution time of SWPepNovo as the dataset size increases from 0.51 GB (120,212 spectra) up to 11.22 GB (2,644,664 spectra). For the 11.22 GB dataset, SWPepNovo took only 78.5 minutes. From Figure 15 we can also see that SWPepNovo de novo sequences extremely large spectra datasets with a linear increase in execution time with respect to the dataset size. Meanwhile, the validity is demonstrated by comparing the SWPepNovo results with PepNovo+. This confirms that the parallel PSMs algorithm achieves much higher execution performance than the original serial PSMs algorithm without sacrificing the accuracy and correctness of the de novo peptide sequencing results.

Fig. 15: Performance of SWPepNovo on datasets sized from 0.51 GB to 11.22 GB: runtime in minutes versus dataset size in GB (the 11.22 GB dataset takes 78.5 minutes).

Accuracy analysis

In this subsection, we verify the accuracy of SWPepNovo by comparing its resultant cosine values with those of PepNovo+. The results are presented in Table V. They confirm that SWPepNovo achieves much higher execution performance than PepNovo+ without sacrificing the accuracy and correctness of the results.

TABLE V: Accuracy analysis of SWPepNovo

Dataset     Cosine of SWPepNovo   Cosine of PepNovo+
Dataset 1   0.98546               0.98546
Dataset 2   0.97852               0.97852
Dataset 3   0.98365               0.98365

VI. CONCLUSIONS

As the size of MS/MS spectra datasets increases rapidly, the excessive computation time taken by de novo peptide sequencing has become a critical concern in computational biology. This study presents SWPepNovo, a parallel PSMs algorithm that accelerates large-scale de novo peptide sequencing on the Sunway TaihuLight supercomputer. The experimental results demonstrate that the parallel PSMs algorithm can significantly reduce the execution time of large-scale MS/MS spectra analysis without sacrificing the accuracy of the results.

ACKNOWLEDGMENT

The research was partially funded by the National Key R&D Program of China (Grant No. 2018YFB0203800), the National Outstanding Youth Science Program of the National Natural Science Foundation of China (Grant No. 61625202), the Key Program of the National Natural Science Foundation of China (Grant No. 61432005), the International (Regional) Cooperation and Exchange Program of the National Natural Science Foundation of China (Grant No. 61860206011), and the Hunan Provincial Innovation Foundation for Postgraduates (Grant No. 541109080026).

REFERENCES

[1] J. H. Gross, "Tandem mass spectrometry," in Mass Spectrometry. Springer, 2017, pp. 539–612.
[2] J. Allmer, "Algorithms for the de novo sequencing of peptides from tandem mass spectra," Expert Review of Proteomics, vol. 8, no. 5, pp. 645–657, 2011.
[3] J. K. Eng, A. L. McCormack, and J. R. Yates, "An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database," Journal of the American Society for Mass Spectrometry, vol. 5, no. 11, pp. 976–989, 1994.
[4] R. Craig and R. C. Beavis, "Tandem: matching proteins with tandem mass spectra," Bioinformatics, vol. 20, no. 9, pp. 1466–1467, 2004.
[5] M. Hirosawa, M. Hoshida, M. Ishikawa, and T. Toya, "Mascot: multiple alignment system for protein sequences based on three-way dynamic programming," Bioinformatics, vol. 9, no. 2, pp. 161–167, 1993.
[6] D. Li, Y. Fu, R. Sun, C. X. Ling, Y. Wei, H. Zhou, R. Zeng, Q. Yang, S. He, and W. Gao, "pFind: a novel database-searching software system for automated peptide and protein identification via tandem mass spectrometry," Bioinformatics, vol. 21, no. 13, pp. 3049–3050, 2005.
[7] A. Frank and P. Pevzner, "PepNovo: de novo peptide sequencing via probabilistic network modeling," Analytical Chemistry, vol. 77, no. 4, pp. 964–973, 2005.
[8] B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, and G. Lajoie, "PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry," Rapid Communications in Mass Spectrometry, vol. 17, no. 20, pp. 2337–2342, 2003.
[9] H. Chi, R.-X. Sun, B. Yang, C.-Q. Song, L.-H. Wang, C. Liu, Y. Fu, Z.-F. Yuan, H.-P. Wang, S.-M. He et al., "pNovo: de novo peptide sequencing and identification using HCD spectra," Journal of Proteome Research, vol. 9, no. 5, pp. 2713–2724, 2010.
[10] K. Jeong, S. Kim, and P. A. Pevzner, "UniNovo: a universal tool for de novo peptide sequencing," Bioinformatics, vol. 29, no. 16, pp. 1953–1962, 2013.
[11] K. G. Standing, "Peptide and protein de novo sequencing by mass spectrometry," Current Opinion in Structural Biology, vol. 13, no. 5, pp. 595–601, 2003.
[12] V. Dančík, T. A. Addona, K. R. Clauser, J. E. Vath, and P. A. Pevzner, "De novo peptide sequencing via tandem mass spectrometry," Journal of Computational Biology, vol. 6, no. 3-4, pp. 327–342, 1999.
[13] C. Li, T. Chen, Q. He, Y. Zhu, and K. Li, "MRUniNovo: an efficient tool for de novo peptide sequencing utilizing the Hadoop distributed computing framework," Bioinformatics, vol. 33, no. 6, pp. 944–946, 2016.
[14] W. H. Tang, B. R. Halpern, I. V. Shilov, S. L. Seymour, S. P. Keating, A. Loboda, A. A. Patel, D. A. Schaeffer, and L. M. Nuwaysir, "Discovering known and unanticipated protein modifications using MS/MS database searching," Analytical Chemistry, vol. 77, no. 13, pp. 3931–3946, 2005.
[15] H. Lin, S. Peng, and J. Huang, "Special issue on computational resources and methods in biological sciences," International Journal of Biological Sciences, vol. 14, no. 8, pp. 807–810, 2018.
[16] A. Didticas, "The Intel Many Integrated Core architecture," in International Conference on High Performance Computing & Simulation, 2012.
[17] D. Luebke, M. Harris, J. Krger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley, and A. Lefohn, "GPGPU: general purpose computation on graphics hardware," in ACM SIGGRAPH 2004 Course Notes, 2006, p. 208.
[18] K. Perelygin, L. Shui, and X. Wu, "Graphics processing units and open computing language for parallel computing," Computers & Electrical Engineering, vol. 40, no. 1, pp. 241–251, 2014.
[19] W. N. Chen and H. M. Hang, "H.264/AVC motion estimation implementation on compute unified device architecture (CUDA)," in IEEE International Conference on Multimedia & Expo, 2008.
[20] J. V. Olsen, B. Macek, O. Lange, A. Makarov, S. Horning, and M. Mann, "Higher-energy C-trap dissociation for peptide modification analysis," Nature Methods, vol. 4, no. 9, pp. 709–712, 2007.
[21] J. M. Wells and S. A. McLuckey, "Collision-induced dissociation (CID) of peptides and proteins," Methods in Enzymology, vol. 402, pp. 148–185, 2005.
[22] H. Fu, J. Liao, J. Yang, L. Wang, Z. Song, X. Huang, C. Yang, W. Xue, F. Liu, and F. Qiao, "The Sunway TaihuLight supercomputer: system and applications," Science China Information Sciences, vol. 59, no. 7, p. 072001, 2016.
[23] W. Dong, K. Li, L. Kang, Z. Quan, and K. Li, "Implementing molecular dynamics simulation on the Sunway TaihuLight system with heterogeneous many-core processors," Concurrency and Computation: Practice and Experience, vol. 30, no. 16, p. e4468, 2018.
[24] Y. Chen, K. Li, W. Yang, G. Xiao, X. Xie, and T. Li, "Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer," IEEE Transactions on Parallel and Distributed Systems, 2018.
[25] G. Xiao, K. Li, K. Li, and Z. Xu, "Efficient top-(k,l) range query processing for uncertain data based on multicore architectures," Distributed & Parallel Databases, vol. 33, no. 3, pp. 381–413, 2015.
[26] J. Lee, Y. Yeu, H. Roh, Y. Yoon, and S. Park, "BulkAligner: a novel sequence alignment algorithm based on graph theory and Trinity," Information Sciences, vol. 303, pp. 120–133, 2015.
[27] Q. Ren, "CRISPRMatch: an automatic calculation and visualization tool for high-throughput CRISPR genome-editing data analysis," International Journal of Biological Sciences, vol. 14, no. 8, pp. 858–862, 2018.
[28] Y. Li, H. Chi, L.-H. Wang, H.-P. Wang, Y. Fu, Z.-F. Yuan, S.-J. Li, Y.-S. Liu, R.-X. Sun, R. Zeng et al., "Speeding up tandem mass spectrometry based database searching by peptide and spectrum indexing," Rapid Communications in Mass Spectrometry, vol. 24, no. 6, pp. 807–814, 2010.
[29] J. Chen, K. Li, H. Rong, K. Bilal, N. Yang, and K. Li, "A disease diagnosis and treatment recommendation system based on big data mining and cloud computing," Information Sciences, vol. 435, 2018.
[30] X. Zhu, K. Li, A. Salah, L. Shi, and K. Li, "Parallel implementation of MAFFT on CUDA-enabled graphics hardware," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 1, pp. 205–218, 2015.
[31] R. Hussong, B. Gregorius, A. Tholey, and A. Hildebrandt, "Highly accelerated feature detection in proteomics data sets using modern graphics processing units," Bioinformatics, vol. 25, no. 15, pp. 1937–1943, 2009.
[32] A. M. Frank, "A ranking-based scoring function for peptide-spectrum matches," Journal of Proteome Research, vol. 8, no. 5, pp. 2241–2252, 2009.
[33] B. Ma, "Novor: real-time peptide de novo sequencing software," Journal of the American Society for Mass Spectrometry, vol. 26, no. 11, pp. 1885–1894, 2015.
[34] Y. Chen, K. Li, X. Fei, Z. Quan, and K. Li, "Implementation and optimization of a data protecting model on the Sunway TaihuLight supercomputer with heterogeneous many-core processors," Concurrency and Computation: Practice and Experience, p. e4758.
[35] J. Fang, H. Fu, W. Zhao, B. Chen, W. Zheng, and G. Yang, "swDNN: a library for accelerating deep learning applications on Sunway TaihuLight," in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2017, pp. 615–624.
[36] B. Li, B. Li, and D. Qian, "PFSI.sw: a programming framework for sea ice model algorithms based on Sunway many-core processor," in 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2017, pp. 119–126.
[37] J. Ma, T. Chen, S. Wu, C. Yang, M. Bai, K. Shu, K. Li, G. Zhang, Z. Jin, F. He et al., "iProX: an integrated proteome resource," Nucleic Acids Research, vol. 47, no. D1, pp. D1211–D1217, 2018.

Chuang Li is currently pursuing the Ph.D. degree in the College of Computer Science and Electronic Engineering, Hunan University, China. His research interests include parallel computing, cloud computing, machine learning, and big data. He has published research articles in Bioinformatics.

Kenli Li received the Ph.D. degree in computer science from Huazhong University of Science and Technology, China, in 2003. He is currently a full professor of computer science and technology at Hunan University and director of the National Supercomputing Center in Changsha. His major research areas include parallel computing, high-performance computing, and grid and cloud computing. He has published more than 130 research papers in international conferences and journals, such as IEEE-TC, IEEE-TPDS, IEEE-TSP, JPDC, ICPP, and CCGrid. He is an outstanding member of CCF, a senior member of the IEEE, and serves on the editorial board of IEEE Transactions on Computers.

Keqin Li is a SUNY Distinguished Professor of computer science. His current research interests include parallel computing and high-performance computing, distributed computing, energy-efficient computing and communication, heterogeneous computing systems, cloud computing, big data computing, CPU-GPU hybrid and cooperative computing, multicore computing, storage and file systems, wireless communication networks, sensor networks, peer-to-peer file sharing systems, mobile computing, service computing, Internet of things and cyber-physical systems. He has published over 550 journal articles, book chapters, and refereed conference papers, and has received several best paper awards. He is currently serving or has served on the editorial boards of IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Computers, IEEE Transactions on Cloud Computing, IEEE Transactions on Services Computing, and IEEE Transactions on Sustainable Computing. He is an IEEE Fellow.

Xianghui Xie received his Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is a professor at the State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi. His research interests include computer architecture and parallel computing.

Feng Lin is an Associate Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, and the Director of the Biomedical Informatics Lab. His research interests include biomedical informatics, bioimaging, computer graphics and visualization, and high performance computing. He has published 240 research papers and monographs, including those in IEEE TPAMI and TMI. He is a Senior Member of IEEE.