Improved Nucleosome-Positioning Algorithm iNPS for Accurate Nucleosome Positioning from Sequencing Data
ARTICLE
Received 29 Nov 2013 | Accepted 4 Aug 2014 | Published 18 Sep 2014
DOI: 10.1038/ncomms5909

Weizhong Chen1,2,*, Yi Liu1,3,*, Shanshan Zhu1, Christopher D. Green1, Gang Wei1 & Jing-Dong Jackie Han1

Accurate determination of genome-wide nucleosome positioning can provide important insights into global gene regulation. Here, we describe the development of an improved nucleosome-positioning algorithm, iNPS, which achieves significantly better performance than the widely used NPS package. By determining nucleosome boundaries more precisely and merging or separating shoulder peaks based on local MNase-seq signals, iNPS can unambiguously detect 60% more nucleosomes. The detected nucleosomes display better nucleosome ‘widths’ and neighbouring centre–centre distance distributions, giving rise to sharper patterns and better phasing of average nucleosome profiles and higher consistency between independent data subsets. In addition to its unique advantage in classifying nucleosomes by shape to reveal their different biological properties, iNPS also achieves higher significance and lower false positive rates than previously published methods. The application of iNPS to T-cell activation data demonstrates a greater ability to facilitate detection of nucleosome repositioning, uncovering additional biological features underlying the activation process.

1 Key Laboratory of Computational Biology, Chinese Academy of Sciences-Max Planck Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yue Yang Road, Shanghai 200031, China. 2 Graduate University of Chinese Academy of Sciences, Beijing 100049, China. 3 Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China. * These authors contributed equally to this work.
Correspondence and requests for materials should be addressed to J.-D.J.H. (email: [email protected]).

NATURE COMMUNICATIONS | 5:4909 | DOI: 10.1038/ncomms5909 | www.nature.com/naturecommunications
© 2014 Macmillan Publishers Limited. All rights reserved.

The technology of micrococcal nuclease (MNase) digestion combined with high-throughput sequencing (MNase-seq) is a powerful method to map the genome-wide distribution of nucleosome occupancy1–3. Although it demands high sequencing depth, its popularity has continued to increase over the years, driven by the promise of probing at once the genome-wide chromatin remodelling events incurred by all transcription factor binding and chromatin modifications, and by the rapidly decreasing cost of sequencing. However, the analysis of nucleosome positions is still in its infancy. Nucleosome positioning relies on nucleosome signal coverage, that is, the frequency distribution formed by MNase-digested DNA fragments in a cell population. A genome location occupied by a nucleosome in a number of cells would have high sequencing read coverage. Thus, the essential principle of nucleosome position detection is to find the locations where the MNase-seq coverage is enriched. An effective and efficient strategy is to generate a nucleosome sequencing profile that intuitively depicts the nucleosome distribution as a wave-form, and then to apply a peak-calling procedure to find the peaks on that profile. Zhang et al.4 have developed an algorithm for nucleosome detection called nucleosome positioning from sequencing (NPS).

In practice, however, the accuracy of the NPS algorithm4 needs much improvement: even if the thresholds for all the filtering steps in the programme are lowered or eliminated (Supplementary Table 1), many visually obvious nucleosomes still cannot be detected. To solve this problem, we first identified the technical problems in NPS that contribute to the missing or mis-detected nucleosomes; then, we developed a new package, ‘improved NPS’ (iNPS), by combining the theoretical core algorithm of NPS4 with new algorithms that address these technical problems. iNPS exhibits remarkably improved performance over the original NPS, detecting nucleosomes with higher quality, a lower false positive rate and stronger association with relevant biological events. We also find that the deficiency of NPS can be largely traced to a hard-coded parameter (Supplementary Fig. 1a,b); yet, the performance of iNPS is still significantly better than that of the ‘customized NPS’ in which the hard-coding problem is fixed (Supplementary Fig. 1c). In addition to the NPS comparison (both default and customized NPS), we further demonstrate an overall advantage of iNPS over other recent algorithms used for nucleosome detection. In particular, iNPS has a unique advantage in detecting, based on nucleosome shape, different types of nucleosomes that are associated with different biological properties.

The significant increase in accuracy achieved by iNPS addresses what were previously thought to be major limitations of the MNase-seq technology, namely its low resolution and low consistency in nucleosome boundary determination5. We now demonstrate that the limitations lie not so much in the technology per se as in the accuracy of the computational analysis method. The iNPS software package is freely downloadable at http://www.picb.ac.cn/hanlab/iNPS.html.

Results

Improved nucleosome positioning from MNase-seq data. The original NPS algorithm4 mainly contains the following four steps: (1) nucleosome scoring, which generates a continuous wave-form signal as the distribution profile of genome-wide nucleosome positioning; (2) wavelet denoising, for preliminary mild smoothing of the signal waves; (3) Laplacian of Gaussian convolution (LoG), which further smooths the wave profile and detects inflection points on the profile as the borders of nucleosome peaks; and (4) peak filtering, based on a cutoff on peak shape and the P-value of a Poisson approximation. In NPS, the original nucleosome profile was generated by extending each MNase-seq tag from the 5′ end by 150 bp towards the 3′ position and taking the middle 75 bp as the enriched nucleosome signal; the nucleosomes were thus represented by the peaks on the wave-like original profile. Borders of the nucleosome peaks, represented by paired inflection points on the smoothed profile, could be detected by LoG (Methods). However, the original NPS algorithm cannot precisely determine which pairs of inflection points should be selected as final nucleosomes; thus, some mildly concave parts are called as nucleosome positions (false positives), while some obvious sharp peaks are missed (Fig. 1a,b). To solve this problem, our iNPS algorithm used a new method, the first derivative of Gaussian convolution, to detect max/min-extremum points and so classify each inflection-point pair as a ‘main’ nucleosome peak or a ‘shoulder’ (Methods, Fig. 2a). This new step is based on the observation that a sharp peak has a summit on the smoothed profile, while a mildly concave shoulder does not. Then, this step was followed by the next new step for

[Figure 1 image not reproduced. Panel titles: ‘NPS: results’ and ‘NPS: nucleosome-detection profile’. Legends: original scoring profile, detected nucleosomes, nucleosome-detection profile. x axis: coordinates of chromosome 1 (7,734,000–7,739,000); y axis: nucleosome signal.]

Figure 1 | Nucleosome detection results by the NPS algorithm.
(a,b) The nucleosome detection results of the NPS algorithm on the 7,734,000–7,739,000 bp region of chromosome 1 (hg18) in resting human CD4+ T cells. (a) The detected nucleosomes (orange line). (b) The nucleosome-detection profile (red line): wave-form signal within the detected nucleosomes.

[Figure with panels a–d, image not reproduced. x axis throughout: coordinates of chromosome 1; y axis: nucleosome signal.
(a) Original nucleosome profile and Gaussian convolution smoothed profile, annotated with the max-extremum point, the inflection points, the most-winding position on a ‘shoulder’, the ‘main’ nucleosome, the shoulder, and the region used for the determination of shoulder fate.
(b) Flowchart of the algorithm.
Beginning: tag coordinate BED data.
Step 1: nucleosome scoring. Each tag (25 nt, marking the 5′ end of a potential nucleosome) is extended by 150 nt toward the 3′ end and the middle 75 nt is taken, giving the wave-form nucleosome signal profile.
Step 2: convolution. Gaussian convolution gives the smoothed profile; the first derivative of Gaussian convolution gives the max-extremum and min-extremum points; LoG gives the inflection points; the third derivative of Gaussian convolution gives the most-winding positions; LoG with a smaller deviation gives the inflection points of a more ‘detailed’ wave.
Step 3: identifying ‘main’ nucleosome candidates and ‘shoulder’ candidates.
Step 4: determining whether a shoulder is an independent nucleosome or the dynamic shifting part of the neighbouring ‘main’ nucleosome. Outcomes: ‘main’ nucleosome, independent ‘shoulder’, integrated nucleosome with shoulder.
Step 5: adjusting the inflection border.
(c) Profiles for chromosome 1, 148,328,000–148,334,000: after step 3, after step 4, after step 5, and the final results.
(d) Results by NPS for the same region.]
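The scoring rule (extend each tag by 150 bp, keep the middle 75 bp) and the ‘main’ versus ‘shoulder’ distinction described in the Results text can be illustrated with a minimal Python sketch. This is not the iNPS or NPS source code. The function names, the `(position, strand)` tag representation, the 4-sigma kernel cutoff and the default `sigma=18.0` are all assumptions made for illustration, and the second derivative of a Gaussian is used here as the one-dimensional form of LoG.

```python
import math


def score_profile(tags, length, extend=150, core=75):
    """Per-base signal from (five_prime_pos, strand) tags (hypothetical format).

    Each tag is extended `extend` bp from its 5' end towards its 3' end,
    and only the middle `core` bp of that extension add one count.
    """
    profile = [0.0] * length
    trim = (extend - core) // 2  # bases skipped before the core window
    for five_prime, strand in tags:
        # Leftmost coordinate of the extended fragment on the forward axis.
        left = five_prime if strand == "+" else five_prime - extend + 1
        start = left + trim
        for pos in range(max(start, 0), min(start + core, length)):
            profile[pos] += 1.0
    return profile


def gaussian_kernel(sigma, order):
    """Sampled Gaussian (order 0) or its 1st/2nd derivative, cut at 4 sigma."""
    radius = int(4 * sigma)
    xs = range(-radius, radius + 1)
    g = [math.exp(-x * x / (2.0 * sigma ** 2)) for x in xs]
    total = sum(g)
    g = [v / total for v in g]
    if order == 1:
        return [-x / sigma ** 2 * v for x, v in zip(xs, g)]
    if order == 2:
        return [(x * x / sigma ** 4 - 1.0 / sigma ** 2) * v for x, v in zip(xs, g)]
    return g


def convolve(signal, kernel):
    """Same-length discrete convolution with zero padding at both ends."""
    radius = len(kernel) // 2
    n = len(signal)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i - (j - radius)
            if 0 <= idx < n:
                acc += signal[idx] * weight
        out[i] = acc
    return out


def classify_candidates(profile, sigma=18.0):
    """Return (left, right, kind) for each concave stretch of the profile.

    Borders are paired sign changes of the smoothed second derivative
    (inflection points). A stretch is 'main' if the smoothed first
    derivative changes from positive to non-positive inside it, i.e. it
    contains a summit; otherwise it is a 'shoulder'.
    """
    d1 = convolve(profile, gaussian_kernel(sigma, 1))
    d2 = convolve(profile, gaussian_kernel(sigma, 2))
    candidates = []
    left = None
    for i in range(1, len(profile)):
        if d2[i - 1] >= 0 > d2[i]:
            left = i  # curve turns concave: left border
        elif d2[i - 1] < 0 <= d2[i] and left is not None:
            summit = any(d1[j - 1] > 0 >= d1[j] for j in range(left, i + 1))
            candidates.append((left, i, "main" if summit else "shoulder"))
            left = None
    return candidates
```

A profile built with `score_profile` can be passed straight to `classify_candidates`. The later stages shown in the flowchart (deciding the fate of each shoulder and adjusting the inflection borders) are not covered by this sketch. For real data, a library routine such as `scipy.ndimage.gaussian_filter1d` with its `order` argument would likely replace the hand-written convolution.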