Proteogenomic analysis and global discovery of PNAS PLUS posttranslational modifications in prokaryotes

Ming-kun Yanga,1, Yao-hua Yanga,1, Zhuo Chena, Jia Zhanga, Yan Lina, Yan Wanga, Qian Xionga, Tao Lia,2, Feng Gea,2, Donald A. Bryantb, and Jin-dong Zhaoa

aKey Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China; and bDepartment of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802

Edited* by Jiayang Li, Chinese Academy of Sciences, Beijing, China, and approved November 13, 2014 (received for review July 6, 2014) We describe an integrated workflow for proteogenomic analysis and cyanobacterial cells adjust their cellular activities in response and global profiling of posttranslational modifications (PTMs) in to a wide range of environmental cues and stimuli. Recently, prokaryotes and use the model cyanobacterium Synechococcus sp. cyanobacteria have attracted great interest due to their crucial PCC 7002 (hereafter Synechococcus 7002) as a test case. We found roles in global carbon and nitrogen cycles and their ability to more than 20 different kinds of PTMs, and a holistic view of PTM produce clean and renewable biofuels such as hydrogen (14–16). events in this organism grown under different conditions was Synechococcus 7002 is a unicellular, marine cyanobacterium and obtained without specific enrichment strategies. Among 3,186 pre- a model organism for studying photosynthetic carbon fixation dicted -coding , 2,938 products (>92%) were and the development of biofuels (17, 18). However, whereas the identified. We also identified 118 previously unidentified genome of Synechococcus 7002 is fully sequenced, it is annotated and corrected 38 predicted gene-coding regions in the Synecho- only by in silico methods (www.ncbi.nlm.nih.gov/), with a large coccus 7002 genome. This systematic analysis not only provides portion (1,210 out of 3,186) of protein-coding genes annotated as comprehensive information on protein profiles and the diversity hypothetical proteins (17). Therefore, a comprehensive analysis of PTMs in Synechococcus 7002 but also provides some insights is needed to provide experimental support for the genome an- into photosynthetic pathways in cyanobacteria. The entire proteo- notation so as to facilitate systems-level analysis. Using our method, pipeline is applicable to any sequenced prokaryotic we performed the validation of the predicted protein-coding genes, organism, and we suggest that it should become a standard part identified previously unidentified genes, and corrected gene initia- of genome annotation projects. tion and stop-codon positions in Synechococcus 7002, and di- rectional RNA-Seq was used to determine the existence of a proteogenomics | post-translational modifications | cyanobacteria | number of previously unidentified genes identified in this study. photosynthesis | Synechococcus Significance roteogenomics refers to the correlation of mass spectrome- SCIENCES Ptry-derived proteomic data to refine genome annotation (1) Proteogenomics is the application of -

and has been applied to the identification of previously un- derived proteomic data for testing and refining predicted genetic APPLIED BIOLOGICAL identified genes and the correction and validation of predicted models. Cyanobacteria, the only prokaryotes capable of oxy- genes in various organisms (2–4). It is an important tool for in- genic photosynthesis, are the ancestor of chloroplasts in plants tegrating protein-level information into the genome annotation and play crucial roles in global carbon and nitrogen cycles. An process and can greatly improve genome annotation quality. The integrated proteogenomic workflow was developed, and we same experimental proteomic datasets are also useful in identi- tested this system on a model cyanobacterium, Synechococcus fying posttranslational modifications (PTMs) on a - 7002, grown under various conditions. We obtained a nearly wide level (5, 6). Many cellular proteins undergo appreciable complete genome translational profile of this model organism. amounts of PTM in response to certain stimuli, and this dynamic In addition, a holistic view of posttranslational modification process occurs in various cell compartments to dictate the fate (PTM) events is provided using the same dataset, and the and activity of the modified proteins (7). Identification and results provide insights into photosynthesis. The entire pro- mapping of PTMs in proteins have been improved dramatically, teogenomics pipeline is applicable to any sequenced prokar- mainly due to increases in the sensitivity, speed, accuracy, and yotes and could be applied as a standard part of genome resolution of mass spectrometry (MS). However, system-wide annotation projects. identification of multiple PTMs remains a highly challenging task, especially in situations where some reversible PTMs are Author contributions: T.L., F.G., and J.-d.Z. designed research; M.-k.Y., Y.-h.Y., Z.C., Y.L., induced by a particular stimulus and are present for only a short Y.W., T.L., and F.G. performed research; D.A.B. contributed new reagents/analytic tools; period (8). To the best of our knowledge, very few reports of M.-k.Y., Y.-h.Y., Z.C., J.Z., Y.L., Q.X., T.L., F.G., and J.-d.Z. analyzed data; and M.-k.Y., Z.C., T.L., F.G., D.A.B., and J.-d.Z. wrote the paper. proteogenomic datasets have presently been used to analyze PTM events comprehensively in a genome sequence (9, 10). The authors declare no conflict of interest. In this study, we developed a proteogenomic approach to carry *This Direct Submission article had a prearranged editor. out genome annotation and whole-proteome analysis of PTMs in Freely available online through the PNAS open access option. prokaryotes by using high resolution and high accuracy MS data Data deposition: MS raw data files, processed data files, annotated peptide spectral matches for all accepted single peptide hits, annotated peptide spectrum matches of and the cyanobacterium Synechococcus sp. PCC 7002 (hereafter identified peptides with stop codon read-through, annotated spectra of all identified Synechococcus 7002) as a test case. Cyanobacteria are a mor- peptides with specific PTMs, snapshots of genome browser for all the previously uniden- phologically diverse group of Gram-negative bacteria and are the tified genes along with the corrections to the existing gene models identified in this study, Tables S1–S9, RNA-Seq data, and in-house integrated proteogenomic analysis soft- only prokaryotes capable of oxygenic photosynthesis (11). It is ware have been deposited in the PeptideAtlas repository (www.peptideatlas.org) (acces- estimated that more than half of the photosynthetic activity on sion no. PASS00285). Earth is contributed by cyanobacteria (12). Cyanobacteria make 1M.-k.Y. and Y.-h.Y. contributed equally to this work. substantial contributions to global CO2 assimilation, O2 pro- 2To whom correspondence may be addressed. Email: [email protected] or [email protected]. duction, and N2 fixation and are the progenitors of chloroplasts This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. in higher plants (13). Cyanobacterial habitats are highly diverse, 1073/pnas.1412722111/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1412722111 PNAS | Published online December 15, 2014 | E5633–E5642 Downloaded by guest on September 25, 2021 Fig. 1. Experimental and bioinformatic workflow of the proteogenomic analysis. Protein extracts were prepared from Synechococcus 7002 cultures grown to exponential phase (flask 1) and stationary phase (flask 2) and exposed to stresses, including iron de- ficiency (flask 3), phosphate deficiency (flask 4), ni- trogen deficiency (flask 5), A5 deficiency (flask 6), high light (flask 7), and high salt concentration (flask 8) (2.5 M). After protein extraction, proteins are subjected to Glu-C and Trypsin digestion, producing peptide mixture. The mixture was analyzed by means of nano-LC-MS/MS with an LTQ-Orbitrap Elite mass spectrometer. MS/MS peptide spectra were searched against specific organism genome sequences, vali- dating and correcting genomic annotations, as well as identifying previously unidentified protein-cod- ing genes and diverse PTMs.

More importantly, we characterized PTM features on a pro- identified from the two fractionation methods applied for ana- teome-wide level using the same experimental proteomic data- lyzing the cell-lysate proteins. Sequest, Mascot, MaxQuant, pFind, sets and identified many different PTM types that may play and X!Tandem searches identified 238,918; 239,779; 252,696; important roles in cellular functions. Our proteogenomic data 241,601; and 268,763 peptides, respectively, resulting in the iden- provided significant information for revising the genome anno- tification of 55,862 unique peptides. tation of Synechococcus 7002 and offered insights into the physiology of this model organism. The method and approach Protein Expression Analysis of Synechococcus. The CyanoBase da- can also be used to study genome annotation and cellular protein tabase (genome.microbedb.jp/cyanobase/) lists 3,186 protein-coding PTMs in other organisms. genes in the Synechococcus 7002 genome (3.41 Mb, released 2012). Unique peptides were mapped onto the genome-translated da- Results tabase, and proteins identified by at least two unique peptides or Proteogenomic Strategy for the Analysis of Synechococcus 7002. The by a single peptide with manual validation were reported. In aim of this study was to provide an experimental catalog of the total, we identified 2,938 Synechococcus 7002 proteins with FDR genome-wide gene expression and PTMs in Synechococcus 7002 1% (2,699 proteins with at least two unique peptides and 239 and to use this information to refine genome annotations. To proteins with single-peptide identification), which are listed in enhance coverage of the expressed genome, cultures from eight Table S1A. Proteins identified based on shared peptide evidence different growth conditions were combined, and total protein were listed separately (Table S1B). This comprehensive dataset extracts were isolated. This treatment mimicked native con- enabled us to address the general features of our proteomic ditions experienced by Synechococcus 7002 and allowed greater experiments, especially with respect to coverage of the genome PTM representation in the isolated proteins. A total of 52 sequence by the detected peptides. We first defined the protein- samples were generated and subjected to nanoscale liquid coding part of the genome by mapping the 3,186 annotated chromatographycoupledtotandem mass spectrometry (nano proteins onto the and plasmids (3.41 Mb) (Fig. LC-MS/MS) analysis on a high-resolution LTQ-Orbitrap Elite 2C); this result revealed that at least 87.1% (2.97 Mb) of the mass spectrometer. Using five different algorithms, MS-derived genome is annotated in the protein database and is therefore data were searched against (i) a protein database comprising the protein coding. We next used the sequences of all 2,938 proteins protein sequences of Synechococcus 7002 from CyanoBase and identified in our dataset to estimate the size of the expressed (ii) a database of a six-frame translated genome of Synecho- genome, which corresponded to 98.3% (2.92 Mb) of the an- coccus 7002. The complete workflow is summarized in Fig. 1. notated protein-coding regions. Finally, mapping the detected Peptide spectrum matches (PSMs) were filtered for first-rank peptide sequences onto the chromosome and plasmids captured assignments that passed a 1% false discovery rate (FDR) 2.30 Mb of the raw genome sequence or 77.4% of the protein- threshold. The complete list of peptides identified in our study, coding genome (Fig. 2C). The average sequence coverage per along with PSM scores, charge, m/z value, and mass error, is identified protein was 44.4%, and 1,211 proteins had peptide provided in Table S1A [Tables S1–S9 are available at www. evidence for more than 50% of their sequences (Fig. S1B). Each peptideatlas.org (accession no. PASS00285)]. We achieved protein was represented by 1–357 distinct peptides, with an av- a mean absolute mass deviation of 0.005006 Da (Fig. 2A). The erage coverage of 19 peptides per identified protein (Fig. S1C). average absolute peptide mass accuracy was 1.96 parts per mil- As an illustration of the high coverage of many of the proteins, lion (ppm) (SD, 1.78 ppm), and more than 94% of the identified Fig. S1D depicts the identification of peptides mapping to 100% PSMs had less than 5 ppm mass error, confirming the high ac- of phycocyanin, an alpha subunit (SYNPCC7002_A2210) gene curacy of peptide data obtained from the mass spectrometer product, where we identified 194 unique peptides based on (Fig. S1A). Fig. 2B shows the distribution of peptides and proteins 27,145 PSMs. It is clear from the data that our approach using

E5634 | www.pnas.org/cgi/doi/10.1073/pnas.1412722111 Yang et al. Downloaded by guest on September 25, 2021 PNAS PLUS

Fig. 2. Overview of the proteogenomic results. (A) Scatter plot showing the distribution of the mass errors of all of the identified peptides. (B) The Venn diagram illustrates the relative contribution of the different fractionation methods used to the total number of peptides and proteins identified. (C)Venndiagramil- lustrating the coverage of several levels (whole, protein-coding, expressed, detected) of the Synechococcus 7002 genome. (D) Proteome landscape of the Synechococcus 7002. An overview of peptide data against the genome was generated using the Circos software (45). The concentric circles from the periphery

to the center represent (i) Synechococcus 7002 and six plasmids, (ii) proteins encoded by the genome, (iii) GC content of the Synechococcus SCIENCES 7002 genome, (iv) peptides identified in this study, genome search-specific peptides (GSSPs), (v) previously unidentified gene models (1, intergenic genes; 2,

different frame with annotated genes; and 3, opposite strand to existing genes discovered in this study), and (vi) revised gene models (N-terminal changes in APPLIED BIOLOGICAL annotated genes).

more than one method each of protein isolation, protein/peptide Integration and Visualization of and RNA-Seq Transcriptomic fractionation, and database search resulted in an increased number Data. Increasing numbers of investigators are now incorporating of peptide and protein identifications from the same sample. The RNA-Seq information with proteomic data to gain a more complete high quality of our protein identifications was demonstrated by understanding of cellular systems and improve genome annota- the following: the FDR at the peptide level was lower than 1.0%; tion (19–21). The comparative analysis of RNA and protein the average absolute peptide-mass accuracy was 1.96 ppm; 92% levels can be used as a validation tool to generate a protein atlas of 2,938 identified proteins were mapped by at least two unique with higher reliability. In this study, we performed directional peptides; and all modified peptides and peptides singly assigned RNA sequencing, and the results are presented in Table S1D. to proteins were manually verified. Among all of the identified We further integrated the proteomic data described above with proteins, a large number of hypothetical proteins were identified the transcriptome determined by RNA-Seq to facilitate the val- in Synechococcus 7002 (Table S1C). The Synechococcus 7002 idation of the proteomic results. In-depth transcriptome analyses of RNA-Seq data from Synechococcus cultures under eight dif- genome annotation contains 1,210 (38%) hypothetical proteins, ferent conditions have shown that >97% of the annotated ge- which are functionally uncharacterized due to lack of sequence nome is expressed in these cellular states, and our proteome data similarity to any known genes from model prokaryotes: i.e., set covers >93% of these RNA-Seq data (Fig. 3A). We have thus proteins predicted on the basis of the nucleic acid sequences only generated a comprehensive protein database that covers nearly and protein sequences with unknown function. In this study, we the entire expressed Synechococcus proteome. To compare peptide identified 918 proteins by MS that are annotated as hypothetical versus RNA abundance, we computed a scatter-plot of mRNA proteins, among which 311 were assigned by at least 100 PSMs, expression [reads per kilobase per million mapped reads (RPKM), and only 59 were single-peptide identifications (Fig. S2A). Blast 2,160,264 reads] mapped to a known gene (x axis) versus the was performed against the National Center for Biotechnology protein expression (PSMs, 1,212,428 spectra) (y axis) falling within Information (NCBI) database to obtain functional descriptions the region. The correlation value (r = 0.318) calculated using of the hypothetical proteins, and 99 MS-detected hypothetical Spearman’s rank correlation coefficient test suggested little or no proteins had assigned (GO) terms (Fig. S2B). correlation between protein expression (PSMs) and mRNA ex- The distribution of identified proteins among different biological pression (RPKM) (Fig. 3B). This result is in line with previous processes, molecular functions, and cellular localizations is il- studies, which have shown that protein expression is influenced lustrated in Fig. S2C. Thus, our results provide important clues by an array of posttranscriptional regulatory mechanisms and that for future functional studies of these hypothetical proteins. correlation between protein and mRNA levels is generally modest

Yang et al. PNAS | Published online December 15, 2014 | E5635 Downloaded by guest on September 25, 2021 Fig. 3. Comparison and integration of proteomics and transcriptomics data. (A) Comparison showing the overlap of protein identifications with the tran- scriptome determined using RNA-Seq in Synechococcus 7002. (B) Correlation between mRNA expression (RPKM) and protein expression (PSMs) in genes detected at both the mRNA and protein levels. (C) Screenshot of the main ABrowse genome browser interface. The genome browser interface consists of the navigation bar, the browsing canvas, and the detailed-information/user-space panel, which was used to covisualize experimental peptides and RNA-Seq data for Synechococcus 7002. The sequences of the protein and the peptides can be seen by zooming into the protein sequence track. This type of covisualization can be done on a large scale to comprehensively integrate the proteome with the genome and the transcriptome.

(22, 23); this result was highly similar to those of the previous the Synechococcus 7002 genome, and it is a step toward integration analyses on Medicago truncatula (24), human, and mouse (25). of proteomic (peptide-centric) and transcriptomic (RNA-Seq) data ThesedatawerecombinedwiththegenomedataonSynechococcus with current genome-location coordinates, allowing for in-depth 7002 and made publicly accessible online (lag.ihb.ac.cn:8080/), studies of individual genes and their protein counterparts as well as where they can be browsed by gene, protein, peptide, and PTM. more global studies using systems-biology approaches. Fig. 3C shows a screenshot of ABrowse displaying genomic, RNA-Seq, and proteomic data from Synechococcus 7002. This Identification of Previously Unidentified Genes. We also compared platform helps visualize all of the potential reading frames in the peptide sequences searched against a six-frame translated ge- Synechococcus 7002 genome and has the capability to help nome database with those present in the protein database. We browse and query the data to identify regions of interest quickly identified a set of unique orphan peptides that were not rep- with respect to structural annotation (e.g., previously unidentified resented among the predicted proteins of Synechococcus. These genes or identified peptides and their PTMs). The annotation peptides, designated as genome search-specific peptides (GSSPs), entries shown in the main browsing canvas of ABrowse are all mapped to unique locations on the Synechococcus 7002 genome. clickable, and their corresponding detailed information can be Out of the 2,778 GSSPs identified in this study, 486 peptides either displayed in the “Entry Detail” tab of the detailed-information/ mapped to regions of the genome where no gene was annotated or user-space panel. In addition, all of the modified peptides listed in did not match the gene model they were mapping to. Two gene- the main browsing canvas were highlighted in light blue, and the prediction programs, FgenesB and GeneMark, were used to previously unidentified genes identified in this work were high- identify ORFs in the region to which these GSSPs were mapped. lighted in light yellow in the protein-coding gene models region. Our in-house program was also used to identify ORFs after The interactive visual analytics tool provides a user-friendly web incorporating a wide range of information, including the peptide interface to browse, search, retrieve, and update information on score, an initiation codon, the number of peptide hits, and the

E5636 | www.pnas.org/cgi/doi/10.1073/pnas.1412722111 Yang et al. Downloaded by guest on September 25, 2021 PNAS PLUS SCIENCES APPLIED BIOLOGICAL

Fig. 4. Identification of a previously unidentified gene, Z0010, based on peptides mapping to an intergenic region and RNA sequencing evidence. (A)Sixteen peptides mapped to an intergenic region between genes SynPCC7002_A0518c and SynPCC7002_A0519. algorithms FgenesB and GeneMark supported the presence of this additional gene. In addition, a protein corresponding to this previously unidentified gene has been annotated in the Leptolyngbya sp. PCC 7376 genome. (B) Protein sequence of a previously unidentified gene. Identified region is indicated by red text. (C) RNA sequencing evidence also supports the expression of the previously unidentified gene Z0010.

length of the previously unidentified protein. Using these pro- were predicted using our in-house program. Table S2 A–C lists the grams, we identified 118 previously unidentified protein-coding previously unidentified genes found in this study along with their regions and modified the annotation of 38 existing gene models, all genomic coordinates and the supporting peptide evidence. Among of which were assigned by at least two unique GSSPs. A graph- the previously unidentified genes identified, 13 were intergenic, 41 ical representation of all previously unidentified and corrected were frame-shifted, and 64 were on the strand opposite to an proteins identified in this study is shown in Fig. 2D. Further, we existing gene annotation. Fig. 4 depicts an example of a previously used comparative genomic strategies to investigate conservation unidentified gene that is located in the intergenic region between of previously unidentified gene models across related species. two predicted genes, SynPCC7002_A0518c and SynPCC7002_ The presence of orthologous genes in other species provides A0519. Another previously unidentified protein of 100 amino further support for the previously unidentified gene structures; acids (SynPCC7002_DZ003) was discovered in the region span- conversely, absence of conserved homologous protein-coding ning 11,658–11,960 on plasmid pAQ4 (Fig. S3). Nine GSSPs regions in other genomes may indicate that these genes or gene revealed a previously unidentified gene (SynPCC7002_Z0076) regions may be unique to Synechococcus 7002. Twenty-two of the encoded by the opposite strand of the SynPCC7002_A0539 gene previously unidentified ORFs had been previously annotated in and were identified as part of the gene model encoding a 91- other species, 80 previously unidentified genes were supported amino acid protein predicted by FgenesB and GeneMark and by FgenesB/GeneMark ORF predictions, and the remaining genes supported by directional RNA-Seq data (Fig. S4).

Yang et al. PNAS | Published online December 15, 2014 | E5637 Downloaded by guest on September 25, 2021 N Terminus Validation and Correction. Apart from identification of Identification of Peptides with Stop Codon Read-Through. Aside previously unidentified ORFs, we corrected some gene coordinates from noncanonical translation initiation, protein translation can using GSSP data. Using this strategy, we corrected 38 gene models sometimes continue through a stop codon, a mechanism known of N-terminal extensions with 105 unique GSSPs. Table S2D as “stop codon read-through” (30–32). Two examples of this contains the list of genes for which modification in the structure mechanism are the TGA and TAG stop codons that sometimes was suggested by our analysis, and this table contains information code for the rare amino acids selenocysteine and pyrrolysine, re- about previously assigned coordinates, our modifications, and the spectively (33, 34). In this study, we translated all of the TGA and corresponding peptide evidence. Fig. S5A depicts an example of TAG codons into selenocysteine (U) and pyrrolysine (O), re- correction of a gene model by extension of the N terminus. The spectively, while keeping the third stop codon, TAA, as the true current coordinates of the gene SynPCC7002_A0031, which stop codon. We found 23 read-through candidates with 43 unique codes for the Dps family DNA-binding stress response protein, GSSPs, and Table S2G provides a complete list of the peptides are 28,318–28,854. We found seven peptides mapping upstream identified with their genomic coordinates and RPKM values. of the gene, and elongated PCC7942 by comparative genomic Annotated peptide spectrum matches of the identified peptides analysis to detect another homologous protein in Synechococcus. with stop codon read-through have been deposited to the Pepti- In addition, both FgenesB and GeneMark gene-prediction algo- deAtlas database and can be accessed using the accession number PASS00285. Fig. S6B depicts a previously unidentified gene rithms predicted gene models that extended the gene in the 5′ di- model in which two genes are merged into one based on RNA- rection, in agreement with the newly annotated start codon. Seq and MS experimental evidence of translation of the TAG Thus, a combination of GSSPs, homology, and RNA-Seq–based stop codon as pyrrolysine (O). Read-through has been observed analysis allowed us to extend the gene model 48 nucleotides across the animal kingdom, and its widespread use suggests an upstream from the previous start codon. We further analyzed – additional level of regulatory complexity (31). Therefore, the protein translation start sites (TSSs) by probing N-terminal identified read-through in cyanobacteria offers a rich opportunity specific modifications. Formylation occurs on the initiator to enhance our understanding of the underlying molecular mech- , and N-terminal acetylation occurs on the second anisms and regulation of the read-through process more generally. amino acid after the initiator methionine is cleaved. These modifications directly mark the TSS for a protein-coding Identification of Signal Peptides. A signal peptide is a short N-terminal gene. Based on semiproteolytic peptides identified at <1% sequence that targets a protein for export or transport to a de- FDR and their upstream codons in the genome, we confirmed sired cellular location (35). Signal peptides are essential for the annotated TSS for 52 genes (Table S2E). Fig. S5B depicts proper cellular function in both eukaryotes and prokaryotes (35). two N-terminally formylated unique peptides that facilitate The average length of a signal peptide in Gram-negative bacteria confirmation of the TSS for SynPCC7002_A1803c, a protein is estimated to be 25 amino acids, with most signal peptides involved in the carbon dioxide-concentrating mechanism. We also between 20 and 30 amino acids (36). Although knowledge of identified N-terminal peptides that confirmed the TSS of three signal peptides is important for understanding protein function, revised gene models (SynPCC7002_A0315, SynPCC7002_A1929, they are difficult to confirm experimentally, and computational and SynPCC7002_A0646) (Table S2E). predictions are used to fill the gap. SignalP (37) and PrediSi (38) are two popular signal peptide prediction tools. It is clear that Identification of Proteins with Noncanonical Translation Initiators. In proteomic evidence can greatly increase the number of experi- prokaryotes, translation initiation is typically mediated by the mentally confirmed signal peptides and the confidence of signal start codon ATG. In addition, GTG and TTG are used as al- peptide predictions. We analyzed our peptide annotations to ternative start codons in agreement with the preference: ATG > confirm or refute signal peptide predictions. The lists of pre- GTG > TTG (26, 27). Studies on have estimated dicted signal peptides by SignalP, PrediSi, and MS/MS analysis the frequency of the start codon use as ATG 83%, GTG 14%, are provided in Table S3. A clear sequence motif, which closely and TTG 3% (28). In Synechococcus 7002, the frequencies of matches motifs used by SignalIP and PrediSi, emerges when we initiation codons were ATG 83.11%, GTG 9.51%, TTG 7.31%, examine the sequence immediately upstream of the 107 putative and CTG 0.06%. Additionally, it has been reported that ATT, signal peptides predicted by MS/MS analysis (Fig. S7A). SignalP ATA, and ATC are rare noncanonical translation initiators (28, 29). and PrediSi predict 130 and 332 proteins with signal peptides, A set of 19 putative proteins with noncanonical start codons con- respectively. However, there is a substantial discrepancy between taining 43 unique GSSPs (ATT or ATA) was identified and is these tools: only 81 signals are predicted by both. LC-MS/MS reported in Table S2F. Fig. S6A illustrates an example of the use of evidence provides a strategy to resolve the discrepancies between peptides to predict a nontraditional start codon. We identified two SignalP and PrediSi and identify signal peptides missed by both peptides that map to the opposite strand of SynPCC7002_A0826c; tools. Fig. S7B compares our predicted signal peptides with the however, none of the common ATG, GTG, or TTG start codons predictions made by SignalP and PrediSi. Our results confirm 12 signal peptide predictions. For 119 confirmed proteins, the MS/MS were found upstream of the region spanned by these two peptides. results include peptides upstream of the cleavage site predicted by A survey of available literature on translation initiation revealed SignalP/PrediSi and thus represent evidence against SignalP/PrediSi that ATA is known to function as a rare translation initiator, sug- predictions (Table S3D). Therefore, we refute 25 sites predicted gesting that this previously unidentified gene Z0045 may use ATA by SignalP and 110 sites predicted by PrediSi (with 8 refuted sites as the start codon. A representative MS/MS spectrum of the pep- predicted by both tools) (Fig. S7C). tide INDAALRKE is shown in Fig. S6A, and the nearly complete b and y ion series provide additional confidence to the annotation. Global Posttranslational Modification Discovery in Synechococcus The previously unidentified protein is supported by two gene- 7002. As mentioned above, proteogenomic data may be used to prediction algorithms, FgenesB and GeneMark, and we vali- identify PTMs on the proteome-wide level. Much less is known dated the expression of this gene at the level of the transcript about the type, frequency, and function of PTMs in prokaryotes, with seven mRNA reads and a 2.55 RPKM value. The use of this even for intensively studied model organisms such as E. coli, than noncanonical translation initiation codon may imply a specific in eukaryotes. PTM information gained from MS/MS data of regulatory mechanism, and further experiments are needed to cells grown under different conditions will contribute to an un- characterize the fundamental regulatory mechanisms for cyano- derstanding of PTMs in prokaryotic organisms. Recently, we bacterial cells. described a phosphoproteomic analysis of Synechococcus 7002 by

E5638 | www.pnas.org/cgi/doi/10.1073/pnas.1412722111 Yang et al. Downloaded by guest on September 25, 2021 PNAS PLUS SCIENCES APPLIED BIOLOGICAL

Fig. 5. Summary of the identified proteins with PTMs in Synechococcus 7002. (A) Distribution of the number of proteins and sites with different PTMs in Synechococcus 7002. (B) Distribution of the predicted, identified, and modified proteins of Synechococcus 7002 according to their molecular functions. Different PTMs are represented with different colors. GO cate- gory classification of Synechococcus 7002 proteins as predicted from their genome annotations. *, Denotes overrepresentation.

using protein/peptide prefractionation, TiO2 enrichment, and positive rate is low, it is extremely unlikely that any of these LC-MS/MS analysis (39). In this study, we first analyzed the mass modification types represent a computational artifact. Moreover, spectrausingtheMODa(MODificationviaalignment)algorithm all of the selected modification types in Table S4B are supported (40), which allows for unrestrictive PTM searches with no limitation by studies on other species, further reinforcing the conclusion on the number of modifications per peptide and incorporates that they are not artifacts. several unique features to improve the sensitivity and accuracy of MODa was used to (i) discover and identify unexpected mod- peptide identification and eliminate the increases in false pos- ifications and (ii) assign posttranslationally modified peptide itives and false negatives. In all, 278,859 spectra of 70,042 unique sequences to MS/MS spectra, but MODa cannot be used to as- peptides and 2,469 proteins were identified by MODa. Among sign the localization of modifications to specific amino acids. To these results, 76,103 spectra of 28,905 unique peptides and 1,666 address this problem, the MaxQuant computational proteomics proteins contained PTMs with sizes accepted by the MODa platform was used to search for the specific PTM and determine search, regardless of the modification classification in Unimod. the localization probability of modifications in peptides (41). The Within these PTMs, 7,026 spectra of 3,059 unique peptides and identified peptides with different PTMs were further validated by 690 proteins contained in vivo PTMs. These findings, along with manual inspection of MS/MS data (see criteria in SI Materials the parameters for search and output column descriptions, are and Methods). MaxQuant takes advantage of high-resolution summarizedinTableS4A.TableS4B presents 25 common data such as those obtained by the LTQ-Orbitrap instruments modification types and the frequency of each. Because the false and employs algorithms that determine the mass precision and

Yang et al. PNAS | Published online December 15, 2014 | E5639 Downloaded by guest on September 25, 2021 accuracy of individual peptides (41). This method leads to greatly PTM Events Involved in Photosynthesis. It has been reported that enhanced peptide mass accuracy that can be used as a filter protein PTM events are involved in the regulation of photosyn- in database searching. Our approach identified 23 different thesis in cyanobacteria (39). According to the proteogenomic PTMs on 6,704 unique peptides from 2,230 Synechococcus 7002 data set contributed by this study, there exist many proteins with proteins with high confidence (FDR <1%). Fig. 5A shows the diverse PTMs in the photosynthetic apparatus (Table S9), which numbers and types of PTMs identified by MaxQuant in Syn- are illustrated in Fig. 6A. Mapping PTM events to the major echococcus 7002. We further confirmed the localization accuracy components of the photosynthetic apparatus will facilitate the of these modification sites using probability-based PTM scores. integration of proteogenomic data with biological function and These scores were used to rank peptides in MaxQuant searches may thereby provide insight into the potential functional rel- from the beginning, and they determined the localization prob- evance of the identified PTMs. To confirm the existence of ability of modifications in the peptides. For the modified pep- identified PTMs in Synechococcus 7002 further, we carried out tides detected, 11,839 modification sites were determined with immunoblotting analyses of Synechococcus 7002 total cell lysates a localization probability higher than 0.75 (Table S5A). In the using pan antiacetylation (lysine, abbreviated as K), trimethyla- tion (K), dimethylation (K), butyrylation (K), crotonylation (K), other 1,977 peptides, the modification sites could not be un- phosphorylation (tyrosine, abbreviated as Y), and succinylation ambiguously determined from the mass spectra but are limited to (K) antibodies. The specificity of these antibodies was confirmed the short amino acid sequences in the peptides. Details of the as in previous reports (42, 43). Strong signals were detected for all identified peptides with each PTM, including their protein IDs, seven antibodies tested (Fig. 6B). Interestingly, the succinylation sequences, search algorithm scores, PTM scores, and localization (K) and phosphorylation (Y) signals from Synechococcus 7002 P values, are provided in Table S5A. To evaluate the identifi- cells showed changes after the cells were treated with high light cation performance of PTM events by different algorithms, we for 2 h, suggesting that these two PTM events may play a role in compared the modified peptides and proteins identified by both the regulation of photosynthesis in Synechococcus 7002. Apart PTM search algorithms, and Fig. S7D presents the overlap be- from the high light treatment, we also performed an immuno- tween MODa and MaxQuant search results. It shows that 649 blot analysis of global protein and PTM levels upon different unique modified peptides and 1,518 modified proteins were stress inductions, and the results were shown in Fig. S7F.Al- identified by both PTM search algorithms. Table S4B presents though we cannot validate all of the predicted modifications the common modification sites identified by two PTM search at this time due to lack of available experimental data about algorithms. We also compiled a list of all proteins identified in PTMs for Synechococcus 7002, future research may confirm many this study along with the number of unique peptides, coverage, of these identified modifications and begin to uncover their bio- UniProt accessions, NCBI RefSeq accessions, protein product logical functions. description, domains, Gene Ontology information, signal peptide and transmembrane domain information in Table S5B. Development of a Software Tool for Automated Proteogenomic Analysis. We also developed a software tool for automated spectral searches, Biological Relevance of Modified Proteins. Previous conservation PTM discovery, and downstream analysis that is easily config- analysis indicated that posttranslationally modified proteins are urable and accessible. Inputs from different search engines can more conserved than unmodified proteins from prokaryotes to be used, which allows this software to be easily integrated with all eukaryotes (5). Owing to the high coverage and large number of existing frameworks. The outputs can be used to validate, refine, modified proteins identified, we compared the conservation and discover protein-coding genes using high-throughput MS levels between MS-detected modified and unmodified proteins. data from prokaryotes. Notably, the existing PTM algorithms In this study, we searched for orthologs of Synechococcus pro- (MODa and MaxQuant) were integrated into this tool for pro- teins against 652 bacterial species across the phylogenetic tree, as teome-wide PTM discovery. Based on the reported findings, this well as against 18 Archaea and 61 eukaryotes, by performing 2D tool is, to our knowledge, the first integrated software for both BLASTP. As shown in Fig. S7E, the MS-detected proteins were, large-scale PTM discovery and genome annotation, providing on average, more conserved than the unidentified proteins in comprehensive information on high-coverage protein expression the Synechococcus 7002 database throughout the three super- and PTM events in Synechococcus 7002. We anticipate that the kingdoms (P < 0.01). In addition, our analysis also indicated that software tool used herein can be applied to any sequenced most MS-detected modified proteins are more highly conserved prokaryotic organism to yield similar sets of novel discoveries. < than the MS-detected total proteins (P 0.01), suggesting that Discussion the modified proteins could be involved in the conserved func- In this study, we performed a comprehensive proteogenomic anal- tion class and translation machinery. Table S6 compares some ysis of gene expression in prokaryotes with the goals of (i) genome important PTMs between Synechococcus 7002 and other species. refinement, (ii) in-depth analysis of the detectable proteome, and This set of modified proteins supports the emerging view that (iii) global PTM discovery. To achieve optimal gene-expression PTM events are general and fundamental regulatory processes coverage, we performed both transcriptome and proteome analyses that occur in both prokaryotes and eukaryotes and opens the way of Synechococcus 7002 cultures from different culture conditions. for their functional and evolutionary analysis in cyanobacteria. Both technological platforms used in this study, Illumina GAIIx The entire set of predicted proteins, MS-detected proteins, directional RNA sequencing and LTQ-Orbitrap Elite MS, rep- and modified proteins was searched against the NCBI COG resent the state of the art in the fields of transcriptomics and (Clusters of Orthologous Genes) database and classified into proteomics, respectively. Our results demonstrate that the use of a wide range of functional classes. The distribution of protein high-resolution MS-based proteomics allows an almost complete proportions in different functional classes is illustrated in Fig. 5B genome annotation to be reached by incorporating dozens of and Tables S7 and S8. In particular, certain sets of modified conditions and different separation methods for a specific or- proteins were statistically overrepresented (P < 0.05) in certain ganism. The identified proteins represent 92.2% of the predicted functional classes (Fig. 5B). For example, the identified proteins proteome, which exceeds the results obtained in most published with , persulfide modification, farnesylation, and cases. Multiple factors may prevent 100% coverage, including hydroxymethylation PTMs were overrepresented in photosynthesis extensive posttranscriptional regulation that makes transcripts and respiration processes, suggesting that these PTM events are rather poor guides for protein expression. Here, we have used eight extensively involved in photosynthesis in Synechococcus 7002. different growth conditions, including typical laboratory conditions

E5640 | www.pnas.org/cgi/doi/10.1073/pnas.1412722111 Yang et al. Downloaded by guest on September 25, 2021 PNAS PLUS SCIENCES APPLIED BIOLOGICAL

Fig. 6. Overview of the PTM events involved in the photosynthesis process. (A) Working scheme to delineate the PTM events in photosynthesis pathways in Synechococcus 7002. Modified proteins identified in Synechococcus 7002 are shown using a black arrow. Different PTMs are shown as squares with different colors. (B) Relative proteome-wide PTM levels in Synechococcus 7002 after a 2-h exposure to high light compared with standard conditions. Immunoblots were probed using antibodies for acetyllysine, succinyllysine, butyryllysine, crotonyllysine, dimethyllysine, trimethyllysine, and phosphotyrosine. Coomassie blue staining shows equal loading amounts (Left). HL, high light treatment; C, untreated control.

and several near-native aquatic environmental conditions. However, We consider the method used herein to be a valuable test bed not all possible growth conditions for Synechococcus 7002 can be for technology that will be widely applicable to any sequenced efficiently sampled, and it is likely that certain ORFs require specific prokaryotic organism. For eukaryotes, the database construction conditions, such as high or low temperature, for expression. In will need to be modified due to the high proportion of noncoding addition, various technical aspects associated with our experiments, sequences. A direct translation of eukaryotic DNA sequences such as incompatibility with buffers for soluble protein extraction, would lead to a dramatic increase of protein database sizes, and may have precluded detection. However, it is possible that appli- translation of mRNA transcripts into protein sequences may be cation of additional protein and peptide separation methods could used to solve this problem. Given the number of discoveries and enable improved coverage of low abundance proteins. modifications for the relatively small genome of Synechococcus A major impact of our methods is highlighted in the unique 7002 (3.41 Mb), we anticipate that similar efforts in organisms capability to experimentally annotate PTM events, which are with larger, less well-studied genomes will yield similar or greater increasingly recognized as playing important roles in prokaryotic sets of novel discoveries. The methods are reasonable in terms biology. Our results provide a level of coverage similar to that of of time, scale, and cost and are therefore accessible to most individual PTM analyses. Comprehensive analysis of PTM events researchers and scientists. The entire annotation pipeline is generalizable to any sequenced prokaryotic organism and may using the same dataset revealed a large and complex repertoire be extended to eukaryotes, with modification in the near fu- of PTMs, including those known to influence various cell pro- ture. We expect that the methods used in this study will be- cesses, including photosynthesis. The identified PTM datasets come an integral part of ongoing genome sequencing and from this study provide a foundation for studying the biological annotation efforts. functions of these PTM events in the photosynthesis process. In principle, the same method may be used for obtaining a holistic Materials and Methods view of signal transduction processes across time and perturba- The WT strain of Synechococcus 7002 was grown under different treatment tion conditions and will be generically applicable to any bi- conditions. Protein extracts were prepared from the cell mixtures, and ological model system and PTMs. proteins were subsequently reduced, alkylated, and digested using Glu-C

Yang et al. PNAS | Published online December 15, 2014 | E5641 Downloaded by guest on September 25, 2021 followed by trypsin. Proteins and peptides were separated by SDS/PAGE, PASS00285). Annotated peptide spectral matches for all of the accepted strong cation exchange , and Self-packed reversed-phase single peptide hits are deposited in PeptideAtlas and supplied as Supple- C18 columns. An Easy-nano liquid chromatography system (Thermo Fisher mentary Data 1. Annotated peptide spectrum matches of identified peptides μ μ × Scientific) equipped with an Easy-Spray 50-cm column (C18,2 m, 75 m with stop codon read-through are deposited in PeptideAtlas and provided as 50 cm; Thermo Fisher Scientific) was coupled to an LTQ Orbitrap Elite mass Supplementary Data 2. Annotated spectra of all of the identified peptides spectrometer (Thermo Fisher Scientific). Database search and spectral anal- with specific PTMs are deposited in PeptideAtlas and presented as Supple- ysis are described in SI Materials and Methods. Genome search specific peptide (GSSP) sequences that resulted from mass spectrometric data mentary Data 3. The snapshots of the genome browser for all of the pre- obtained at high resolution were used for genomic reannotation. Previously viously unidentified genes along with the corrections to the existing gene unidentified genes and gene model revisions were obtained using the gene- models identified in this study are deposited in PeptideAtlas and provided as prediction programs FgenesB and GeneMark, as well as our in-house soft- Supplementary Data 4. All supplemental tables (Tables S1–S9) are deposited ware (python script). Gene Ontology (GO) classification analysis was per- in PeptideAtlas and provided as Supplementary Tables. RNA-Seq data are formed using an in-house Perl script, and enrichment analysis for protein provided as Transcriptome. In-house integrated proteogenomic analysis function classes was performed using a hypergeometric test, with correction software is supplied as Supplementary Software, and detailed information is for multiple hypothesis testing. Conservation analysis was performed provided in the User’s Quick Start Guide file of this software package. using the two-directional BLASTP. For directional RNA-Seq, total RNA was isolated from cell mixtures grown under different growth conditions using ACKNOWLEDGMENTS. This work was supported by the National Basic the RNAprep pure Plant Kit (Tiangen), according to the manufacturer’s Research Program of China (973 Program, 2012CB518700), National Natural instructions, and sequencing was performed on an Illumina GAIIx. A detailed Science Foundation of China Grant 31370746, and the Hundred Talents overview of the methods used herein is presented in SI Materials and Program of the Chinese Academy of Sciences. Research in the laboratory of Methods. MS raw data files as well as processed data files are publicly D.A.B. was supported by National Science Foundation Grants MCB-0519743 available through PeptideAtlas (www.peptideatlas.org) (44) (data set ID and MCB-1021725.

1. Renuse S, Chaerkady R, Pandey A (2011) Proteogenomics. Proteomics 11(4):620–630. 24. Volkening JD, et al. (2012) A proteogenomic survey of the Medicago truncatula ge- 2. Kim MS, et al. (2014) A draft map of the human proteome. Nature 509(7502):575–581. nome. Mol Cell Proteomics 11(10):933–944. 3. Chaerkady R, et al. (2011) A proteogenomic analysis of Anopheles gambiae using 25. Branca RM, et al. (2014) HiRIEF LC-MS enables deep proteome coverage and unbiased high-resolution Fourier transform mass spectrometry. Genome Res 21(11):1872–1881. proteogenomics. Nat Methods 11(1):59–62. 4. Wilhelm M, et al. (2014) Mass-spectrometry-based draft of the human proteome. 26. Ishino Y, Okada H, Ikeuchi M, Taniguchi H (2007) Mass spectrometry-based pro- Nature 509(7502):582–587. karyote gene annotation. Proteomics 7(22):4053–4065. 5. Cao XJ, et al. (2010) High-coverage proteome analysis reveals the first insight of 27. Suzek BE, Ermolaeva MD, Schreiber M, Salzberg SL (2001) A probabilistic method for protein modification systems in the pathogenic spirochete Leptospira interrogans. identifying start codons in bacterial genomes. 17(12):1123–1130. Cell Res 20(2):197–210. 28. Ansong C, et al. (2011) Experimental annotation of post-translational features and 6. Gupta N, et al. (2007) Whole proteome analysis of post-translational modifications: translated coding regions in the pathogen Salmonella Typhimurium. BMC Genomics applications of mass-spectrometry for proteogenomic annotation. Genome Res 17(9): 12:433. 1362–1377. 29. Binns N, Masters M (2002) Expression of the Escherichia coli pcnB gene is transla- 7. Cain JA, Solis N, Cordwell SJ (2014) Beyond gene expression: The impact of protein tionally limited using an inefficient start codon: A second chromosomal example of post-translational modifications in bacteria. J Proteomics 97:265–286. translation initiated at AUU. Mol Microbiol 44(5):1287–1298. 8. Olsen JV, Mann M (2013) Status of large-scale analysis of post-translational mod- 30. Williams I, Richardson J, Starkey A, Stansfield I (2004) Genome-wide prediction of stop ifications by mass spectrometry. Mol Cell Proteomics 12(12):3444–3452. codon readthrough during translation in the yeast Saccharomyces cerevisiae. Nucleic 9. Ahrné E, Müller M, Lisacek F (2010) Unrestricted identification of modified proteins Acids Res 32(22):6605–6616. using MS/MS. Proteomics 10(4):671–686. 31. Jungreis I, et al. (2011) Evidence of abundant stop codon readthrough in Drosophila 10. Armengaud J (2010) Proteogenomics and : Quest for the ultimate and other metazoa. Genome Res 21(12):2096–2113. missing parts. Expert Rev Proteomics 7(1):65–77. 32. Chan CS, Jungreis I, Kellis M (2013) Heterologous stop codon readthrough of meta- 11. Zhang S, Bryant DA (2011) The tricarboxylic acid cycle in cyanobacteria. Science zoan readthrough candidates in yeast. PLoS ONE 8(3):e59450. 334(6062):1551–1553. 33. Hao B, et al. (2002) A new UAG-encoded residue in the structure of a methanogen 12. Johnson ZI, et al. (2006) Niche partitioning among Prochlorococcus ecotypes along methyltransferase. Science 296(5572):1462–1466. ocean-scale environmental gradients. Science 311(5768):1737–1740. 34. Srinivasan G, James CM, Krzycki JA (2002) Pyrrolysine encoded by UAG in Archaea: 13. Chellamuthu VR, Alva V, Forchhammer K (2013) From cyanobacteria to plants: Con- Charging of a UAG-decoding specialized tRNA. Science 296(5572):1459–1462. servation of PII functions during plastid evolution. Planta 237(2):451–462. 35. Ivankov DN, et al. (2013) How many signal peptides are there in bacteria? Environ 14. Rosgaard L, de Porcellinis AJ, Jacobsen JH, Frigaard NU, Sakuragi Y (2012) Bio- Microbiol 15(4):983–990. engineering of carbon fixation, biofuels, and biochemicals in cyanobacteria and 36. Dirix G, et al. (2004) Peptide signal molecules and bacteriocins in Gram-negative plants. J Biotechnol 162(1):134–147. bacteria: A genome-wide in silico screening for peptides containing a double-glycine 15. Lubner CE, et al. (2011) Solar hydrogen-producing bionanodevice outperforms nat- leader sequence and their cognate transporters. Peptides 25(9):1425–1440. ural photosynthesis. Proc Natl Acad Sci USA 108(52):20988–20991. 37. Petersen TN, Brunak S, von Heijne G, Nielsen H (2011) SignalP 4.0: Discriminating 16. Parmar A, Singh NK, Pandey A, Gnansounou E, Madamwar D (2011) Cyanobacteria signal peptides from transmembrane regions. Nat Methods 8(10):785–786. and microalgae: A positive prospect for biofuels. Bioresour Technol 102(22): 38. Hiller K, Grote A, Scheer M, Munch R, Jahn D (2004) PrediSi: Prediction of signal 10163–10172. peptides and their cleavage positions. Nucleic Acids Res 32(Web Server issue): 17. Ludwig M, Bryant DA (2011) Transcription profiling of the model cyanobacterium W375–W379. Synechococcus sp. strain PCC 7002 by Next-Gen (SOLiD™) sequencing of cDNA. Front 39. Yang MK, et al. (2013) Global phosphoproteomic analysis reveals diverse functions of Microbiol 2:41. serine/threonine/tyrosine phosphorylation in the model cyanobacterium Synechococcus sp. 18. Ludwig M, Bryant DA (2012) Acclimation of the global transcriptome of the cyano- strain PCC 7002. J Proteome Res 12(4):1909–1923. bacterium Synechococcus sp. strain PCC 7002 to nutrient limitations and different 40. Na S, Bandeira N, Paek E (2012) Fast multi-blind modification search through tandem nitrogen sources. Front Microbiol 3:145. mass spectrometry. Mol Cell Proteomics 11(4):M111.010199. 19. Nagaraj N, et al. (2011) Deep proteome and transcriptome mapping of a human 41. Cox J, Mann M (2008) MaxQuant enables high peptide identification rates, indi- cancer cell line. Mol Syst Biol 7:548. vidualized p.p.b.-range mass accuracies and proteome-wide protein quantification. 20. Chang C, et al. (2014) Systematic analyses of the transcriptome, translatome, and Nat Biotechnol 26(12):1367–1372. proteome provide a global view and potential strategy for the C-HPP. J Proteome Res 42. Tan M, et al. (2011) Identification of 67 histone marks and histone lysine crotonyla- 13(1):38–49. tion as a new type of histone modification. Cell 146(6):1016–1028. 21. Fanayan S, et al. (2013) Proteogenomic analysis of human colon carcinoma cell lines 43. Zhang Z, et al. (2011) Identification of lysine succinylation as a new post-translational LIM1215, LIM1899, and LIM2405. J Proteome Res 12(4):1732–1742. modification. Nat Chem Biol 7(1):58–63. 22. Schwanhäusser B, et al. (2011) Global quantification of mammalian gene expression 44. Desiere F, et al. (2006) The PeptideAtlas project. Nucleic Acids Res 34(Database issue): control. Nature 473(7347):337–342. D655–D658. 23. Wu L, et al. (2013) Variation and genetic control of protein abundance in humans. 45. Krzywinski M, et al. (2009) Circos: An information aesthetic for . Nature 499(7456):79–82. Genome Res 19(9):1639–1645.

E5642 | www.pnas.org/cgi/doi/10.1073/pnas.1412722111 Yang et al. Downloaded by guest on September 25, 2021