PhyloGene Server for Identification and Visualization of Co-Evolving Proteins Using Normalized Phylogenetic Profiles
Citation: Sadreyev, Ilyas R., Fei Ji, Emiliano Cohen, Gary Ruvkun, and Yuval Tabach. 2015. “PhyloGene server for identification and visualization of co-evolving proteins using normalized phylogenetic profiles.” Nucleic Acids Research 43 (Web Server issue): W154-W159. doi:10.1093/nar/gkv452. http://dx.doi.org/10.1093/nar/gkv452.
Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:17820916

W154–W159 Nucleic Acids Research, 2015, Vol. 43, Web Server issue
Published online 09 May 2015
doi: 10.1093/nar/gkv452

Ilyas R. Sadreyev1, Fei Ji1,2, Emiliano Cohen3, Gary Ruvkun1,2 and Yuval Tabach3,*

1Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA, 2Department of Genetics, Harvard Medical School, Boston, MA, USA and 3Department of Developmental Biology and Cancer Research, The Institute For Medical Research-Israel-Canada, The Hebrew University-Hadassah Medical School, Jerusalem, Israel

Received February 19, 2015; Revised April 16, 2015; Accepted April 24, 2015

*To whom correspondence should be addressed. Tel: +972 54 397 3641; Fax: +972 2 641 4583; Email: [email protected]

© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Proteins that function in the same pathways, protein complexes or the same environmental conditions can show similar patterns of sequence conservation across phylogenetic clades. In species that no longer require a specific protein complex or pathway, these proteins, as a group, tend to be lost or diverge. Analysis of the similarity in patterns of sequence conservation across a large set of eukaryotes can predict functional associations between different proteins, identify new pathway members and reveal the function of previously uncharacterized proteins. We used normalized phylogenetic profiling to predict protein function and identify new pathway members and disease genes. The phylogenetic profiles of tens of thousands of conserved proteins in the human, mouse, Caenorhabditis elegans and Drosophila genomes can be queried on our new web server, PhyloGene. PhyloGene provides an intuitive and user-friendly platform to query the patterns of conservation across 86 animal, fungal, plant and protist genomes. A protein query can be submitted either by selecting the name from whole-genome protein sets of the intensively studied species or by entering a protein sequence. The graphic output shows the profile of sequence conservation for the query and the most similar phylogenetic profiles for the proteins in the genome of choice. The user can also download this output in numerical form.

INTRODUCTION

Phylogenetic profiling of a protein specifies the relative sequence conservation or divergence among orthologous proteins across a set of reference genomes. Proteins that function together, for example members of the same pathway or protein complex, often show similar patterns of conservation across phylogenetic clades. As an extreme case, when a species is no longer under evolutionary pressure to maintain a protein complex, pathway, organelle or specific function, the corresponding proteins are lost or show strong divergence. Phylogenetic profiling based on binary calls of the presence or absence of orthologs in the surveyed genomes produced impressive results in evolutionary analyses of a wide variety of prokaryotes and of eukaryotic organelles, such as the cilia (1,2) and mitochondria (3,4). When applied to eukaryotic genomes, however, this approach has been less effective (5–8), in part due to the difference in evolutionary rates in prokaryotes compared to eukaryotes: evolutionary rates and the resulting sequence divergence are higher in prokaryotes, due to various factors including shorter generation times and horizontal gene transfer. Although the binary phylogenetic profile can provide sufficient resolution for the analysis of divergence among prokaryotic proteins, it can be less accurate in eukaryotes. In eukaryotes, studying loss or retention of proteins without considering the levels of protein conservation between species might lead to higher amounts of noise in the data and limit the analysis. Quantitative estimates of orthologous sequence conservation may provide better resolution within the scale of evolutionary distances between eukaryotic species. At this scale, the simple presence of an orthologous protein in a genome may be insufficient to draw conclusions about divergent evolution, and quantitative measures of sequence similarity would be more informative. These measures, however, should take into account evolutionary distances between the compared species. For example, when comparing a human protein to its mouse and yeast orthologues, a sequence identity of 50% suggests fast sequence divergence in mouse but relatively high conservation in yeast.

Recently, we developed Normalized Phylogenetic Profile (NPP) analysis based on a continuous measure of sequence similarity that is adjusted to evolutionary distance between species. The level of protein conservation for the ortholog in a given genome is normalized to the scale of conservation for all proteins in this genome, in effect taking into account the genome-wide distribution of conservation values expected for this individual species. This method captures protein sequence divergence and partial loss in the context of inter-species phylogenetic distances. Starting with query proteins from a few well-studied genomes, we surveyed and compared their phylogenetic profiles across full protein sets from 86 eukaryotic genomes. These analyses revealed novel components in various biological pathways including RNAi, m6A RNA methylation and the MITF pathway, and identified novel proteins associated with melanoma and other diseases (9–11).

Here we describe the implementation of the NPP algorithm on a web-based server, PhyloGene, which allows the user to submit protein queries, inspect the output in an interactive graphic format and download the output in numerical format.

MATERIALS AND METHODS

To generate phylogenetic profiles, NPP (9–11) performs four steps: (i) for each protein from the genome of interest, run BLASTP (12) against the full protein sets of multiple eukaryotic genomes and choose the top BLAST hit for each genome; (ii) filter out low scores (BLAST similarity score <50) and query proteins that do not have homologs across a portion of genomes; (iii) normalize BLASTP bit scores by the score of the query to itself (i.e. top BLAST score against the same genome); (iv) calculate Z-scores on the population of normalized scores for each separate genome (species). Finally, Pearson correlation coefficients r are calculated [...]

[...] sources like protein-protein interaction maps, results from high-throughput screening or validated genetic pathways and find overlap between the lists as we previously showed (9–11).

When an individual query protein is submitted as a sequence, the workflow is modified: (i) run BLASTP with the submitted sequence as a query against protein databases of multiple eukaryotic genomes and choose the top BLAST hit for each genome; (ii) filter out low scores (BLAST similarity score <50); (iii) normalize BLASTP bit scores by the highest score among genome hits; (iv) transform the top bit score against each ‘subject’ genome into a Z-score based on the pre-computed mean and standard deviation of scores generated by BLAST searches with all proteins from the selected query genome against this ‘subject’ genome. Finally, Pearson correlation coefficients are calculated between the query Z-score profile and the pre-computed Z-score profiles for proteins from the genome of interest, followed by the selection of the top phylogenetic profiles with the highest similarity to that of the query protein. Pearson correlation coefficients for each of the top profiles and their statistical significance are calculated and displayed as described above.

PhyloGene is an intuitive and easy-to-use web tool that implements both of these modes. At the front page, the user can submit the query in the left panel and view the graphic output as a heatmap of sequence similarity patterns across multiple genomes in the right panel. The numerical output for more detailed analysis can be downloaded using a link to the Excel table in the left panel. The query can be submitted by (i) selecting a protein name from the menu of all proteins for a genome of the user’s choice or (ii) submitting a protein sequence by pasting in a window or uploading the sequence
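The four NPP steps listed under Materials and Methods, followed by the Pearson comparison of profiles, can be illustrated with a minimal Python sketch. This is not the PhyloGene implementation. It assumes the top BLASTP bit scores have already been collected into a proteins-by-genomes matrix, and that scores below the cutoff are set to zero (the text only says they are filtered out). The function names `npp_profiles` and `top_coevolving`, the zero-variance guard, and all numbers are hypothetical and chosen for illustration. The removal of query proteins that lack homologs across a portion of genomes is omitted.

```python
import numpy as np


def npp_profiles(bit_scores, self_scores, min_score=50.0):
    """Turn top BLASTP bit scores into normalized phylogenetic profiles.

    bit_scores:  (n_proteins, n_genomes) top bit score of each query protein
                 against each subject genome (step i is assumed done upstream).
    self_scores: (n_proteins,) bit score of each query protein against itself.
    """
    bit_scores = np.asarray(bit_scores, dtype=float)
    self_scores = np.asarray(self_scores, dtype=float)

    # Step ii: drop low scores. Setting them to zero is an assumption here.
    filtered = np.where(bit_scores < min_score, 0.0, bit_scores)

    # Step iii: normalize by the score of the query to itself.
    normalized = filtered / self_scores[:, None]

    # Step iv: Z-score within each genome (column), over all query proteins.
    mean = normalized.mean(axis=0)
    std = normalized.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (normalized - mean) / std


def top_coevolving(z_profiles, query_index, k=3):
    """Rank other proteins by Pearson r between their profile and the query's."""
    r = np.corrcoef(z_profiles)[query_index]  # rows are treated as variables
    order = np.argsort(-r)
    order = order[order != query_index][:k]
    return [(int(i), float(r[i])) for i in order]


# Toy data, invented for illustration: 4 proteins scored against 4 genomes.
# Proteins 0 and 1 are both lost in genomes 1 and 2; protein 2 shows the
# opposite pattern; protein 3 is conserved everywhere.
toy_scores = [
    [180, 40, 30, 150],
    [350, 45, 20, 310],
    [60, 280, 270, 55],
    [90, 95, 92, 88],
]
toy_self = [200, 400, 300, 100]

z = npp_profiles(toy_scores, toy_self)
print(top_coevolving(z, query_index=0, k=2))
```

In this toy case the protein sharing the query's loss pattern ranks first. The sequence-query mode described above differs in that the Z-score uses a pre-computed mean and standard deviation for each subject genome rather than statistics computed from the submitted batch.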