Phylogenetic Profiling of the Arabidopsis
Aren Ewing and Gernot Presting
Department of Molecular Biosciences and Bioengineering, University of Hawaii, Honolulu, HI 96822, USA

Introduction:
The goal of this project is to assign genes to pathways based on co-evolution. Arabidopsis thaliana is the first fully sequenced plant, and many (60%) of its genes are functionally annotated (http://us.expasy.org/sprot/ppap/). This makes it a good candidate for phylogenetic studies. Utilizing existing information about pathways of Arabidopsis (AraCyc), a tool was created to find novel genes that may be associated with known pathways. Similar studies have been performed by profiling microbial genomes such as E. coli (Pellegrini, 1999). This is the first time Arabidopsis has been analyzed with phylogenetic profiling. In addition, we implemented a novel automated analysis to find groups of pathway members that display similar gene profiles.

Results:
Clustered pathway proteins were found in the hierarchical tree using the greedy outlier removal program. Tryptophan and isoprenoid biosynthesis pathways were examined in detail. Most of the proteins in the pathways do cluster; those that do not either catalyze the first enzymatic step or are misannotated or mispredicted (Figure 3). New pathway members were found near known pathway members in the hierarchical tree (Figure 4).

Figure 1 (left and poster border): All of the Arabidopsis genes with at least one match outside Arabidopsis (19324 rows) clustered in a binary tree. Clusters of unique profiles are scattered throughout the tree. Red color signifies presence of that Arabidopsis gene in target organism, black indicates absence. Legend markers: Isoprenoid, Tryptophan.

Distribution of gene profiles
[Plot: Prevalence of Arabidopsis Genes in Completely Sequenced Organisms. x-axis: number of genomes in which gene is present (0 to 180); y-axis: number of Arabidopsis genes (0 to 1600); series: Initial E Value Threshold, Revised E Value Threshold.]
Figure 2: Distribution of Arabidopsis Gene Matches in 170 Non-Plant Genomes. The number of Arabidopsis genes with matches to 1 or more completely sequenced non-plant genomes is shown. Initially, an E value of 1e-03 was used to determine presence or absence of an Arabidopsis gene in a target organism. This value was later revised to 1e-05 for [...] and 1e-06 for [...] based on analysis of single BLAST hits outside of Arabidopsis.

Gene profiles of 2 pathways
Figure 3: Gene profile clusters of 11 isoprenoid and 24 tryptophan biosynthesis genes. Data are presented as a matrix with Arabidopsis genes in rows and organisms in columns and sorted by kingdom (Archaebacteria = white, Eubacteria = grey, Eukaryota = dark grey). Red color signifies presence of that Arabidopsis gene in target organism, black indicates absence.
[Matrix image: one 0/1 row per gene listed below, one column per genome, labeled with genome abbreviations such as A_ap, A_af, A_hsp, E_hu, E_mu, E_rat.]

Isoprenoid Biosynthesis:
Three isozymes of first enzymatic step in isoprenoid pathway: At3g21500.1, At4g15560.1, At5g11380.1
Eight genes encoding six enzymes that catalyze steps 2-7 in isoprenoid pathway: At2g02500.1, At2g26930.1, At5g62790.1, At1g63970.1, At1g63970.2, At4g34350.1, At5g60600.1, At5g60600.2

Tryptophan Biosynthesis:
Anthranilate synthase alpha subunit cluster (first step of tryptophan biosynthesis pathway in AraCyc): At5g05730.1, At2g29690.1, At3g55870.1
Anthranilate synthase beta subunit cluster: At1g25220.1, At1g24909.1, At5g57890.1
Cluster of 10 tryptophan genes encoding steps 2-5 of the tryptophan biosynthesis pathway (AraCyc): At5g17990.1, At5g48220.1, At2g04400.1, At3g54640.1, At4g27070.1, At5g38530.1, At5g54810.1, At4g02610.1, At5g05590.1, At1g07780.2
Eight remaining genes of tryptophan biosynthesis pathway: At1g29410.1, At1g51110.1, At5g06850.1, At3g61720.1, At5g12970.1, At3g57880.1, At3g61300.1, At1g51570.1

Five new genes found in pathways
Figure 4: Amino Acid Biosynthesis Genes Cluster Together. Hierarchical tree was produced by clustering all genes with BLAST matches (based on revised E values) in 170 non-plant genomes by their phylogenetic profile. Genes are listed by their AGI number followed by the number of the step they catalyze in the pathway. Genes that had not previously been assigned to their respective pathway in AraCyc are in bold. (chor = chorismate, trp = tryptophan and his = histidine biosynthesis).

Conclusion:
Members of several biosynthetic pathways clustered based on their phylogenetic profiles using several different hierarchical trees and datasets. Within these compact clusters we found new metabolic pathway members that were not annotated in AraCyc. This study demonstrates the utility of a novel tool for assigning new members to pathways through the use of a high throughput pathway/list search.
A web interface allows submission of a list of Arabidopsis genes and ranks the genes as they are clustered within the tree, finding the largest and densest cluster (genomics.hawaii.edu/prestinglab/projects/phyloP/).
This tool may be useful for analyzing lists of co-regulated genes from microarray studies, or incomplete or poorly characterized metabolic pathways, for gene co-evolution. This method may also be useful for finding evolutionary events such as lateral gene transfer.

Future Improvements:
While this approach to assigning genes to pathways appears to function well for some pathways, opportunities for improvement exist: other clustering and analysis methods, alteration of presence threshold, or different alignment types (Smith-Waterman) are options that will be explored. Additional data such as improved gene annotations, increased number of fully sequenced organisms, and more detailed metabolic pathway information will further enhance phylogenetic profiling.

Relevant Literature:
Eisen, J. A. 1998. Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8: 163-167.
Lange, B. M. and M. Ghassemian 2003. Genome organization in Arabidopsis thaliana: a survey for genes involved in isoprenoid and chlorophyll metabolism. Plant Mol Biol 51: 925-948.
Pazos, F. and A. Valencia 2001. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 14: 609-614.
Pellegrini, M., E. M. Marcotte, M. J. Thompson, D. Eisenberg and T. O. Yeates 1999. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci USA 96: 4285-4288.

Acknowledgments:
We thank MHPCC (Maui High Performance Computing Center) for a student engagement grant supporting Aren Ewing.

Further Information:
genomics.hawaii.edu/prestinglab/
Datasets and web interface of Greedy Outlier Algorithm: genomics.hawaii.edu/prestinglab/projects/phyloP/
[email protected]

Methods:
DOWNLOAD GENOMES: Select and download predicted proteins of all fully sequenced genomes available from NCBI.
BLAST COMPARISON: Compare proteins in Arabidopsis against each of the (170) other genomes using Basic Local Alignment Search Tool (BLAST). [Diagram: Arabidopsis compared against Bacteria, Yeast, Drosophila, Mouse, Rat, and Human.]
CLUSTER DATA: Perform agglomerative hierarchical clustering on formatted matrix with presence / absence calls, generating a binary tree of gene profiles (Rows=Genes, Columns=Genomes). [Illustration: excerpt of the formatted matrix, with a Gene_name column followed by one 0/1 column per genome (A_ap, A_af, A_hsp, ...) and one row per Arabidopsis gene model (At1g01010.1, At1g01020.1, ...).]
Analyze Clusters: Find clusters of known pathways within tree followed by manual annotation using a custom "Greedy Outlier Removal" program.
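The BLAST comparison and clustering steps in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names (`build_profiles`, `hamming`, `cluster`), the gene labels `geneA` to `geneD`, the toy E-values, the single 1e-05 cutoff, and the choice of Hamming distance with average linkage are all assumptions made for illustration. The poster itself states only that presence/absence calls were clustered by agglomerative hierarchical clustering into a binary tree.

```python
from itertools import combinations

def build_profiles(best_evalues, genomes, threshold=1e-5):
    """Convert best BLAST E-values into 0/1 presence calls, one per genome."""
    profiles = {}
    for gene, hits in best_evalues.items():
        # A genome with no reported hit counts as absent (E-value = infinity).
        row = tuple(int(hits.get(g, float("inf")) <= threshold) for g in genomes)
        if any(row):  # keep genes with at least one match outside Arabidopsis
            profiles[gene] = row
    return profiles

def hamming(a, b):
    """Number of genomes in which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))

def cluster(profiles):
    """Naive average-linkage agglomeration; returns a nested-tuple binary tree."""
    clusters = [(gene, [gene]) for gene in sorted(profiles)]  # (subtree, members)

    def dist(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(hamming(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters; ties go to the first pair found.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Toy input: hypothetical genes and E-values, three genome labels from the poster.
genomes = ["A_ap", "A_af", "E_hu"]
best_evalues = {
    "geneA": {"A_ap": 1e-30, "E_hu": 1e-4},
    "geneB": {"A_ap": 1e-12},
    "geneC": {"E_hu": 1e-50},
    "geneD": {},  # no match anywhere, so it is dropped
}
profiles = build_profiles(best_evalues, genomes)
tree = cluster(profiles)
print(profiles)
print(tree)
```

Raising `threshold` to 1e-03 turns the `geneA` call for `E_hu` from absent to present, which is the kind of shift the initial and revised thresholds in Figure 2 refer to. The clustering loop here is far too slow for 19324 rows and is meant only to show the idea on a handful of genes.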

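[Plot: Greedy Outlier Removal of Pathway Members. x-axis: Number of genes removed (0 to 30); y-axis: Subset size (log scale, 1 to 100000); series: Tryptophan Biosynthesis, Cytokinin Biosynthesis, Isoprenoid Biosynthesis, Random 1, Random 2, Random 3.]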
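The poster does not describe how the "Greedy Outlier Removal" program works internally. One plausible reading of the plot axes (subset size against number of genes removed) is sketched below in Python. Under this assumption, "subset size" is the number of leaves in the smallest subtree that contains every remaining pathway gene, and each step removes the gene whose removal shrinks that subtree the most. The tree encoding (nested tuples), the function names, and the gene labels `g1` to `g4` and `x1` to `x3` are hypothetical.

```python
def leaves(tree):
    """Return the leaf names under a nested-tuple binary tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def smallest_subtree_size(tree, targets):
    """Leaf count of the smallest subtree that contains every target gene."""
    targets = set(targets)
    node = tree
    while not isinstance(node, str):
        # Descend for as long as a single child still holds all targets.
        child = next((c for c in node if targets <= set(leaves(c))), None)
        if child is None:
            break
        node = child
    return len(leaves(node))

def greedy_outlier_removal(tree, pathway_genes, max_removed=None):
    """Drop, one at a time, the gene whose removal shrinks the subtree most.

    Returns a list of (genes_removed, subset_size, removed_gene) tuples,
    starting with the state before any removal (removed_gene is None).
    """
    remaining = list(pathway_genes)
    trace = [(0, smallest_subtree_size(tree, remaining), None)]
    limit = len(remaining) - 1 if max_removed is None else max_removed
    for n in range(1, limit + 1):
        if len(remaining) <= 1:
            break
        # Ties are broken by list order (first gene wins).
        outlier = min(
            remaining,
            key=lambda g: smallest_subtree_size(
                tree, [x for x in remaining if x != g]),
        )
        remaining.remove(outlier)
        trace.append((n, smallest_subtree_size(tree, remaining), outlier))
    return trace

# Toy tree: g1-g3 sit together, g4 is an outlier among unrelated genes x1-x3.
tree = ((("g1", "g2"), "g3"), (("x1", "x2"), ("x3", "g4")))
trace = greedy_outlier_removal(tree, ["g1", "g2", "g3", "g4"])
for removed, size, gene in trace:
    print(removed, size, gene)
```

In the toy tree, removing the single outlier `g4` shrinks the enclosing subtree from 7 leaves to 3. A sharp early drop of this kind is what would distinguish a co-clustering pathway from a random gene set under this reading of the plot.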