RESEARCH EFFORT AND EVOLUTIONARY PROPERTIES OF GENES

by

Travis Struck

______Copyright © Travis Struck 2016

A Submitted to the Faculty of the

DEPARTMENT OF MOLECULAR AND CELLULAR BIOLOGY

In Partial Fulfillment of the Requirements

For the Degree of

MASTER OF SCIENCE

In the Graduate College

THE UNIVERSITY OF ARIZONA

2016

STATEMENT BY AUTHOR

The thesis titled Research Effort and Evolutionary Properties of Genes prepared by Travis Struck has been submitted in partial fulfillment of requirements for a master’s degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.

Brief quotations from this thesis are allowable without special permission, provided that an accurate acknowledgement of the source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

SIGNED: Travis Struck

APPROVAL BY THESIS DIRECTOR

This thesis has been approved on the date shown below:

May 9th, 2016

Ryan Gutenkunst Date Assistant Professor of Molecular and Cellular Biology

2 Table of Contents

Abstract - Page 5 Research Effort and Evolutionary Properties of Genes - Page 6

3 List of Figures and Tables

Figure 1 - Page 8 Figure 2 - Page 10 Figure 3 - Page 12 Figure 4 - Page 13 Table 1 - Page 15

4 Abstract

Recent research effort (measured in number of publications) on genes is biased towards genes that have been studied heavily in the past. Some factors for why this occurs is that many of these historically studied genes are important for survival or there are more tools available that make genetic studies of them much more accessible. Studies of research effort on Saccharomyces cerevisiae genes characterized with genetic or protein interactions found that there is an aversion to studying lesser-known genes in networks. As well, in a study of three human protein families, many of the genes that have recently been discovered to have association with complex disease, through methods such as genome wide association studies (GWAS), are understudied in the present compared to the small number of historically heavily studied genes. In this study we explore possible causes of and diversion from this preferential bias with gene conservation and human genes being disease-associated. We find there is some evidence of conservation driving biases in research effort for essential genes in Saccharomyces cerevisiae, but inconclusive evidence in other organisms. We look for effects of disease association through Mendelian and complex diseases in a historical, pre-GWAS, and contemporary, post-GWAS, context. Within both contexts we find that Mendelian disease genes may drive preferential study bias. For contemporary research effort we utilize a model of publication rates and find that there are individual GWAS genes that tend to be investigated more than predicted compared to non-GWAS genes. It appears that the proportion of GWAS genes that had highly unexpected increases in publication rate compared to model predictions rose fairly quickly but has been declining. Our analysis suggests that GWAS has had a small impact on what genes some scientists study despite preferential study biases. However GWAS gene-disease association’s impact on research effort appears to be declining, possibly due to scientists not being as interested in GWAS results as time goes on.

5 Research Effort and Evolutionary Properties of Genes

Travis J. Struck1 Brian K. Mannakee2 Ryan N. Gutenkunst1,*

1 Department of Molecular and Cellular Biology, University of Arizona, Tucson, Arizona, United States 2 Division of Epidemiology and , Mel and Enid Zuckerman College of , Tucson, Arizona, United States

* [email protected]

Introduction

Within biology there is a preferential study bias towards genes discovered early on [1, 2, 3], which may drive disproportionate knowledge among genes. This could lead to a situation in which highly studied genes are perceived as functionally important to an organism while others go understudied and their functional importance remains obscure. In molecular evolution it is a common belief that conserved genes are impor- tant [4, 5], despite the fact that conservation is largely uncorrelated with attempts at estimating functional importance through knockout screens [5]. Recent findings from Mannakee and Gutenkunst have suggested that there is a correlation between evolutionary conservation and genes with rate reactions that biochemical models are sensitive to perturbation of [6]. Because there is an aversion to studying less well known genes characterized to have interactions [1], it is possible that the conserved genes in the systems biology models that Mannakee and Gutenkunst found to be sensitive are sensitive due to the models being built around the well studied conserved genes in the model, because more is known about them compared to understudied genes. It has been suggested that disease association is a large driver of genetic studies in humans [7, 8]. However, preferential study biases in molecular biology [2, 3] have possibly led to genes important for un- derstanding human disease being understudied [9, 10]. Most studies geared towards diseases have possibly been on Mendelian diseases due to their nature of a single gene mutation leading to the disease phenotype being more easily identified. Recent advances in genomic technology have allowed for the study of more complex diseases through methods such as genome-wide association studies (GWAS). Previous studies sug- gest that most genes associated with complex diseases through genomic methods, such as GWAS, are not studied more after new knowledge about disease-association is discovered for these genes [2, 3]. These understudied complex disease genes could be viable targets for treating or further understanding of complex diseases [9, 10]. Here we use research effort on genes, measured in number of publications for each individual gene, to explore possible biases in the study of conserved genes and disease association. We find that while high conservation in some genes, especially essential Saccharomyces cerevisiae genes, is correlated with research effort, this correlation is not the case in all organisms. We further explore the idea that disease association may drive research effort, expanding on previous studies looking at kinases, hormone receptors, and nuclear receptors in humans [2, 3], by looking at the research effort of the whole protein coding genome for hu- mans. We look at historical research effort on genes before GWAS would have had influence over research effort and find Mendelian diseases genes, as defined by the Online Mendelian Inheritance in Man (OMIM) database, were preferentially studied while complex disease genes did not receive any particular attention.

6 We then explore contemporary research effort after GWAS would have possibly affected research effort on complex disease genes. Utilizing a model of publication rates, we find there is a small but significant increase in the number of publications when a gene is disease associated through GWAS. These findings suggest the research effort on many genes in organisms may be geared towards Mendelian disease genes rather than conservation level. Also our findings suggest that new technology has a small effect on diverting this preferential study bias. However, the effect of GWAS itself seems to be in decline, possibly due to a loss of interest. Separately from research effort and biases in biology, we will discuss a study building on Pandya and Gutenkunst’s work on positive selection against tyrosine residue content in organisms proteomes [11]. Our study and the Pandya and Gutenkunst study were based on finding that there is a negative correlation be- tween tyrosine content in an organism’s proteome and the number of tyrosine kinases in that organism. We found negative results in agreement with Pandya and Gutenkunst using an alternative method for detecting exposure of tyrosine residues to possibly being spuriously phosphorylated. Our method here involved a pipeline for assembling homology models of 16 metazoans with structural information in the Protein Data Bank (PDB) estimated by the DSSP database.

Results/Discussion

Research Effort and Evolutionary Conservation Findings by Mannakee and Gutenkunst have suggested that the more conserved a gene is, the more sensitive biochemical networks are to perturbation in rate constants of these conserved genes [6]. However, there is a preferential study bias within biology [1, 2, 3], which may be geared towards conserved genes due to theoretical importance [4, 5]. As such, findings from Mannakee and Gutenkunst may be an artifact of preferential study biases for conserved genes as it is possible that theses models may be so sensitive to these conserved genes because there is a disproportionate amount of knowledge about them compared to other gene species in the models. In order to assess if gene research effort, quantified by the number of PubMed publications on a gene, was biased towards evolutionarily conserved genes, we explored the evolution rate of genes, determined by PAML (dN/dS), for four organisms: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and humans (Figure 1). Research effort was fairly correlated with evolution conservation in Sac- charomyces cerevisiae (Spearman rank ⇢ = -0.33) and Drosophila melanogaster (Spearman rank ⇢ = -0.30). There was a weaker correlation in Caenorhabditis elegans (Spearman rank ⇢ = -0.15) and humans (Spear- man rank ⇢ = -0.25). Particularly, we find that essential genes in Saccharomyces cerevisiae have even more of a correlation between research effort and evolutionary conservation (Spearman rank ⇢ = -0.40). Because there is a bias towards preferentially studying older genes [1, 2, 3] and many of Saccharomyces cerevisiaes essential genes were discovered earlier on [1], our findings suggest that preferential study bias of conserved genes may have a role in the study of Saccharomyces cerevisiae. However, due to this conclusion not being as convincing in other organisms, a question remains: what would drive the study of genes in these organisms?

Historical Research Effort and Disease Association Organisms such as Drosophila melanogaster and Caenorhabditis elegans are model organisms of human diseases. As such, it is possible that disease association is a large driver of the study of genes, as has been previously suggested [7, 8]. To explore this, we looked at research effort of Mendelian disease, using Online

7 A 0 20 40 60 80 C 0 100 200 300 400 500 600 Count Count 1e+00 0.200 1e-01 0.050 1e-02 Evolution Rate Evolution Rate Evolution 0.010 1e-03 1e-04

0.002 5 10 20 50 100 200 10 20 50 100 200 500 1000 Publications Publications

B 50 100 150 200 D 0 100 200 300 400 500 600 Count Count 0.500 0.50 0.100 0.20 0.020 Evolution Rate Evolution Rate Evolution 0.05 0.005 0.01 0.001 1 2 5 10 20 50 100 1 5 10 50 500 5000 Publications Publications

Figure 1: Evolutionary rate analysis. A: For Saccharomyces cerevisiae we see there is a Spearman rank ⇢ of -0.33 between the research effort of genes and their conservation level. This is stronger (⇢ of -0.4) when considering only essential genes. B: For Drosophila melanogaster we find a Spearman rank ⇢ of -0.3 between research effort and gene conservation. C: For Caenorhabditis elegans we find a Spearman rank ⇢ of -0.15 between research effort and gene conservation. D: For human we find a Spearman rank ⇢ of -0.25 between research effort and gene conservation.

8 Mendelian Inheritance in Man (OMIM) to identify Mendelian disease-associated genes, and complex dis- eases that have been identified by genome-wide association studies (GWAS), based on the National Human Genome Research Institute (NHGRI) - European Institute (EBI) GWAS catalog. Because it has been suggested that new technology such as GWAS has little effect on preferential study biases, to begin understanding what, if any, effect GWAS would have on research effort, we first explored historical research effort on genes before GWAS would have had an effect on research effort. To do this we looked at research effort up to 2005, the year of the first GWAS, as there would have likely not be much influence on research effort from GWAS during this year [12]. We utilize a Lorenz curve, which is common in economics for analysis of inequalities in income, and find that there is heavy skewing of historical research effort, where only a few genes receive the majority of research effort (Figure 2A). The inequality of genes can be quantified with a Gini coefficient, which we find to be 0.909. When looking at the distribution of research effort given a disease association, we find that historical research effort was skewed towards Mendelian disease genes (Figure 2B, blue). While past the lowest levels of research effort complex disease-associated genes were equally distributed (Figure 2B, red). This suggests that historical research effort has been focused on Mendelian disease genes and that scientists did largely not attend to complex disease genes. Knowing how the historical research effort is distributed would allow us to now understand if contemporary research effort has been influenced by GWAS.

Contemporary Effects of GWAS To find out if there was an effect on contemporary, post-GWAS, research effort we examined the research effort on genes after 2005. We found the distribution of post-GWAS research effort (Figure 2C) to be very similar to pre-GWAS (Figure 2B), with research effort for Mendelian disease gene skewed towards higher levels and research effort on complex disease genes mostly being equal throughout the distribution. Because previous findings had shown that some genes can overcome preferential study biases [2, 3], we utilized a model that would allow us to estimate how research effort on individual genes would have been affected in each year if GWAS had never occurred and compare that to the actual research effort for genes that were found in a GWAS. The model we used came from Pfeiffer and Hoffman (Equation 1) [7], which assumes research effort is based largely on trends. As such, the model predicts the rates of publications for a gene in a given year in an organism, based on a gene’s own publication history up to a given year and from other genes in the organism being published. We fit this model’s five parameters to all non-GWAS genes using a maximum likelihood estimator, and these five parameters would be used to predict the rate of publications for all genes in years from 1950 to 2014 in humans. We found that our model under predicts gene-year research effort for CFH, which was the first gene in a published GWAS in 2005 [12], and that the research effort on some years following the GWAS hit are statistically significantly different from the model’s predictions (Figure 3A). Compared to a gene, PSEN1, that has yet to be associated through GWAS and possessing a similar number of publications as CFH the model fits fairly well with no years being statistically significantly different from the model’s predictions (Figure 3B). Thus this model is viable for the purposes of detecting the effect of GWAS across all GWAS genes for post-GWAS years. To assess the immediate impact that GWAS might have on the research effort of complex disease genes, we looked at a three-year window from when the gene was a GWAS hit and the following two years. We quantified the difference between the model and our data using the average difference between the actual research effort and the predicted research effort over the three-year window. We found a slight shift in 10 the distribution of GWAS genes compared to control genes (Figure 3C, Mann-Whitney U p<10 ), which contained a large range of values. We quantified this distribution by the average difference between

9 A B 1.0

0.8

0.6

0.4

Percentage of Publications Percentage 0.2

0.0 0.0 0.2 0.4 0.6 0.8 1.0 Percentage of Genes C

D

Figure 2: The distribution of Research Effort is highly skewed towards Mendelian disease genes in humans. A: Lorenz plot of the entire distribution of gene publications for years up to 2005. The inequality of this distribution can be quantified by a Gini coefficient of 0.909. B: We binned our genes into sets of 2323 based on similar range of research effort for years up to 2005. In each bin we looked at what percentage of the genes were found to be a Mendelian disease gene through OMIM (blue), a complex disease gene through GWAS (red), or either type of disease gene (purple) to approximate a probability of a gene being associated with a disease at a given level of research effort. There is a high concentration of Mendelian disease genes at higher levels of research effort while GWAS past the lowest level is equally distributed. C: Same as B, but research effort for years 2006 to 2014. There are not any noticeable changes in the trends.

10 GWAS genes and control genes in this distribution, which resulted in 0.44 more publications for GWAS genes on average over this three-year window. This suggests that GWAS genes have received a small but statistically significant boost in research effort shortly after their association to a complex disease through GWAS compared to non-GWAS genes. Some GWAS genes have very large differences between what their actual research effort is and what the model predicts. To assess the trend of the immediate impact that GWAS has on genes over time, we only looked at all the genes and years where research effort was statistically significantly different from predicted for the gene-year, after a Bonferroni correction for statistical significance. It appears that the immediate impact of GWAS on research effort increased quickly over time, but has been steadily declining (Figure 3D). It appears that this may be due to GWAS having less of an impact on the amount of publications that a gene will receive from being a GWAS hit, as the overall number of new genes that GWAS associates to complex diseases appears to be increasing each year since 2005 (Figure 3E). Overall, while Mendelian disease genes were highly focused on both historically (Figure 2B, blue) and presently (Figure 2C, blue), it is possible for some genes to overcome this preferential study bias in human genes due to technology that can discover complex disease associations (Figure 3A and 3C). However, this effect may decline over time (Figure 3D), possibly due to a decline in interest in new genes associated with complex diseases through GWAS (Figure 3E).

Structural biology and evolution Tan et al. has suggested that because of the strong negative correlation between the number of tyrosine kinases and the amount of tyrosine in an organism’s proteome, there is possibly positive selection against tyrosine residue content in organism proteomes to reduce the chance of spurious phosphorylation that could be deleterious [13]. Previous work in our group by Pandya and Gutenkunst using SPINE-X [14] to assess the accessibility of tyrosine residues to tyrosine kinases led to negative results regarding Tan et al.’s spuri- ous phosphorylation hypothesis [11]. SPINE-X results estimated that there is very little difference in how accessible tyrosine would be to tyrosine kinases across 16 metazoan organisms. This opposed the spurious phosphorylation hypothesis, which would have predicted that organisms with more tyrosine kinases would have less exposed tyrosine to avoid spurious phosphorylation. Because of the aversion to negative results in science [15], we utilize a pipeline to assemble homology models for all 16 metazoans with the sequences of proteins with solved structures from the Protein Data Bank (PDB) [16] in the DSSP [17] in order to obtain an alternate source of accessibility scores for tyrosine residues. Our results are in agreement with Pandya and Gutenkunst, finding that all metazoans seem to have very similar amounts of tyrosine accessibility across their proteomes (Figure 4). An additional test that Pandya and Gutenkunst performed on the spurious phosphorylation hypothesis was to examine the tyrosine content in ordered and disorder regions between orthologous proteins in hu- mans and Saccharomyces cerevisiae. If the spurious phosphorylation hypothesis held, it would be expected that there would be a greater loss of tyrosine content in disordered regions of human proteins compared to Saccharomyces cerevisiae due to the exposure of disordered regions to tyrosine kinases compared to or- dered. However Pandya and Gutenkunst found a fairly similar loss of tyrosine residues in human orthologs compared to Saccharomyces cerevisiae orthologs in both ordered and disordered regions. We preformed an additional control comparing Drosophila melanogaster orthologs to human orthologs to test for simi- lar results to Saccharomyces cerevisiae. Indeed, we found similar results where the tyrosine content loss was fairly equal in ordered and disorded regions in human orthologs compared to Drosophila melanogaster orthologs (data not shown). Overall, our results were in agreement with Pandya and Gutenkunst that the spurious phosphorylation

11 A B C

D E

Figure 3: GWAS effect on contemporary research. A: The publication rate data (solid line) vs. the model’s predicted publication rate (dashed line) for CFH, the first GWAS hit. The data statistically significantly differs from the model after a Bonferroni correction in 2006 and 2008. B: The publication rate data (solid line) vs. the model’s predicted publication rate (dashed line) for PSEN1, a non-GWAS gene with very similar number of publications to CFH, which never statistically significantly differs from the model. C: Probability density distribution of differences between actual number of publications and model predictions for the year each GWAS gene (red) was associated with a disease in GWAS and the following two years. For each year a GWAS gene was a GWAS hit a random non-GWAS gene with a similar total number of publications up to the year the GWAS gene was a hit was used as a control taking the difference between the control gene’s actual publications and predicted publication. The distribution of GWAS genes is slightly shifted to the right of the control distribution. D: The distribution of the proportion of statistically significantly different GWAS genes within the three-year window of being a GWAS hit. This distribution increases quickly but is also declining quickly. E: Distribution of new GWAS hits over time between 2005 and 2014. There is a steady increase in the number of GWAS genes being discovered.

12 Figure 4: The solvent accessibility of all tyrosine residues across 16 metazoans is highly similar. Cumulative distributions of tyrosine residues solvent accessibility, determined by DSSP homology models, are highly similar among 16 metazoans. Results are highly similar to those found by Pandya and Gutenkunst. hypothesis was not the correct hypothesis for the decrease in tyrosine residues. It is possible that the loss of tyrosine was due to the increase in GC content [18]. Another possible explanation is that loss of tyrosine in proteins may be due to the evolution of metabolic pathways utilizing tyrosine for neurotransmitter and hormone synthesis made less tyrosine being taken up to synthesis proteins more favorable [19].

Materials and Methods

Evolutionary rate data and analysis Saccharomyces cerevisiae evolution rate data was retrieved from Wall et al.’s ”Functional genomic analysis of the rates of protein evolution”, supplementary table 4 [20]. Drosophila melanogaster evolution rate data was retrieved from FlyBase’s PAML summary data from 12 species analysis from Clark Eisen et al.’s genome analysis [21]. Caenorhabditis elegans evolution data was retrieved from Castillo-Davis et al.’s ”The Functional Ge- nomic Distribution of Protein Divergence in Two Animal Phyla: Coevolution, Genomic Conflict, and Con- straint” supplementary file 2 [22]. Human evolution data was retrieved from Lindblad-Toh et al.’s ”A high-resolution map of human evo- lutionary constraint using 29 mammals” supplementary info [23].

Publication data Entrez GeneIDs and HGNC symbols were collected for all human protein-coding genes from NCBI Gene [24]. Publication data for all organisms was retrieved from NCBI Gene FTP site’s Gene2PubMed file (ftp: //ftp.ncbi.nlm.nih.gov/gene/DATA/), downloaded on November 25th, 2015. This data is man- ually curated by the Index Section in the National Library of Medicine when annotating genes for function

13 based off publications. Addition publications for genes on NCBI gene may be drawn from organism specific databases, Gene Ontology, and available curated systems biology data. Publication information was pulled from PubMed by BioPython [25].

GWAS & OMIM data GWAS data was retrieved November 25th, 2015 from the NHGRI-EBI’s Catalog of Published Genome- Wide Association Studies [26]. We filtered for genes that were considered disease genes according to the Experimental Functional Ontology [27]. The GWAS catalog contains GWAS hits based on what the authors of the GWAS publication reports and based on a pipeline that maps the reported SNPs to gene through NCBI Gene and NCBI dbSNP. We pooled the genes identified by the authors and the NHGRI-EBI into one set. If the SNP was within 500 bases of a gene we considered that gene to be a GWAS hit. OMIM data was retrieved July 8th, 2015 from OMIM FTP site [28]. We filtered out genes that were associated with complex disease, were not diseases, and did not have more than one confirmed source of mapping the disease to a gene. When distinguishing any crossover between OMIM and GWAS, we went with whichever had the earliest marked year for the gene entry. In cases where an OMIM gene had an ID below 600000, it was marked as being from 1994, as it was from before the online version of OMIM. In cases where years matched between GWAS and OMIM, the gene was considered a GWAS gene since our findings suggest that most Mendelian disease genes were discovered early. More info for these filters can be found on the OMIM FAQ. This resulted in 4917 GWAS genes and 2013 OMIM genes in our datasets.

Publication Rate Model The model of publication rates comes from Pfeiffer and Hoffman (2007) [7].

k1Pt⇤ + k2Pi,t + k3 Pi,t+1 = ↵ (1) 1+(Pt⇤/PS) The publication rate model estimates the mean number of publications assuming they are Poisson dis- tributed Pi,t+1, which is the expected number of publications for gene i in year t+1 based on all publi- cations for gene i up to and including to year t (Pi,t) and the average number of publications for all genes in an organism up to and including year t (Pt⇤). The parameters in the numerator describe the exponential growth of publications. k1, the rate of publication on gene i due to publications on all other genes up to year t,k2, the rate at which publications on gene i occur due to prior publications on gene i, and k3, the rate of spontaneous publications to describe the initial publication on gene i. The parameters in the denominator, PS and ↵, are saturation terms to represent a point in time publications will start to saturate and how quickly that saturation occurs. It is assumed that the actual number of publications will follow a Poisson distribution n giving the function f (;n) = e /n!. We fit the parameters to publication data from 1949 to 2014 for all genes never associated to disease through GWAS using a maximum likelihood estimator with SciPy [29] on the product of the function f ( = Pi,t+1; n = actual number of publications for gene i in year t+1) for each gene i and each year t. Values for our parameters can be found in Table 1. To test for gene-years where the number of observed publications was statistically significant from the models predicted number of publications was determined by calculating the probability of getting k or more publications given the expected number of publications in a Poisson distribution; where k is the number of observed publications for gene i on year t and is the predicted number of publications for gene i on year t. A Bonferroni correction was used to set our cutoff at 0.05/(NgNy), where Ng = 20907 genes and

14 Table 1 Parameter Value k1 0.036 k2 0.22 k3 0.002 PS 13.126 ↵ 1.619

N = 65 years. Here our statistical significance cutoff is 3.67x10 8 after the Bonferroni correction. This y resulted in 1129 gene-years having a statistically significant number of publications among 550 genes.

DSSP Homology Model The proteomes of the 16 species used for analysis of the spurious phosphorylation hypothesis were down- loaded from Ensembl release 72 [30]. In cases of multiple splice forms of a gene, the longest protein was chosen. Orthologs in the DSSP database [17, 31] were retrieved using NCBI BLAST 2.2.26 [32]. Matching tyrosine sites was done by aligning orthologs with MUSCLE 3.8.31 [33]. If 40% of bases in orthologs matched when aligned then they were used in our analysis. DSSP accessibility score was assigned only to matching residues.

References

1. Xionglei He and Jianzhi Zhang. On the growth of scientific knowledge: Yeast biology as a case study. PLoS , 5(3):1–12, 2009.

2. Aled M Edwards, Ruth Isserlin, Gary D Bader, Stephen V Frye, Timothy M Willson, and Frank H Yu. Too many roads not taken. Nature, 470(7333):163–165, 2011.

3. Ruth Isserlin, Gary D. Bader, Aled Edwards, Stephen Frye, Timothy Willson, and Frank H. Yu. The human genome and drug discovery after a decade. Roads (still) not taken. page 14, 2011.

4. Csaba Pal,´ Balazs´ Papp, and Martin J Lercher. An integrated view of protein evolution. Nature reviews. Genetics, 7(5):337–48, may 2006.

5. Zhi Wang and Jianzhi Zhang. Why is the correlation between gene importance and gene evolutionary rate so weak? PLoS genetics, 5(1):e1000329, jan 2009.

6. Brian K Mannakee and Ryan N Gutenkunst. Selection on network dynamics drives differential rates of protein domain evolution. bioRxiv, page 026658, 2015.

7. Thomas Pfeiffer and Robert Hoffmann. Temporal patterns of genes in scientific publications. Pro- ceedings of the National Academy of Sciences of the United States of America, 104(29):12052–6, 2007.

8. Robert Hoffmann and Alfonso Valencia. Life cycles of successful genes. Trends in Genetics, 19(2):79–81, 2003.

15 9. J K White, A K Gerdin, N A Karp, E Ryder, M Buljan, J N Bussell, J Salisbury, S Clare, N J Ingham, C Podrini, R Houghton, J Estabel, J R Bottomley, D G Melvin, D Sunter, N C Adams, D Tannahill, D W Logan, D G Macarthur, J Flint, V B Mahajan, S H Tsang, I Smyth, F M Watt, W C Skarnes, G Dougan, D J Adams, R Ramirez-Solis, A Bradley, and K P Steel. Genome-wide Generation and Systematic Phenotyping of Knockout Mice Reveals New Roles for Many Genes. Cell, 154(2):452– 464, 2013.

10. David C. Hess, Chadl Myers, Curtis Huttenhower, Matthew A. Hibbs, Alicia P. Hayes, Jadine Paw, John J. Clore, Rosa M. Mendoza, Bryan San Luis, Corey Nislow, Guri Giaever, Michael Costanzo, Olga G. Troyanskaya, and Amy A. Caudy. Computationally driven, quantitative experiments discover genes required for mitochondrial biogenesis. PLoS Genetics, 5(3), 2009.

11. S. Pandya and R. Gutenkunst. Directional Selection on Tyrosine Frequences in Eukaryotes Versus Solvent Accessibility. Electronic thesis, University of Arizona, 2013.

12. Robert J Klein, Caroline Zeiss, Emily Y Chew, Jen-Yue Y Tsai, Richard S Sackler, Chad Haynes, Alice K Henning, John Paul SanGiovanni, Shrikant M Mane, Susan T Mayne, Michael B Bracken, Frederick L Ferris, Jurg Ott, Colin Barnstable, and Josephine Hoh. Complement Factor H Polymor- phism in Age-Related Macular Degeneration. Science, 308(5720):385–389, 2005.

13. C. Tan, A. Pasculescu, W. Lim, and T. Pawson. Positive Selection of Tyrosine Loss in Metazoan Evolution. Science, 332(6032):917–917, 2009.

14. Eshel Faraggi, Tuo Zhang, Yuedong Yang, Lukasz Kurgan, and Yaoqi Zhou. SPINE X: Improving protein secondary structure prediction by multistep learning coupled with prediction of solvent acces- sible surface area and backbone torsion angles. Journal of Computational Chemistry, 33(3):259–267, 2012.

15. Natalie Matosin, Elisabeth Frank, Martin Engel, Jeremy S Lum, and Kelly a Newell. Negativity to- wards negative results: a discussion of the disconnect between scientific worth and scientific culture. Disease models & mechanisms, 7(2):171–3, 2014.

16. F C Bernstein, T F Koetzle, G J Williams, E F Jr Meyer, M D Brice, J R Rodgers, O Kennard, T Shi- manouchi, and M Tasumi. The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of molecular biology, 112(3):535–542, may 1977.

17. Wolfgang Kabsch and Christian Sander. Dictionary of protein secondary structure: Pattern recogni- tion of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–2637, 1983.

18. Zhixi Su. Comment on Positive Selection of. 917(September 2009):1–3, 2011.

19. Katrina M. Kutchko and Jessica Siltberg-Liberles. Metazoan innovation: From aromatic amino acids to extracellular signaling. Amino Acids, 45(2):359–367, 2013.

20. Dennis P Wall, Aaron E Hirsh, Hunter B Fraser, Jochen Kumm, Guri Giaever, Michael B Eisen, and Marcus W Feldman. Functional genomic analysis of the rates of protein evolution. Proceedings of the National Academy of Sciences, 102(15):5483–5488, 2005.

21. Evolution of genes and genomes on the Drosophila phylogeny. Nature, 450(7167):203–218, nov 2007.

16 22. Cristian I Castillo-Davis, Fyodor a Kondrashov, Daniel L Hartl, and Rob J Kulathinal. The functional genomic distribution of protein divergence in two animal phyla: coevolution, genomic conflict, and constraint. Genome research, 14(5):802–11, may 2004.

23. Kerstin Lindblad-Toh, Manuel Garber, Or Zuk, Michael F Lin, Brian J Parker, Stefan Washietl, Pouya Kheradpour, Jason Ernst, Gregory Jordan, Evan Mauceli, Lucas D Ward, Craig B Lowe, Alisha K Holloway, Michele Clamp, Sante Gnerre, Jessica Alfoldi,¨ Kathryn Beal, Jean Chang, Hiram Clawson, James Cuff, Federica Di Palma, Stephen Fitzgerald, Paul Flicek, Mitchell Guttman, Melissa J Hubisz, David B Jaffe, Irwin Jungreis, W James Kent, Dennis Kostka, Marcia Lara, Andre L Martins, Tim Massingham, Ida Moltke, Brian J Raney, Matthew D Rasmussen, Jim Robinson, Alexander Stark, Albert J Vilella, Jiayu Wen, Xiaohui Xie, Michael C Zody, Jen Baldwin, Toby Bloom, Chee Whye Chin, Dave Heiman, Robert Nicol, Chad Nusbaum, Sarah Young, Jane Wilkinson, Kim C Worley, Christie L Kovar, Donna M Muzny, Richard A Gibbs, Andrew Cree, Huyen H Dihn, Gerald Fowler, Shalili Jhangiani, Vandita Joshi, Sandra Lee, Lora R Lewis, Lynne V Nazareth, Geoffrey Okwuonu, Jireh Santibanez, Wesley C Warren, Elaine R Mardis, George M Weinstock, Richard K Wilson, Kim Delehaunty, David Dooling, Catrina Fronik, Lucinda Fulton, Bob Fulton, Tina Graves, Patrick Minx, Erica Sodergren, , Elliott H Margulies, Javier Herrero, Eric D Green, David Haussler, Adam Siepel, Nick Goldman, Katherine S Pollard, Jakob S Pedersen, Eric S Lander, and Manolis Kellis. A high-resolution map of human evolutionary constraint using 29 mammals. Nature, 478(7370):476–82, 2011.

24. Garth R. Brown, Vichet Hem, Kenneth S. Katz, Michael Ovetsky, Craig Wallin, Olga Ermolaeva, Igor Tolstoy, Tatiana Tatusova, Kim D. Pruitt, Donna R. Maglott, and Terence D. Murphy. Gene: A gene-centered information resource at NCBI. Nucleic Acids Research, 43(D1):D36–D42, 2015.

25. Peter J A Cock, Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J L De Hoon. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, 2009.

26. T (EBI) Burdett, P N (NHGRI) Hall, E (EBI) Hasting, L A (NHGRI) Hindorff, H A (NHGRI) Junk- ins, A K (NHGRI) Klemm, J (EBI) MacArthur, T A (NHGRI) Manolio, J (EBI) Morales, H (EBI) Parkinson, and D (EBI) Welter. The NHGRI-EBI Catalog of published genome-wide association studies. url www.ebi.ac.uk/gwas . \ { } 27. James Malone, Ele Holloway, Tomasz Adamusiak, Misha Kapushesky, Jie Zheng, Nikolay Kolesnikov, Anna Zhukova, Alvis Brazma, and Helen Parkinson. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics, 26(8):1112–1118, 2010.

28. Joanna S. Amberger, Carol A. Bocchini, Franc¸ois Schiettecatte, Alan F. Scott, and Ada Hamosh. OMIM.org: Online Mendelian Inheritance in Man (OMIM R ), an Online catalog of human genes and genetic disorders. Nucleic Acids Research, 43(D1):D789–D798, 2015.

29. Stefan´ van der Walt, S Chris Colbert, and Gael¨ Varoquaux. The NumPy Array: A Struture for Efficient Numerical Computation. Computing in Science & Engeneering, 13:22–30, 2011. { } 30. Paul Flicek, M. Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, Laurent Gil, Carlos Garc´ıa

17 Giron,´ Leo Gordon, Thibaut Hourlier, Sarah Hunt, Nathan Johnson, Thomas Juettemann, Andreas K. Kah¨ ari,¨ Stephen Keenan, Eugene Kulesha, Fergal J. Martin, Thomas Maurel, William M. McLaren, Daniel N. Murphy, Rishi Nag, Bert Overduin, Miguel Pignatelli, Bethan Pritchard, Emily Pritchard, Harpreet S. Riat, Magali Ruffier, Daniel Sheppard, Kieron Taylor, Anja Thormann, Stephen J. Tre- vanion, Alessandro Vullo, Steven P. Wilder, Mark Wilson, Amonida Zadissa, Bronwen L. Aken, Ewan Birney, Fiona Cunningham, Jennifer Harrow, Javier Herrero, Tim J P Hubbard, Rhoda Kinsella, Matthieu Muffato, Anne Parker, Giulietta Spudich, Andy Yates, Daniel R. Zerbino, and Stephen M J Searle. Ensembl 2014. Nucleic Acids Research, 42(D1):749–755, 2014.

31. Robbie P. Joosten, Tim A H Te Beek, Elmar Krieger, Maarten L. Hekkelman, Rob W W Hooft, Reinhard Schneider, Chris Sander, and Gert Vriend. A series of PDB related databases for everyday needs. Nucleic Acids Research, 39(SUPPL. 1):411–419, 2011.

32. S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, oct 1990.

33. Robert C. Edgar. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.

18