Essay Deducing Function by Forensic Integrative Cell Biology

William C. Earnshaw* Wellcome Trust Centre for Cell Biology, University of Edinburgh, ICB, Edinburgh, Scotland, United Kingdom

even essential alternative functions that are what part it plays in the particular cellular Summary currently unsuspected. process I happen to be studying. Depend- Our ability to sequence genomes The subject of this essay is an important ing on the organism, the functions of some has provided us with near-complete emerging area in cell biology research: 20%–60% of are uncertain [2]. lists of the proteins that compose how to predict the functions of unchar- As referred to here, ‘‘uncharacterised cells, tissues, and organisms, but acterised and unknown proteins and how proteins’’ are proteins that are present in this is only the beginning of the to identify and characterise novel func- annotated databases, but whose functions process to discover the functions of tions of known proteins (for earlier discus- are not determined. Proteins published as, cellular components. In the future, sions of this see [1–3]). These are areas for example, ‘‘protein up-regulated in it’s going to be crucial to develop that I predict will involve the coordination cancer’’ or ‘‘protein up-regulated in cell computational analyses that can of very different kinds of advances by two type x’’ may have names, but no one predict the biological functions of distinct cohorts of future cell biologists. actually knows what they do. In a recent uncharacterised proteins. At the The first of these will be adept at study of the proteome of mitotic chromo- same time, we must not forget producing a huge range of information somes, amongst .4,000 proteins identi- those fundamental experimental from a wide variety of ‘‘omics’’ and other fied, my colleagues and I found just over skills needed to confirm the predic- high-throughput studies and able to inte- 300 proteins like this [4]. tions or send the analysts back to grate this information to predict how What I call ‘‘unknown proteins’’ can be the drawing board to devise new proteins function. The second will devise of two classes. First, many proteins are ones. low and high-throughput biochemical tests present in databases but have not yet to prove or disprove those predictions in appeared in publications or been given the laboratory. formal names. In our anal- What does it mean for cell biologists to Before I go further, I should define my ysis, we identified 260 of these unknown be working in a ‘‘post-genomic’’ era? This terms. First, the term ‘‘function’’ means proteins. The second class comprises is more a media term than a scientific different things to different groups of proteins whose existence is unsuspected, term, but I believe that what is commonly researchers: to a classical geneticist, for that is, of course, until someone describes understood by it is that we now work in an example, it might mean turning a fly’s them. For example, by using mass era where there is claimed to be a antennae into legs; to a biochemist it spectroscopy Crispin Miller’s group found complete—or near complete—parts list might mean forming a complex with a 346 novel peptides and proteins that were for the cells of most major experimental group of other proteins known to be smaller than the minimum cut-off size organisms. Of course this belief is not involved in a particular process such as used for identifying protein-coding entirely true. We all know that the genome regulation of expression, and to a by the Project [5]. projects have not given us the complete structural chemist it might mean removing Shortly thereafter, the same group iden- sequence of all human (or any other an electron from one chemical bond and tified 39 previously unsuspected genes metazoan) DNA. For example, we still transferring it to another. As a cell encoding novel short proteins in Schizo- do not have a contiguous sequence of the biologist, I am usually content that I know saccharomyces pombe [6]). Alternative splic- highly repetitive regions of something about the function of my ing, where a single pre-mRNA can yield in and around centromeres and at a protein if I know where it is in the cell, several—or many—functional mature number of other loci. Of course, there what other proteins it interacts with, and mRNAs that can encode a range of may not be very many (or even any) important undiscovered protein-coding Citation: Earnshaw WC (2013) Deducing Protein Function by Forensic Integrative Cell Biology. PLoS Biol 11(12): genes hiding in these regions, but it is e1001742. doi:10.1371/journal.pbio.1001742 likely that there are still a substantial Published December 17, 2013 number of unknown proteins encoded by Copyright: ß 2013 William C. Earnshaw. This is an open-access article distributed under the terms of the the human genome. Furthermore, as I will Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any discuss below, there are also likely to be medium, provided the original author and source are credited. quite a few proteins whose functions we Funding: WCE is a Principal Research Fellow of The Wellcome Trust (grant number 073915). The funders had think we know, but that have important or no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: I collaborate with Prof. Juri Rappsilber and Dr. Shinya Ohta, who are thanked in the acknowledgments section, but have no competing interests with any person or entity mentioned in the text. Essays articulate a specific perspective on a topic of broad interest to scientists. Abbreviations: APC/C, anaphase promoting complex/cyclosome. * E-mail: [email protected]

PLOS Biology | www.plosbiology.org 1 December 2013 | Volume 11 | Issue 12 | e1001742 related proteins, some of which may have figure this out other than by chance Borealin/Dasra, an essential member of the very different functions, provides another observation or coincidence? chromosomal passenger complex [18,19]. very rich source of previously unknown The development of multi-classifier proteins. The Power of Making Lists combinatorial proteomics for the analysis RNA-seq and proteomics studies are By far the simplest way to predict the of the mitotic chromosome proteome [4] is just now beginning to reveal the true function of an unknown protein is to show an example of applying this type of complexity of proteins that can be gener- that it has significant sequence relatedness approach to other sorts of datasets. In this ated from the ,20,000 protein-coding to another protein whose function is case, we combined data (arrayed as genes in humans. Beyond the discovery known. This is all well and good unless classifiers) from five different types of of unknown proteins, there is a world of your favourite protein is a ‘‘pioneer’’ (i.e., proteomics experiments, all based on the largely unexplored functional diversity protein of unknown function whose amino stable incorporation of labelled amino arising from post-translational modifica- acid sequence is unrelated to any protein acids in culture (SILAC) approach [20], tions of proteins. Given the example that of known function), in which case you are to look at the association of several simple phosphorylation can change the on your own without a tried and tested thousand proteins with isolated mitotic nuclear lamins from members of a highly recipe for how to proceed. I suggest that in chromosomes. These classifiers included insoluble structural framework underpin- the future the emerging art of predicting the estimated number of copies of each ning the nuclear envelope to soluble the function of unknown proteins (and protein in mitotic chromosomes; the ratio proteins in the mitotic cytoplasm [7,8] predicting novel functions for known of the amount of each protein in isolated and the vast numbers of modifications on proteins) will be the domain of ‘‘forensic chromosomes versus the amount in the proteins such as the histones [9], the scope integrative cell biologists’’ (using ‘‘foren- cytosol after removal of the chromosomes; for functional complexity induced by post- sic’’ as defined in Wikipedia as the the extent to which proteins present in translational modifications is truly vast and investigation of ‘‘situations after the fact, cytosol bound to isolated chromosomes will keep cell biologists busy for many and to establish what occurred based on that had been incubated in crude cytosol; years to come. collected evidence’’). These researchers the extent to which the abundance of Exploring the unknown is always excit- will develop methods for integrating a particular proteins on chromosomes was ing and challenging, but occasionally wide range of approaches to look at affected by loss of the condensin complex exploring the ‘‘known’’ can yield equally protein function, breaking each approach from chromosomes; and the extent to exciting dividends. In some cases it is down to its component ‘‘classifiers’’: lists of which the abundance of particular pro- straightforward to predict proteins that all the components of a system (or teins on chromosomes was affected by loss will have multiple functions—an obvious subsystem, as defined by [3]) identified of the SKA complex from chromosomes example lies in the correctly, but perhaps by a particular experiment, each of which [4]. We constructed classifiers in which the naively named histone acetyl transferases is attributed a score based on a predefined proteins from these different experiments and histone deacetylases and their ilk, criterion and then ranked in order of that were ranked according to their SILAC which likely add and remove post-transla- score. The key is then to figure out how to ratios (e.g., whether there was more or less tional modifications to very large numbers combine and compare these classifiers to of a given protein associated with chro- of cellular proteins in addition to histones. identify patterns that can be used to mosomes in a particular experiment) and Other examples of proteins that ‘‘moon- deduce functional relationships amongst subjected those classifiers to cluster anal- light’’ with a second function are not so groups of known proteins and use those ysis, which has been widely used to analyse easy to guess in advance. One example is patterns to characterise the behaviour of patterns in microarray experiments [21] cytochrome c, whose place in the world as specific uncharacterised proteins. and has since been put to many other uses a member of the electron transport chain This process of deducing protein func- [4,22,23]. In that cluster analysis, we was comfortably established and thought tion through ‘‘guilt by association’’ has consistently observed that members of to be fully understood when I first studied worked spectacularly well in identifying functionally related protein complexes undergraduate biochemistry in the novel cell cycle genes. One such analysis tend to cluster together. In particular, a 1970s—that was until Xiaodong Wang involved large numbers of cDNA libraries protein that was uncharacterized at that and his team discovered that it also has an from a wide range of different cell types and time, called C10orf104, clustered amongst essential role in assembly of the apopto- under different growth conditions, charac- the subunits of the anaphase promoting some during apoptotic cell death [10]. terising the distribution of observed cDNAs complex (APC/C), leading us to speculate Cytochrome c is not alone in moonlight- for known cell cycle genes across those that this protein was a component of that ing. Many of the lens crystallins, for libraries, and then looking for cDNAs complex (Figure 1). This speculation was example, are related to known enzymes encoding proteins of unknown function subsequently confirmed when C10orf104 that function in metabolic pathways and that exhibited a similar distribution across was identified by others as a novel stress responses [11]. How many other the many datasets [12]. This analysis component of the APC/C–APC16 [24– proteins, whose functions have been de- identified eight proteins likely to function 26]. termined with much less precision than in the cell cycle. Among them, CDCA1 was Our cluster analysis initially used proteo- that of cytochrome c, actually have other later identified as Nuf2, a member of the mics data based solely on protein abun- essential functions in cellular processes? At NDC80 complex responsible for attaching dance in mitotic chromosomes. In future present we have no way of telling; microtubules to kinetochores [13,14]; functional analyses, however, there need be however, I suspect and predict that the CDCA2 was later identified as Repo- no requirement to limit the inputs to one number is not small. It is not at all Man, an important targeting subunit of type of data. Any dataset where proteins unreasonable to imagine that one reason protein phosphatase 1 [15,16]; CDCA5, can be assigned a numerical score based on why humans have fewer genes than one was later identified as sororin, an important some sort of systematic measurement can might guess is that quite a few proteins do regulator of sister chromatid cohesion be used to construct a classifier. For more than one job. How will we ever [17]; and CDCA8 was later identified as example, bioinformatic analysis of the

PLOS Biology | www.plosbiology.org 2 December 2013 | Volume 11 | Issue 12 | e1001742 Figure 1. Use of a multi-classifier approach to predict the function of a novel protein. C10orf104 was a novel protein associated with mitotic chromosomes whose amino acid sequence offered no clues to its functions. When its association with isolated mitotic chromosomes was investigated using five different types of proteomic experiments, the protein was shown to cluster with APC/C components [4]. This led us to predict that C10orf104 might be a novel APC/C component—a prediction that was confirmed by studies carried out independently in three other groups [24–26]. The protein is now known as APC16. doi:10.1371/journal.pbio.1001742.g001 amino acid sequence can be used to devise a information beyond the obvious (and some- [25]). This is by no means a comprehensive classifier to score large numbers of proteins. times not so obvious) comparisons of list. Any way of analysing a large cohort of In our study of mitotic chromosomes, we sequences between different proteins. proteins that gives rise to a quantitative scanned the sequences of all proteins Beyond analysis of sequence motifs, comparative ranking can provide the basis recognised in mitotic chromosomes for other types of data that could be combined for a novel classifier (Figure 2). documented amino acid sequence motifs. to make classifiers include: comparative Where significant conceptual break- We then scored those motifs for their evolutionary (often called genome context) throughs will be required in the future will distribution between known cytoplasmic data (including homologous and ortholo- be in deciding how to combine these proteins and known chromosomal proteins. gous proteins and synteny relationships— classifiers to best elucidate functional rela- This enabled us to rank each motif see [1,2]); phenotypes in genome-wide tionships between proteins. Our original depending on how associated it was with RNAi screens (for two examples, see efforts simply hijacked an algorithm for chromosomal proteins versus cytoplasmic [23,27]); expression patterns in different cluster analysis that was originally devel- proteins. This ranked list for all of the cells and tissues or in response to various oped for comparing microarray data. We proteins in the dataset was then used as a stimuli or in different phases of the cell cycle subsequently exploited the power of ma- classifier, scoring proteins on how ‘‘chro- (for one of many possible examples, see chine learning including ‘‘random forest’’ mosome-like’’ their sequence motifs were [28]); localisation within cells and protein analysis to look for wider relationships [4]. This sort of analysis potentially interaction data, including two-hybrid between groups of proteins [4]. (Random expands the utility of amino acid sequence screens and pulldown experiments (see forest analysis is a sort of machine learning

PLOS Biology | www.plosbiology.org 3 December 2013 | Volume 11 | Issue 12 | e1001742 Figure 2. A multi-classifier approach to determining protein function. In this emerging approach to determine function, a variety of very different experimental approaches are used to make lists, called classifiers, in which proteins are given numerical scores according to the parameters being examined and ranked in numerical order. These ranked lists may then be combined by using clustering analysis, as in Figure 1, or analysed by other sorts of machine learning algorithms to look for common patterns that provide clues to functional relationships. doi:10.1371/journal.pbio.1001742.g002 based on the construction and use of addresses the amazingly wasteful approach All of this is fine, but what comes next? random decision trees to examine relation- that the scientific community currently The authors may, if they are both talented ships between data. For a helpful descrip- takes to high-throughput analyses. A typical and lucky, publish their paper in a high- tion, see http://en.wikipedia.org/wiki/ high-throughput analysis—let’s take the impact journal, thereby accruing valuable Random_forest.) In the future, novel example of a proteomics analysis—will ‘‘job points’’ for the first and senior algorithms and approaches should yield conduct a large number of experiments authors. The massive dataset is available more powerful ways to reduce these generating data on literally thousands of on line, of course, typically in a huge table multidimensional datasets to specific and proteins. To publish this analysis, the that can be accessed by the well informed, testable predictions of function—a kind researchers typically focus on some specific if they have sufficient motivation. But I of multidimensional guilt-by-association readout from the experiment—for exam- will bet (and this is definitely a statement analysis. ple, identification of one or several novel made with no data to back it up), that proteins whose localisation or function is of those huge tables are accessed very seldom Excavating Abandoned Datasets interest. They then write a paper describing and in many instances not at all: not by If all this explains why I believe the the screen, after going on to ‘‘work up’’ only other competing labs and probably not by future will belong to integrative cell one or a few of their novel hits, the latter the original lab either. One reason for this biologists, why did I say ‘‘forensic’’ integra- providing the functional detail demanded may be the perception that it is quite tive cell biologists? This additional term by most top journals. unlikely that a second or third high-impact

PLOS Biology | www.plosbiology.org 4 December 2013 | Volume 11 | Issue 12 | e1001742 paper is going to come from the original org/), DAVID (http://david.abcc.ncifcrf. The scientists of the future who will dataset. After all, if that screen has been gov/), String (http://string-db.org/), and validate the predictions of protein function already been described and published, Stitch (http://stitch.embl.de/); but al- generated by multidimensional multi-clas- what novelty can remain in there? though these are extremely valuable and sifier analyses will be biochemists and cell It seems to me that one consequence of constantly being improved, they are far biologists who are trained to sink their the recent fashion for high-throughput from comprehensive. teeth into a protein and not let go until analyses by proteomics, transcriptomics, I believe that this problem needs to be they have demonstrated its function in RNAi, localisation—you name it—is the developed as a priority by funding agencies vitro, in cells, and in vivo. It does no good to production of large numbers of orphan who are currently investing large sums in predict function if we cannot test our datasets that are accessible, but sit aban- open-access publishing but not (as far as I guess. Thus, biochemistry, which is re- doned, poorly curated and ignored on know) into constructing a systematic frame- garded in some quarters as old-fashioned, servers around the scientific world. I work in which all high-throughput data can must remain strong in coming years as a maintain that these datasets constitute a be made widely available in useful form. If critical area of endeavour that will confirm truly vast untapped resource that is waiting it is wasteful for grant-holders to pay to some predictions made by the integrative to be plundered by forensic integrative cell access articles whose contents were funded cell biologists while refuting others. In my biologists to generate huge numbers of by the same granting agencies, then it is opinion, biochemistry is not a hobby, like powerful classifiers that can add to our surely equally—if not more—wasteful for cloning, that one does on the side. To knowledge of functional relationships be- those agencies to fund expensive studies express, purify, and analyse proteins and tween proteins. I believe that we need to whose major data output is effectively protein complexes is an art that requires move from the present situation of single- mothballed after a single use. dedication, training, and practice. The use disposable datasets to a future of multi- training is rigorous and not for the timid, use data. ‘‘Same data, different question’’ but the outcome is well worth the effort as should be a watchword of future forensic My Kingdom for a Biochemist mechanisms predicted by classifier analy- integrative cell biologists. Admirable efforts The most brilliant integrative computa- ses of ‘‘omics’’ and high-throughput data- have been made in this direction for some tional analysis can produce a potentially sets are tested and the workings of cellular model organisms, notably, Drosophila (Fly- paradigm-altering hypothesis (otherwise machines elucidated. Base, http://flybase.org/), budding yeast known as a guess), but what it cannot do (Saccharomyces Genome Database, http:// is design, execute, and interpret the Acknowledgments www.yeastgenome.org/), Caenorhabditis ele- experiments that are going to prove or gans (Wormbase, http://www.wormbase. disprove that hypothesis. Despite all the I thank Juri Rappsilber, Shinya Ohta, and org/#01-23-6), and zebrafish (ZFIN: The wonderful high-tech, massively parallel Jimi-Carlo Bukowski-Wills for numerous stim- ulating discussions that led to our joint Zebrafish Model Organism Database, analyses that are presently employing the development of the multi-classifier approach http://zfin.org/). There are also servers robots of the biological world, there will to studying protein function. The experiments that attempt to link multiple datasets, always be a place for that humble and analysis shown in Figure 1 were performed including GO (http://www.geneontology. workhorse, the biochemist. by Shinya Ohta.

References 1. Roberts RJ (2004) Identifying protein function—a 9. Bannister AJ, Kouzarides T (2011) Regulation of complex, is required for sister chromatid cohesion call for community action. PLoS Biol 2: e42. chromatin by histone modifications. Cell Res 21: in vertebrates. Mol Cell 18: 185–200. doi:10.1371/journal.pbio.0020042 381–395. 18. Gassmann R, Carvalho A, Henzing AJ, Ruchaud 2. Osterman A, Overbeek R (2003) Missing genes 10. Liu X, Kim CN, Yang J, Jemmerson R, Wang X S, Hudson DF, et al. (2004) Borealin: A novel in metabolic pathways: a comparative genomics (1996) Induction of apoptotic program in cell-free chromosomal passenger required for stability of approach. Curr Opin Chem Biol 7: 238– extracts: requirement for dATP and cytochrome the bipolar mitotic spindle. J Cell Biol 166: 179– 251. C. Cell 86: 147–157. 191. 3. Overbeek R, Begley T, Butler RM, Choudhuri 11. Piatigorsky J, Wistow GJ (1989) Enzyme/crystal- 19. Sampath SC, Ohi R, Leismann O, Salic A, JV, Chuang HY, et al. (2005) The subsystems lins: gene sharing as an evolutionary strategy. Cell Pozniakovski A, et al. (2004) The chromosomal approach to genome annotation and its use in the 57: 197–199. passenger complex is required for chromatin- project to annotate 1000 genomes. Nucleic Acids 12. Walker MG (2001) Drug target discovery by gene induced microtubule stabilization and spindle Res 33: 5691–5702. expression analysis: cell cycle genes. Curr Cancer assembly. Cell 118: 187–202. 4. Ohta S, Bukowski-Wills JC, Sanchez-Pulido L, Drug Targets 1: 73–83. 20. Ong SE, Blagoev B, Kratchmarova I, Kristensen Alves Fde L, Wood L, et al. (2010) The protein 13. Wigge PA, Kilmartin JV (2001) The Ndc80p DB, Steen H, et al. (2002) Stable isotope composition of mitotic chromosomes determined complex from Saccharomyces cerevisiae contains labeling by amino acids in cell culture, SILAC, using multiclassifier combinatorial proteomics. conserved centromere components and has a as a simple and accurate approach to Cell 142: 810–821. function in chromosome segregation. J Cell Biol expression proteomics. Mol Cell Proteomics 1: 5. Bitton DA, Smith DL, Connolly Y, Scutt PJ, 152. 376–386. Miller CJ (2010) An integrated mass-spectrometry 14. McCleland ML, Gardner RD, Kallio MJ, Daum 21. Eisen MB, Spellman PT, Brown PO, Botstein D pipeline identifies novel protein coding-regions in JR, Gorbsky GJ, et al. (2003) The highly (1998) Cluster analysis and display of genome- the human genome. PLoS One 5: e8949. conserved Ndc80 complex is required for kinet- wide expression patterns. Proc Natl Acad Sci U S A doi:10.1371/journal.pone.0008949 ochore assembly, chromosome congression, and 95: 14863–14868. 6. Bitton DA, Wood V, Scutt PJ, Grallert A, Yates spindle checkpoint activity. Genes Dev 17: 101– 22. Theis M, Slabicki M, Junqueira M, Paszkowski- T, et al. (2011) Augmented annotation of the 114. Rogacz M, Sontheimer J, et al. (2009) Compar- Schizosaccharomyces pombe genome reveals 15. Trinkle-Mulcahy L, Andersen J, Lam YW, ative profiling identifies C13orf3 as a component additional genes required for growth and viability. Moorhead G, Mann M, et al. (2006) Repo-Man of the Ska complex required for mammalian cell Genetics 187: 1207–1217. recruits PP1gamma to chromatin and is essential division. Embo J 28: 1453–1465. 7. Gerace L, Blobel G (1980) The nuclear envelope for cell viability. J Cell Biol 172: 679–692. 23. Neumann B, Walter T, Heriche JK, Bulkescher J, lamina is reversibly depolymerized during mitosis. 16. Vagnarelli P, Earnshaw WC (2012) Repo-Man- Erfle H, et al. (2010) Phenotypic profiling of Cell 19: 277–287. PP1: a link between chromatin remodelling and the human genome by time-lapse microscopy 8. Heald R, McKeon F (1990) Mutations of nuclear envelope reassembly. Nucleus 3: 138– reveals cell division genes. Nature 464: 721– phosphorylation sites in lamin A that prevent 142. 727. nuclear lamina disassembly in mitotis. Cell 61: 17. Rankin S, Ayad NG, Kirschner MW (2005) 24. Kops GJ, van der Voet M, Manak MS, van Osch 579–589. Sororin, a substrate of the anaphase-promoting MH,NainiSM,etal.(2010)APC16isa

PLOS Biology | www.plosbiology.org 5 December 2013 | Volume 11 | Issue 12 | e1001742 conserved subunit of the anaphase-promoting 26. Hubner NC, Bird AW, Cox J, Splettstoesser B, profiling of cell division in human tissue culture complex/cyclosome. J Cell Sci 123: 1623–1633. Bandilla P, et al. (2010) Quantitative proteomics cells. Nat Cell Biol 9: 1401–1412. 25. Hutchins JR, Toyoda Y, Hegemann B, Poser I, combined with BAC TransgeneOmics reveals in 28. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore Heriche JK, et al. (2010) Systematic analysis of vivo protein interactions. J Cell Biol 189: 739–754. T, et al. (1999) The transcriptional program in human protein complexes identifies chromosome 27. Kittler R, Pelletier L, Heninger AK, Slabicki M, the response of human fibroblasts to serum. segregation proteins. Science 328: 593–599. Theis M, et al. (2007) Genome-scale RNAi Science 283: 83–87.

PLOS Biology | www.plosbiology.org 6 December 2013 | Volume 11 | Issue 12 | e1001742