"An Overview of Gene Identification: Approaches, Strategies, and Considerations"


UNIT 4.1

Modern biology has officially ushered in a new era with the completion of the sequencing of the human genome in April 2003. While often erroneously called the “post-genome” era, this milestone truly marks the beginning of the “genome era,” a time in which the availability of sequence data for many genomes will have a significant effect on how science is performed in the 21st century. While complete human sequence data is now available at an overall accuracy of 99.99%, the mere availability of all of these As, Cs, Ts, and Gs still does not answer some of the basic questions regarding the human genome: how many genes actually comprise the genome, how many of these genes code for multiple gene products, and where those genes actually lie along the complement of human chromosomes.

Current estimates, based on preliminary analyses of the draft sequence, place the number of human genes at ∼30,000 (International Human Genome Sequencing Consortium, 2001). This number is in stark contrast to previously suggested estimates, which had ranged as high as 140,000. A number that is in the 30,000 range brings into question the one-gene, one-protein hypothesis, underscoring the importance of processes such as alternative splicing in the generation of multiple gene products from a single gene.

Finding all of the genes and the positions of those genes within the human genome sequence (and in other model organism genome sequences as well) requires the development and application of robust computational methods, some of which are listed in Table 4.1.1. These methods provide the first, best guess not only of the number and position of all genes, but of the structure of each individual gene as well. These predictions, brought under the banner of “sequence-based annotation,” help to increase the intrinsic value of genome sequence data found within the public databases.
REMEMBERING BIOLOGY IN DEDUCING GENE STRUCTURE

In considering the problem of gene identification, it is important to briefly review the basic biology underlying what will become, in essence, a mathematical problem (Fig. 4.1.1). At the DNA level, upstream of a given eukaryotic gene there are promoters and other regulatory elements that control the transcription of that gene. The gene itself is discontinuous, being composed of both introns and exons. Once this stretch of DNA is transcribed into an RNA molecule, both ends of the RNA are modified, with the 5′ end being capped and a poly(A) signal being placed at the 3′ end. The RNA molecule reaches maturity when the introns are spliced out, based on short consensus sequences found both at the intron-exon boundaries and within the introns themselves. Once splicing has occurred and the start and stop codons have been established, the mature mRNA is transported through a nuclear pore into the cytoplasm, at which point translation can take place.

While the process of moving from DNA to protein is obviously more complex in eukaryotes than in prokaryotes, the mere fact that it can be described in its entirety in eukaryotes would lead one to believe that predictions can confidently be made as to the exact positions of introns and exons. Unfortunately, the signals that control the process of moving from the DNA level to the protein level are not very well defined, precluding their use as foolproof indicators of gene structure. For example, upwards of 70% of promoter regions contain a TATA box, but because the remainder do not, the presence (or absence) of a TATA box in and of itself cannot be used to assess whether a region is a promoter. Similarly, during end modification, the poly(A) tail may be present or absent, or may not contain the canonical AATAAA. Adding to these complications is the fact that an open reading frame is required but is not sufficient to identify a region as an exon. Given these and other considerations, there is at present no straightforward method that will allow 100% confidence in the prediction of introns or exons. It is true, however, that the availability of finished genome sequences from a variety of organisms has led to a better understanding of gene structure and is, in turn, leading to the development of better methods for identifying genes in what is, essentially, “anonymous” DNA.

Finding Genes. Contributed by Andreas D. Baxevanis. Current Protocols in Bioinformatics (2004) 4.1.1-4.1.9. Copyright © 2004 by John Wiley & Sons, Inc. Supplement 6.

Table 4.1.1 Web Sites for Common Gene Finding Programs

Web site                                    URL
Banbury Cross                               http://igs-server.cnrs-mrs.fr/igs/banbury
The Encyclopedia of DNA elements (ENCODE)   http://www.genome.gov/encode
FGENES                                      http://genomic.sanger.ac.uk/gf/gf.shtml
FirstEF                                     http://rulai.cshl.org/tools/FirstEF
geneid (UNIT 4.3)                           http://www1.imim.es/geneid.html
GeneMachine                                 http://research.nhgri.nih.gov/genemachine/
GeneMark (UNITS 4.5 & 4.6)                  http://opal.biology.gatech.edu/GeneMark/
GeneParser                                  http://beagle.colorado.edu/~eesnyder/GeneParser.html
GENSCAN                                     http://genes.mit.edu/GENSCAN.html
Genotator                                   http://www.fruitfly.org/~nomi/genotator/
GlimmerM (UNIT 4.4)                         http://www.tigr.org/software/glimmerm/
GRAIL (UNIT 4.9)                            http://compbio.ornl.gov/tools/index.shtml
GRAIL-EXP (UNIT 4.9)                        http://compbio.ornl.gov/grailexp/
HMMgene                                     http://www.cbs.dtu.dk/services/HMMgene/
MZEF (UNIT 4.2)                             http://www.cshl.org/genefinder
PROCRUSTES                                  http://www-hto.usc.edu/software/procrustes
RepeatMasker (UNIT 4.10)                    http://ftp.genome.washington.edu/RM/RepeatMasker.html
Sputnik                                     http://rast.abajian.com/sputnik/
                                            http://genes.cse.wustl.edu

[Diagram: DNA (promoter, exons 1-4, introns 1-3); transcription to RNA; end modification (5′ cap, poly(A)); splicing at GU...AG intron boundaries; mature mRNA with cap, start, stop, and poly(A); export from nucleus to cytoplasm; translation.]

Figure 4.1.1 The central dogma of molecular biology. Proceeding from the DNA through the RNA to the protein level, various sequence features and modifications can be identified that can be used in the computational deduction of gene structure. These include the presence of promoter and regulatory regions, intron-exon boundaries, and both start and stop signals. Unfortunately, these signals are not always present, and when present may not always be in the same form or context. The reader is referred to the text for greater detail.

CATEGORIZING THE METHODS

Briefly, gene-finding strategies can be grouped into three major categories. Content-based methods rely on the overall, bulk properties of a sequence in making their determinations. Characteristics considered here include the frequency at which particular codons are used, the periodicity of repeats, and the compositional complexity of the sequence. Because different organisms use synonymous codons with different frequency, such clues can provide insight to help determine which regions are more likely to be exons. In site-based methods, by contrast, the focus is on the presence or absence of a specific sequence, pattern, or consensus. These methods are used to detect features such as donor and acceptor splice sites, binding sites for transcription factors, poly(A) tracts, and start and stop codons. Finally, comparative methods make determinations based on sequence homology. Here, translated sequences are subjected to database searches against protein sequences (e.g., BLASTX; UNIT 3.4) in order to see whether a previously characterized coding region corresponds to a region in the query sequence.
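The difference between content-based and site-based signals can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the method used by any program in Table 4.1.1: the function names `codon_frequencies` and `find_sites` and the toy sequence are hypothetical, the content signal is reduced to raw codon frequencies in a single reading frame, and the site signal is reduced to exact matching of the GT and AG dinucleotides that border introns on the DNA strand.

```python
from collections import Counter


def codon_frequencies(seq, frame=0):
    """Content-based signal: relative frequency of each codon in one frame.

    Hypothetical helper for illustration; real gene finders compare such
    frequencies against organism-specific codon usage tables.
    """
    seq = seq.upper()
    # Take only complete triplets starting at the requested frame offset.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def find_sites(seq, pattern):
    """Site-based signal: 0-based start positions of an exact pattern.

    Overlapping matches are reported. Exact matching is a simplification;
    real splice-site detectors score a wider consensus around the site.
    """
    seq = seq.upper()
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        start = seq.find(pattern, start + 1)
    return positions


# Toy sequence invented for this example (not real genomic data).
toy = "ATGGCCGTAAGTCTTTCAGGCCTAA"

donor_candidates = find_sites(toy, "GT")     # possible intron starts
acceptor_candidates = find_sites(toy, "AG")  # possible intron ends
frequencies = codon_frequencies(toy)

print(donor_candidates)     # [6, 10]
print(acceptor_candidates)  # [9, 17]
print(frequencies["ATG"])   # 0.125
```

Even in this 25-base toy sequence, every GT and AG is a candidate splice site, which illustrates why a site signal on its own produces many false positives and why it is usually combined with content or homology evidence.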
While the comparative approach is conceptually the most straightforward of the three, it is restrictive in that most newly discovered genes do not have gene products that match anything in the protein databases. Also, the modular nature of proteins and the fact that there are only a limited number of protein motifs (Chothia and Lesk, 1986) make it difficult to predict anything more than just exonic regions in this way. The reader is referred to a number of excellent reviews detailing the theoretical underpinnings of these various classes of methods (Claverie, 1997a,b, 1998; Guigó, 1997; Snyder and Stormo, 1997; Stormo, 2000; Rogic et al., 2001). While many of the gene prediction methods belong strictly to one of these three classes of methods, most of those that will be discussed in this chapter combine the strength of different classes of methods in order to optimize their predictions.

HOW WELL DO THE METHODS WORK?

Given the complexity of the problem at hand and the range of approaches for tackling it, it is important for investigators to appreciate when and how each particular method should be applied. A recurring theme in this chapter will be the fact that each method will perform differently depending on the nature of the data. Put another way, while one method may be best for human finished sequence, another may be better for sequences (whether they be finished or unfinished) from another organism. The reader will also notice that the various methods in this chapter produce different types of results: in some cases, lists of putative exons are returned but these exons are not in a genomic context; in other cases, complete gene structures are predicted, but possibly at a cost of less reliable individual exon predictions.

Returning to the cautionary note that different methods will perform better or worse, depending on the system being examined, it becomes important to be able to quantify the performance of each of these algorithms. Several studies have systematically examined the rigor of these methods using a variety of test data sets (Burset and Guigó, 1996; Claverie, 1997a; Snyder and Stormo, 1997; Stormo, 2000; Rogic et al., 2001).
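As a rough illustration of what quantifying performance can look like, the sketch below compares predicted exons against annotated exons at the nucleotide level. It is a minimal example under stated assumptions: the text above does not define any particular measure, so the two ratios used here (sensitivity as TP/(TP+FN) and specificity as TP/(TP+FP), a convention common in the gene-finding literature) are assumed rather than taken from this unit, and the function names and coordinates are hypothetical.

```python
def covered_positions(exons):
    """Expand half-open (start, end) intervals into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_level_accuracy(annotated_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed definitions, not taken from the text:
      sensitivity = TP / (TP + FN), fraction of true coding bases found
      specificity = TP / (TP + FP), fraction of predicted bases that are true
    """
    truth = covered_positions(annotated_exons)
    predicted = covered_positions(predicted_exons)
    true_pos = len(truth & predicted)
    false_neg = len(truth - predicted)
    false_pos = len(predicted - truth)
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    specificity = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, specificity


# Invented coordinates for illustration only.
annotated = [(10, 20), (40, 60)]           # 30 coding bases
predicted = [(12, 20), (40, 50), (70, 80)]  # 28 predicted bases, 18 correct

sn, sp = nucleotide_level_accuracy(annotated, predicted)
print(round(sn, 3))  # 0.6
print(round(sp, 3))  # 0.643
```

The same comparison can be repeated at the level of whole exons (requiring both boundaries to match), which is typically a stricter test than the per-base comparison sketched here.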