The Pennsylvania State University
The Graduate School

UNCOVERING HIDDEN GENOMIC FEATURES USING COMPUTATIONAL APPROACHES

A Dissertation in Computer Science and Engineering
by Wen-Yu Chung

© 2009 Wen-Yu Chung

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
May 2009

The dissertation of Wen-Yu Chung was reviewed and approved* by the following:

Webb Miller
Professor of Computer Science and Engineering and Biology
Dissertation Co-Advisor, Chair of Committee

Anton Nekrutenko
Assistant Professor of Biochemistry and Molecular Biology
Dissertation Co-Advisor

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

Padma Raghavan
Professor of Computer Science and Engineering

Réka Albert
Associate Professor of Physics

*Signatures are on file in the Graduate School.

Abstract

Modern genetic studies are heavily dependent on analyses of whole genome sequences that have only become available in the past decade. Technologies such as microarrays and next-generation sequencing can associate quantitative expression patterns of genes with their genomic sequences and allow the study of changes at the genome-wide level or the comparison of multiple genomes. Sequences plus expression information allow us to capture an extensive and realistic overview of any given genome. Novel mathematical and computational methods are essential for managing and mining information from these large-scale data sets. I have undertaken three projects that try to answer the following biological questions using computational approaches: (1) how do duplicate genes diverge in a co-expression network? (2) how many vertebrate genes are there with alternative open reading frames? (3) how can we delineate whole genome expression patterns using new sequencing technology?
Within each project, I have developed computational methods and applied them to targeted data sets, demonstrating the feasibility and power of these new bioinformatic approaches and addressing questions of biological significance.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction
  1.1 Global interactions and constraints
  1.2 Dissertation outline

Chapter 2 Rapid and asymmetric divergence of duplicate genes in the human gene coexpression network
  2.1 Background
  2.2 Results and discussion
    2.2.1 Description of the network
    2.2.2 Differences between duplicate genes and singletons
    2.2.3 Duplicate genes rapidly lose shared coexpressed partners
    2.2.4 Acquisition of new coexpressed partners by duplicate genes
    2.2.5 Asymmetric expression divergence of duplicate genes
    2.2.6 Robustness of the network
  2.3 Conclusion
  2.4 Methods
    2.4.1 Network construction
    2.4.2 Identification of duplicate genes
    2.4.3 Permutation tests
    2.4.4 Asymmetry analysis
    2.4.5 Robustness analysis of the network

Chapter 3 A first look at ARFome: dual-coding genes in mammalian genomes
  3.1 Introduction
  3.2 Results and discussion
    3.2.1 Dual coding is virtually impossible by chance
    3.2.2 Defining mammalian ARFs
    3.2.3 Analysis of nucleotide substitutions suggests functionality of ARFs
    3.2.4 What may be the potential function of ARF-encoded proteins?
    3.2.5 Conclusions
  3.3 Materials and methods
    3.3.1 CCRT algorithm
    3.3.2 Codon model for overlapping reading frames

Chapter 4 Transcriptome profiling by next-generation sequencing technology
  4.1 Background
  4.2 Results and discussion
    4.2.1 Sequencing result and quality
    4.2.2 Mapping reads to the mouse transcriptomes
    4.2.3 Identifying novel splice forms
  4.3 Conclusion

Chapter 5 Conclusion
  5.1 Summary
  5.2 Future research interests
Appendix A Supplementary materials for Chapter 2
Appendix B Supplementary materials for Chapter 3
Bibliography

List of Figures

2.1 Degree distribution of the studied network (T ≥ 7 and R ≥ 0.7).

2.2 The relationship between clustering coefficient c and node degree k for (A) all genes, (B) ubiquitously expressed genes, and (C) nonubiquitously expressed genes. Each point represents an average value for 100 genes.

2.3 The number of duplicate genes and singletons in every 500 genes ranked by degree. Duplicate genes are marked by triangles and singletons are marked by circles. The genes with the highest degree are shown at the left side of the figure.

2.4 The schematic representation of duplicate gene evolution (A) prior to duplication event, (B) immediately after duplication, (C, D, E) after some time following gene duplication. The ancestral singleton gene is shown with a crossed line, duplicate genes are in black, shared ancestral partners are in grey, unique ancestral partners are in stripes, and unique acquired partners are in white; ns, n1, and n2 are the numbers of partners for a singleton, first duplicate, and second duplicate, respectively; n12 is the number of shared partners for a duplicate gene pair.

2.5 The change in the fraction of shared partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. Dashed line indicates the fraction of shared partners averaged among 1000 randomly selected pairs of singletons (random selection process was repeated 1000 times).

2.6 The change in the total number of coexpressed partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. The lower dashed line is the average number of partners for a singleton and the upper dashed line is twice the average number of partners for a singleton.
2.7 Asymmetric divergence in gene expression. (A) Plot of degree of one gene versus degree of another gene for 1,547 duplicate gene pairs with KS < 2 (inset shows pairs with both degrees below 200). (B) The same plot after numerical simulation of symmetric divergence with equal probability of loss and gain of coexpressed partners (P = 0.5). (C) The relationship between the difference in degree and time since duplication (measured by KS) for a pair of duplicate genes. Each point represents an average value for 40 duplicate gene pairs.

2.8 The results of in silico perturbations of the network. The effect of random removal of genes (error) on (A) the relative size of a giant cluster and (B) the average shortest path length. The effect of degree-based removal of genes (attack) on (C) the relative size of a giant cluster and (D) the average shortest path length (inset shows the fraction of edges removed). Singletons are marked by circles, duplicate genes by triangles, and all genes by squares.

3.1 Three known examples of mammalian dual-coding genes. (A) A transcript of the Gnas1 gene contains two reading frames and produces two structurally unrelated proteins, XLαs and ALEX, by differential utilization of translation start sites. (B) A newly transcribed XBP1 mRNA can only produce protein XBP1U from ORF A. Removal of a 26-bp spacer (yellow rectangle) joins the beginning of ORF A with ORF B and translates into a different product called XBP1S. (C) Ink4a generates two splice variants that use different reading frames within exon E2 to produce the proteins p16Ink4a and p19ARF.

3.2 mRNAs from human and mouse are aligned. Mouse mRNAs are indicated by lowercase letters. Each of the two mRNAs contains an annotated coding region (white boxes). Our algorithm looks for ARFs (black boxes) that are shifted one (shown) or two nucleotides relative to the annotated frame. The locations of the ARFs must be conserved between the species.
Specifically, the ARFs in the two species must overlap for at least 500 bp.

4.1 An example of the quality score distribution. The x-axis is the position along the read (an absolute position for fixed-length reads, such as Illumina/Solexa and SOLiD reads, or a percentage of read length, such as for Roche/454 Life Sciences reads) and the y-axis is the base-calling score. The quality score distribution showed that the base-calling scores dropped below 20 after read position 28.

4.2 A hypothetical example of the strategy used to obtain novel exon junctions. (A) Gene A had four exons, E1, E2, E3, and E4. Dashed lines connect all possible order-respecting exon-exon combinations. (B) Two transcripts, T1 and T2, were alternatively spliced variants of gene A. T1 had E1, E2, and E3. T2 had E1, E2, and E4. (C) For gene A, the junctions between E1 and E3, E1 and E4, and E3 and E4 were novel junctions, which were not from known transcripts. (D) 20 bp on either side of every possible junction were taken and joined to form junction sequences. For gene A, there were 6 possible junctions in total, of which 3 were known junctions and 3 were novel junctions.

4.3 Examples of paired reads mapped to novel junctions. (A) Invalid mapping: one end of a paired-end read mapped at J13; the other end mapped at E2. It is not reasonable to have a splicing form of this kind. (B-D) All valid mappings. (B) Valid mapping: one end of a paired-end read mapped at a novel junction (J13); the other end mapped at an exon (E3). This indicates a novel splicing form of E1 and E3. Another possible situation is that the other end mapped at an exon that is not part of the junction (e.g., E4). This indicates a novel splicing form of E1, E3, and E4.
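
The junction-enumeration strategy described in the caption of Figure 4.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: the function names (all_junctions, known_junctions, junction_sequence) and the placeholder exon sequences are hypothetical, and a known junction is assumed to mean a pair of exons that are adjacent in at least one annotated transcript. Only the gene A layout, the two transcripts, and the 20 bp flank come from the caption.

```python
from itertools import combinations


def all_junctions(exons):
    """Every order-respecting exon pair (upstream, downstream) of one gene."""
    return list(combinations(exons, 2))


def known_junctions(transcripts):
    """Pairs of exons that are adjacent in at least one known transcript."""
    known = set()
    for exon_order in transcripts.values():
        known.update(zip(exon_order, exon_order[1:]))
    return known


def junction_sequence(upstream_seq, downstream_seq, flank=20):
    """Join the last `flank` bases of one exon to the first `flank` of the next."""
    return upstream_seq[-flank:] + downstream_seq[:flank]


# Gene A from the caption: four exons and two known transcripts.
exons = ["E1", "E2", "E3", "E4"]
transcripts = {"T1": ["E1", "E2", "E3"], "T2": ["E1", "E2", "E4"]}

candidates = all_junctions(exons)
known = known_junctions(transcripts)
novel = [pair for pair in candidates if pair not in known]

# Placeholder sequences (hypothetical, 45 bp each) to show the 20 + 20 bp join.
sequences = {name: base * 45 for name, base in zip(exons, "ACGT")}
novel_sequences = {
    pair: junction_sequence(sequences[pair[0]], sequences[pair[1]])
    for pair in novel
}
```

For gene A this yields six candidate junctions, of which three (E1-E2, E2-E3, E2-E4) are known and three (E1-E3, E1-E4, E3-E4) are novel, in agreement with the caption. Exons shorter than the flank would contribute fewer than 20 bp in this sketch; how such cases were handled in the original work is not stated here.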