Comparative Genomics: Comparative Gene Prediction in the Human Genome

Maribel Hernandez Rosales

What is Comparative Genomics?

Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome. Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans. Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.

What are the comparative genome sizes of humans and other organisms being studied?

Organism                              Estimated size        Estimated gene number   Average gene density        Chromosome number
Homo sapiens (human)                  2,900 million bases   ~30,000                 1 gene per 100,000 bases    46
Rattus norvegicus (rat)               2,750 million bases   ~30,000                 1 gene per 100,000 bases    42
Mus musculus (mouse)                  2,500 million bases   ~30,000                 1 gene per 100,000 bases    40
Drosophila melanogaster (fruit fly)   180 million bases     13,600                  1 gene per 9,000 bases      8
Arabidopsis thaliana (plant)          125 million bases     25,500                  1 gene per 4,000 bases      5
Caenorhabditis elegans (roundworm)    97 million bases      19,100                  1 gene per 5,000 bases      6
Saccharomyces cerevisiae (yeast)      12 million bases      6,300                   1 gene per 2,000 bases      16
Escherichia coli (bacteria)           4.7 million bases     3,200                   1 gene per 1,400 bases      1
H. influenzae (bacteria)              1.8 million bases     1,700                   1 gene per 1,000 bases      1

Eukaryotic Gene Finding

Comparative Gene Prediction

GenScan: ab initio gene prediction.
GeneWise, Procrustes: homology guided.
Rosetta, SGP1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated.
GenomeScan: ab initio modified by BLAST homologies.
SGP-2, TwinScan, SLAM, DoubleScan: modification of the GenScan scoring schema to incorporate similarity to known proteins.

GenScan

A general probabilistic model for the gene structure of human genomic sequences.
Gene identification by identifying complete exon/intron structures of genes in genomic DNA.
Includes the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands.
Markov model of coding regions: predictions do not depend on the presence of a similar gene in the protein sequence databases, and they complement the information provided by homology-based gene identification methods (BLASTX).
Maximal Dependence Decomposition (MDD): new statistical model of donor and acceptor splice sites which captures important dependencies between signal positions.

[Figure: Pre-mRNA splicing (assembly of spliceosome, catalysis). Labels: 5’ splice signal (U1 snRNP), branch signal (U2 snRNP), polyY, 3’ splice signal (U2AF65, U2AF35), exonic enhancers and repressors, intronic enhancers and repressors, SR proteins, intron definition, exon definition.]
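
The idea of scoring a region with a Markov model of coding sequence, mentioned above, can be illustrated with a deliberately small sketch. This is not GenScan's model: it assumes a first-order (one base of context) Markov chain, toy training sequences, and add-one pseudocounts, and the function names `train_markov` and `log_odds` are hypothetical. It only shows how a log-odds score can favour a "coding-like" sequence without consulting any protein database.

```python
import math

def train_markov(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | previous base).

    Assumes upper-case sequences over `alphabet`; pseudocounts avoid zero
    probabilities for transitions never seen in the training data.
    """
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in alphabet}
    return probs

def log_odds(seq, coding, background):
    """Sum of log P_coding / P_background over consecutive base pairs.

    Positive values mean the sequence looks more like the coding training set.
    """
    return sum(
        math.log(coding[p][n] / background[p][n])
        for p, n in zip(seq, seq[1:])
    )

# Toy training data (illustrative only, not real genes).
coding_model = train_markov(["GCGCGCGC"])
background_model = train_markov(["ATATATAT"])
```

With these toy models, `log_odds("GCGC", coding_model, background_model)` is positive and `log_odds("ATAT", coding_model, background_model)` is negative. A realistic gene finder would use a higher-order, reading-frame-aware model trained on annotated genes.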

Hidden semi-Markov Model (HMM)

GenScan HMM

N - intergenic region
P - promoter
F - 5’ untranslated region
Esngl - single exon (intronless) (translation start -> stop codon)
Einit - initial exon (translation start -> donor splice site)
Ek - phase k internal exon (acceptor splice site -> donor splice site)
Eterm - terminal exon (acceptor splice site -> stop codon)
Ik - phase k intron: 0 - between codons; 1 - after the first base of a codon; 2 - after the second base of a codon

GenScan Features

Model both strands at once
Each state may output a string of symbols (according to some probability distribution).
Explicit intron/exon length modeling
Advanced splice site modeling
Parameters learned from annotated genes
Prediction of multiple genes in a sequence (partial or complete).

GenomeScan

We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.
Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan).
Focus on the ‘typical case’ when homologous but not identical proteins are available.
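
The intron phase definition in the GenScan state list above (0 between codons, 1 after the first base of a codon, 2 after the second) reduces to simple arithmetic on coding lengths. The sketch below is an illustration under stated assumptions, not code from GenScan: the function name `intron_phases` is hypothetical, and the input is assumed to be the coding length of each exon in transcript order, starting at the translation start.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase of each intron, given coding lengths of the exons.

    The phase of an intron is the number of coding bases before it, modulo 3:
    0 = intron lies between codons, 1 = after the first base of a codon,
    2 = after the second base of a codon.
    """
    phases = []
    coding_bases_so_far = 0
    # The last exon is not followed by an intron, so it is skipped.
    for length in coding_exon_lengths[:-1]:
        coding_bases_so_far += length
        phases.append(coding_bases_so_far % 3)
    return phases

# Hypothetical three-exon gene with 100, 50 and 30 coding bases.
example_phases = intron_phases([100, 50, 30])
```

Here the first intron follows 100 coding bases (phase 1) and the second follows 150 (phase 0), so `example_phases` is `[1, 0]`. Tracking phase this way is what lets a model require that consecutive exons join into an intact reading frame.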

[Figure: Ab initio modified by BLAST homologies]

GeneWise

Motivation: use a good DB of the protein world (PFAM) to help us annotate genomic DNA
GeneWise algorithm aligns a profile HMM directly to the DNA

GeneWise

Start with a PFAM domain HMM
Replace AA emissions with codon emissions
Allow for sequencing errors (deletions/insertions)
Add a 3-state intron model

[Figure: GeneWise Model]

[Figure: GeneWise Intron Model. Labels: 5’ site, central spacer, PY tract, 3’ site]

GeneWise Features & Problems

“Best” alignment of DNA to protein domain
Alignment gives exact exon-intron boundaries
Parameters learned from species-specific statistics
Only provides partial prediction, and only where the homology lies
Does not find “more” genes
Pseudogenes, retrotransposons picked up
CPU intensive
Solution: pre-filter with BLAST

Rosetta

Gene prediction is separated from sequence alignment. First, the alignment is obtained between two homologous genomic sequences using the global sequence alignment program Glass. Then, gene structures (splice sites, exon number and length, etc.) are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.

Syntenic Gene Prediction

This approach does not require the comparison of two homologous genomic sequences.
A query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms. Gene prediction and sequence alignment are separated.

SGP-2

[Figure: SGP-2 pipeline. Labels: query sequence, tblastx, HSPs, HSP projections, geneid exons, SGP exons.]

Gene prediction programs predict a large number of genes

Almost every mouse gene has a human orthologue counterpart

                            TwinScan   SGP
total                       48462      47055
novel                       17562      21942
multiexonic                 10987      12158
long, no low complexity     3171       4543

[Figure: the remaining values (954, 2983, 317, 637, 1931, 1052) are labelled "human ts", "human sgp", and "intron aligned"; their arrangement is not recoverable from the extracted text.]

Orthologous human-mouse genes have conserved exonic structure.

85% of the orthologous pairs have an identical number of exons
91% of the orthologous exons have identical length
99.5% of the orthologous exons have identical phase
There are a few cases of intron insertion/deletion (22)

Summary

Genes are complex structures which are difficult to predict with the required level of accuracy/confidence.
Different approaches to gene finding improve the accuracy/confidence of the predictions:
Ab initio: GenScan
Ab initio modified by BLAST homologies: GenomeScan
Homology guided: GeneWise
Gene prediction and sequence alignment separately: Rosetta
Ab initio with similarity in known proteins: SGP-2

Thank you for your attention!