Informatics and Clinical Genome Sequencing: Opening the Black Box


©American College of Medical Genetics and Genomics

REVIEW

Sowmiya Moorthie, PhD1, Alison Hall, MA1 and Caroline F. Wright, PhD1,2

1PHG Foundation, Cambridge, UK; 2Current address: Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK. Correspondence: Caroline F. Wright ([email protected])

Submitted 6 June 2011; accepted 6 August 2012; advance publication online 13 September 2012. doi:10.1038/gim.2012.116

GENETICS in MEDICINE | Volume 15 | Number 3 | March 2013

Adoption of whole-genome sequencing as a routine biomedical tool is dependent not only on the availability of new high-throughput sequencing technologies, but also on the concomitant development of methods and tools for data collection, analysis, and interpretation. It would also be enormously facilitated by the development of decision support systems for clinicians and consideration of how such information can best be incorporated into care pathways. Here we present an overview of the data analysis and interpretation pipeline, the wider informatics needs, and some of the relevant ethical and legal issues.

Genet Med 2013:15(3):165–171

Key Words: bioinformatics; data analysis; massively parallel; next-generation sequencing

INTRODUCTION

Technological advances have resulted in a dramatic fall in the cost of human genome sequencing. However, the sequencing assay is only the beginning of the process of converting a sample of DNA into meaningful genetic information. The next step of data collection and analysis involves extensive use of various computational methods for converting raw data into sequence information, and the application of bioinformatics techniques for the interpretation of that sequence. The enormous amount of data generated by massively parallel next-generation sequencing (NGS) technologies has shifted the workload away from upstream sample preparation and toward downstream analysis processes. This means that, along with the development of sequencing technologies, concurrent development of appropriate informatics solutions is urgently needed to make clinical interpretation of individual genomic variation a realistic goal.1,2

THE DATA ANALYSIS PIPELINE

The data analysis process can be broadly divided into the following three stages (Figure 1).

Primary analysis: base calling
That is, converting raw data, based on changes in light intensity or electrical current, into short sequences of nucleotides.3

Secondary analysis: alignment and variant calling
That is, mapping individual short sequences of nucleotides, or reads, to a reference sequence and determining variation from that reference.4

Tertiary analysis: interpretation
That is, analyzing variants to assess their origin, uniqueness, and functional impact.5

Each of these steps requires purpose-built databases, algorithms, software, and expertise to perform. By and large, issues related to primary analysis have been solved and are becoming increasingly automated, and are therefore not discussed further here. Secondary analysis is also becoming increasingly automated for human genome resequencing, and methods of mapping reads to the most recent human genome reference sequence (GRCh37), and calling variants from it, are becoming standardized. The major bottleneck for wider clinical application of NGS is the interpretation of sequence data, which is still a nascent field in terms of developing algorithms, appropriate analytical tools, and effective evidence bases of human genotype–phenotype associations. Although the steps involved in interpretation and application of results will vary depending on the specific clinical setting, the purpose of the testing, and the clinical question, it is likely that there will be commonalities in both the basic analysis pipeline and tools used. Similarly, although some of the details may change with the introduction of third-generation sequencing technologies, such as those that involve real-time detection of single molecules, the data analysis challenge posed by massively parallel sequencing technologies will remain essentially the same. Below, we provide an overview of the secondary and tertiary steps involved in analyzing raw sequence reads from NGS technologies.

Figure 1 Outline of informatics pipeline for processing and analyzing data from massively parallel sequencing platforms. [Flow diagram. Primary analysis (base calling): signal analysis, base calling, quality scoring (raw). Secondary analysis (alignment): mapping reads to reference, quality scoring (consensus), variant calling (read depth analysis, paired-end analysis, haplotype inference). Tertiary analysis (interpretation): variant analysis (genetic filtering, functional filtering), validation of candidates, clinical interpretation and decision-making (diagnostic, therapeutic). Also shown: reference sequence, visualization, databases of variation.]

SECONDARY ANALYSIS: VARIANT CALLING AND ANNOTATION

Following initial base calling, the next step toward generating useful genetic information from sequencing reads (i.e., short sequences of nucleotides) involves assembly of the reads into a complete genome sequence by comparison of multiple overlapping reads and the reference sequence. Longer read lengths would make this process much simpler, as each read would be much more likely to map uniquely to a single position in the genome. The process is complicated by the extent of variation in the sample sequence from the reference sequence, and generating a list of true variants is made challenging because there are several possible explanations for differences between the reference and sample genome:

• Inaccuracies in the reference genome, which does not represent any single individual and is still incomplete due to highly repetitive regions that have yet to be sequenced. Furthermore, to date, the reference sequence has been regularly updated, which has caused specific regions to map to different locations between versions.
• Incorrect base calls in the sample sequence due to sequencing or amplification errors.6–8 However, this can be mitigated through read-depth analysis (i.e., evaluating the number of times an individual base is sequenced in independent reads) to assess the reliability of each base call.
• Incorrect alignment between the reference and sample may arise due to the highly repetitive nature of substantial portions of the genome and (in the case of NGS platforms) short read lengths. This can result in individual reads mapping to multiple locations. The likelihood that a read is mapped correctly can be indicated […]
• Variation may represent true genetic variation in the sample. As each human genome is estimated to differ from the reference sequence at ~3–4 million sites,9,10 including single-nucleotide changes and structural variants, mapping these variants is a challenge. Determining which variants are derived from the same physical chromosome can also be difficult from short reads, and haplotypes must either be assembled11 or imputed.12 Using paired-end reads for mapping is particularly important for identifying structural variation, and is critical for cancer genome sequencing due to the presence of extensive large structural rearrangements relative to the matched germline genome, including both intra- and interchromosomal rearrangements.13

The addition of biological information to sequence data is an important step toward making it possible to interpret the potential effects of variants, and involves a mixture of automatic annotation by computational prediction and manual annotation (curation) by expert interpretation of experimental data. The main steps relate to structural information (e.g., gene location, structure, coding, splice sites, and regulatory motifs) and functional information (e.g., protein structure, function, interactions, and expression).14 A fully annotated sequence can include an enormous amount of information, including common and rare variants, comparisons with other species,15 known genetic and epigenetic variants, regulatory features, transcript and expression data, as well as links to protein databases. However, annotation is currently both incomplete and imperfect in the human genome. Projects such as ENCODE, the Encyclopedia of DNA Elements,16,17 and the related GENCODE, the encyclopedia of genes and gene variants,18,19 aim to identify all functional elements in the human genome sequence and will ultimately be used to annotate all evidence-based gene features in the entire human genome with high accuracy.

Numerous programs have been developed specifically for genome assembly, alignment, and variant calling based on DNA sequence reads from high-throughput next-generation sequencing platforms2,4,20,21 (Table 1). These include software developed for use with a particular sequencing platform, open-access academic software with a variety of functionalities and platform compatibilities, and proprietary software designed for specific purposes such as diagnostics (Table 2). A major issue for standard alignment programs is the interpretation of small insertions and deletions, which has been partly addressed by the development of new programs specifically for this purpose22; however, current technology does not yet allow for confident analysis or interpretation of small insertions and deletions as long as several hundred base pairs, especially repeat sequences. Various dedicated software packages have also been developed […]
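
As a rough illustration of the read-depth analysis described above, the following Python sketch counts how many aligned reads cover each reference position and reports a mismatch only where enough independent reads agree on it. The function names (`pileup`, `call_variants`), the thresholds (`min_depth`, `min_fraction`), and the toy reads are assumptions made purely for illustration; they are not taken from any of the tools cited in this review, and production variant callers additionally weigh base-call and mapping quality scores, gapped alignments, and diploid genotypes.

```python
from collections import Counter


def pileup(reads):
    """Count the bases observed at each reference position.

    `reads` is an iterable of (start, sequence) pairs: a 0-based reference
    start position and the ungapped bases of one aligned read.
    (Hypothetical input format chosen for this sketch.)
    """
    columns = {}
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns.setdefault(start + offset, Counter())[base] += 1
    return columns


def call_variants(reference, reads, min_depth=3, min_fraction=0.8):
    """Report positions where well-covered reads disagree with the reference.

    The thresholds are illustrative assumptions, not recommended values.
    """
    variants = []
    for position, counts in sorted(pileup(reads).items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few independent reads to trust a call here
        base, support = counts.most_common(1)[0]
        if base != reference[position] and support / depth >= min_fraction:
            variants.append((position, reference[position], base, depth))
    return variants


reference = "ACGTACGTAC"
reads = [
    (0, "ACGAAC"),   # mismatch at position 3 (T -> A)
    (1, "CGAACG"),   # supports the same mismatch
    (2, "GAACGT"),   # supports the same mismatch
    (3, "AACGTAC"),  # supports the same mismatch
    (6, "GTCC"),     # lone mismatch at position 8, covered by only two reads
]
print(call_variants(reference, reads))
```

In this toy input the mismatch at position 3 is seen in four independent reads and is reported, whereas the mismatch at position 8 is covered by only two reads and is skipped as unreliable, which is the intuition behind using depth to separate likely sequencing errors from candidate variants.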