Downloaded Directly from the TCGA
UC San Diego
UC San Diego Electronic Theses and Dissertations

Title
Building bioinformatic tools for massive repurposing of multi-omic data in the Sequence Read Archive

Permalink
https://escholarship.org/uc/item/62g7m386

Author
Tsui, Brian Yik Tak

Publication Date
2019

Peer reviewed|Thesis/dissertation

eScholarship.org Powered by the California Digital Library University of California

UNIVERSITY OF CALIFORNIA SAN DIEGO

Building bioinformatic tools for massive repurposing of multi-omic data in the Sequence Read Archive

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Bioinformatics and Systems Biology

by

Brian Yik Tak Tsui

Committee in charge:
Professor Hannah Carter, Chair
Professor Jill Mesirov, Co-Chair
Professor Ruben Abagyan
Professor Terry Gaasterland
Professor Nathan Lewis

2019

Copyright
Brian Yik Tak Tsui, 2019
All rights reserved.

The Dissertation of Brian Yik Tak Tsui is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

University of California San Diego
2019

DEDICATION

I want to thank all my friends and family members who have made my Ph.D. journey fun and enjoyable.

TABLE OF CONTENTS

SIGNATURE PAGE
DEDICATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
VITA
ABSTRACT OF THE DISSERTATION
INTRODUCTION
    References
CHAPTER 1: Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive
    1.1 Abstract
    1.2 Introduction
    1.3 Results
    1.4 Discussion
    1.5 Materials and Methods
    1.6 Figures
    1.7 Tables
    1.8 Supplementary code and data availability
    1.9 Acknowledgements
    1.10 References
CHAPTER 2: SkyMap JupyterHub: A cloud platform to query and analyze >900,000 re-processed public sequencing experiments from >20,000 studies
    2.1 Abstract
    2.2 Introduction
    2.3 Results
    2.4 Discussion
    2.5 Materials and Methods
    2.6 Figures
    2.8 Table
    2.10 Acknowledgements
    2.11 References
CHAPTER 3: Creating a scalable deep learning based Named Entity Recognition Model for biomedical textual data by repurposing BioSample free-text annotations
    3.1 Abstract
    3.2 Introduction
    3.3 Method and materials
    3.4 Results
    3.5 Discussion
    3.6 Figures
    3.7 Tables
    3.8 Supplemental Figures
    3.9 Supplementary Table
    3.10 Acknowledgement
    3.11 References
EPILOGUE

LIST OF FIGURES

Figure 1.1. Number of sequencing runs is increasing exponentially in the SRA
Figure 1.2. Simple pipeline for extracting >300,000 human sequencing runs from the SRA.
Figure 1.3. Distribution of the processed SRA data.
Figure 1.4. Targeted reference remains accurate for sequence alignment.
Figure 1.5. Effect of PCR duplicates removal
Figure 1.6. RNA-seq can recover variants extracted from WXS
Figure 2.1. Overview
Figure 2.2. The landscape of reprocessed SRA data
Figure 2.3. Variant extraction using our JupyterHub.
Figure 2.4. Expression analysis using our JupyterHub.
Figure 2.5. Large-scale meta-analysis using reprocessed RNA-seq data.
Figure S2.1: SRA data landscape.
Figure S2.2: Microbe read lengths.
Figure S2.3: Workflow diagram.
Figure S2.4: Distribution of microbe abundance in the SRA.
Figure S2.5: Plotly cloud user interface.
Figure S2.6: Reference