RNA and DNA Sequence Analysis of the Human Transcriptome
Total Page:16
File Type:pdf, Size:1020Kb
University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations 2013 RNA and DNA Sequence Analysis of the Human Transcriptome Jonathan M. Toung University of Pennsylvania, [email protected] Follow this and additional works at: https://repository.upenn.edu/edissertations Part of the Bioinformatics Commons, and the Genetics Commons Recommended Citation Toung, Jonathan M., "RNA and DNA Sequence Analysis of the Human Transcriptome" (2013). Publicly Accessible Penn Dissertations. 709. https://repository.upenn.edu/edissertations/709 This paper is posted at ScholarlyCommons. https://repository.upenn.edu/edissertations/709 For more information, please contact [email protected]. RNA and DNA Sequence Analysis of the Human Transcriptome Abstract The manifestation of phenotype at the cellular and organismal level is determined in large part by gene expression, or the transcription of DNA into RNA. As such, the study of the transcriptome, or the characterization and quantification of all RNA produced in the cell, is important. Recent advances in sequencing technology have allowed for unprecedented interrogation of the transcriptome at single- nucleotide resolution. In the first part of this thesis, we use RNA-Sequencing (RNA-Seq) to study the human B-cell transcriptome and determine the experimental parameters necessary for sequencing-based studies of gene expression. We discover that deep sequencing is necessary to detect fully and quantify accurately the complexity of human transcriptomes. Furthermore, we find that at high sequencing depths, the vast majority of transcribed elements in human B-cells are detected. In the second part of this thesis, we utilize the sequence information provided by RNA-Seq to analyze systematic differences between DNA and RNA sequence. The transmission of information from DNA to RNA is a critical process and is expected to occur in a one-to-one fashion. By comparing the DNA sequence to RNA sequence of the same individuals, we found all 12 types of RNA-DNA sequence differences (RDDs), the majority of which cannot be explained by known mechanisms such as RNA editing or transcriptional infidelity. We developed computational methods to robustly identify RDDs and control for false positives resulting from genotyping, sequencing, and alignment error. Finally, we explore the genetic basis of RDD levels, or the proportion of reads at a site bearing the sequence difference allele. In particular, we analyzed the levels of RNA editing in unrelated and related individuals and found that a significant portion of individual variation in A-to-G editing levels contains a genetic component. In summary, our results demonstrate that RNA-Seq is a powerful technique for comprehensive and quantitative analysis of gene expression. In addition, the resolution offered by RNA-Seq enables a detailed view of sequence differences between RNA and DNA. Future work will focus on understanding the mechanisms and factors influencing RDDs. Our esultsr suggest that RDD levels may be considered a quantitative and heritable phenotype; as such, a genetic approach may be a sensible method for finding the determinants and mechanism of RDDs. Degree Type Dissertation Degree Name Doctor of Philosophy (PhD) Graduate Group Genomics & Computational Biology First Advisor Frederic Bushman Second Advisor Nancy Zhang Keywords Computational biology, Genome analysis, Next-generation sequencing, RNA Editing, RNA-Sequencing, Transcriptome analysis Subject Categories Bioinformatics | Genetics This dissertation is available at ScholarlyCommons: https://repository.upenn.edu/edissertations/709 RNA AND DNA SEQUENCE ANALYSIS OF THE HUMAN TRANSCRIPTOME Jonathan M. Toung A DISSERTATION in Genomics and Computational Biology Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy 2013 _______________________________ _______________________________ Frederic Bushman Nancy Zhang Professor of Microbiology Professor of Statistics Supervisor of Dissertation Co-supervisor of Dissertation _______________________________ Maja Bucan Professor of Genetics Graduate Group Chairperson Dissertation Committee: Christopher D. Brown, Assistant Professor of Genetics Gregory R. Grant, Research Assistant Professor of Genetics Shane T. Jensen, Associate Professor of Statistics (chair) Thomas A. Jongens, Associate Professor of Genetics Ponzy Lu, Professor of Chemistry Dedication For Esther and my family. ii Acknowledgements While many a times it has felt that the pursuit of this degree has been a solitary and singular effort, I would be foolish to not acknowledge the many people that have helped me tremendously along the way. This journey began by Dr. Ponzy Lu intermittently badgering me to pursue a Ph.D since the first time I sought out his advice as a freshman in the fall of 2006. My undergraduate mentor, thesis committee member, and life coach – I am truly indebted to your tireless support of me. I thank Dr. Vivian Cheung for her support of my studies. I thank Michael Morley for teaching me all there is to know about bioinformatics and programming. Much of what I have accomplished today computationally would not have been possible without your help. Thank you Dr. Isabel Xiaorong Wang. I truly appreciate your willingness to sit me down in your office and help with my committee meetings, my papers, my public talks. I thank past and present members of the Cheung lab for their help and support: Wendy Ankener, Will Bernal, Lauren Brady, Alan Bruzel, Jim Devlin, Susannah Elwyn, Brittany Gregory, Anna Lee, Sen Sen Liu, Yaojuan Liu, Colleen McGarry, Allison Richards, and Elizabeth So. I would not have been able to complete this degree had it not been for individuals who saw a precarious situation and did everything in their power to rectify it. Dr. Rick Bushman, I thank you so much for your support and encouragement throughout the year. Thank you for being the first to provide a space for me in your lab to finish my studies. You have gone above and beyond your duties as an advisor to vouch and fight for me. Dr. Nancy Zhang, I thank you for your support as my advisor. I cannot forget your efforts iii throughout the situation to make sure I could progress and move on with my studies. Your statistical insights into the project have always been inspiring, and I am truly grateful to have learned so much from you. Dr. John Hogenesch, you have been a tremendous help for my last paper. Your encouragement and advice on how to deal with the situation was invaluable. Dr. Greg Grant, chapter 4 would not have been possible without you. Thank you for your advice and devotion to the project. To members of my committee, Dr. Greg Grant, Dr. Shane T. Jensen, Dr. Tom Jongens, Dr. Ponzy Lu, Dr. Nancy Zhang and members of my academic review committee, Dr. Maja Bucan, Dr. Rick Bushman, and Dr. John Hogenesch – thank you all for getting me through this wild ride. Thank you Dr. Warren Ewens and Dr. Mingyao Li for your help and guidance on my studies. Thank you to everyone in the Genomics and Computational Biology program. A special thanks to Hannah Chervitz and all that you do to make our lives easier as students. To my parents, thank you for your support and lessons to never give up. I’m glad we made it through another journey. And lastly, to my dear girlfriend Esther. You deserve this degree as much as I do. You have put up with a lot in the past three years. It has been an emotional roller coaster and you have been with me every step of the way. Here’s to better days ahead. Someone once told me the wise saying that life is never as good as you imagine it to be nor as bad as you think it is. It will be just okay. iv ABSTRACT RNA AND DNA-SEQUENCE ANALYSIS OF THE HUMAN TRANSCRIPTOME Jonathan M. Toung Frederic Bushman Nancy Zhang The manifestation of phenotype at the cellular and organismal level is determined in large part by gene expression, or the transcription of DNA into RNA. As such, the study of the transcriptome, or the characterization and quantification of all RNA produced in the cell, is important. Recent advances in sequencing technology have allowed for unprecedented interrogation of the transcriptome at single-nucleotide resolution. In the first part of this thesis, we use RNA-Sequencing (RNA-Seq) to study the human B-cell transcriptome and determine the experimental parameters necessary for sequencing-based studies of gene expression. We discover that deep sequencing is necessary to detect fully and quantify accurately the complexity of human transcriptomes. Furthermore, we find that at high sequencing depths, the vast majority of transcribed elements in human B-cells are detected. In the second part of this thesis, we utilize the sequence information provided by RNA-Seq to analyze systematic differences between DNA and RNA sequence. The transmission of information from DNA to RNA is a critical process and is expected to occur in a one-to-one fashion. By comparing the DNA sequence to RNA sequence of the v same individuals, we found all 12 types of RNA-DNA sequence differences (RDDs), the majority of which cannot be explained by known mechanisms such as RNA editing or transcriptional infidelity. We developed computational methods to robustly identify RDDs and control for false positives resulting from genotyping, sequencing, and alignment error. Finally, we explore the genetic basis of RDD levels, or the proportion of reads at a site bearing the sequence difference allele. In particular, we analyzed the levels of RNA editing in unrelated and related individuals and found that a significant portion of individual variation in A-to-G editing levels contains a genetic component. In summary, our results demonstrate that RNA-Seq is a powerful technique for comprehensive and quantitative analysis of gene expression. In addition, the resolution offered by RNA-Seq enables a detailed view of sequence differences between RNA and DNA. Future work will focus on understanding the mechanisms and factors influencing RDDs.