
HEMATOPOIETIC CELL POPULATION SEGREGATION THROUGH FULL-LENGTH TRANSCRIPTOME SEQUENCING

A Dissertation submitted to the Faculty of the Graduate School of Arts and Sciences of Georgetown University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Tumor Biology

By Anne Deslattes Mays, M.Sc., M.Sc.

Washington, DC
July 17, 2015

Copyright 2015 by Anne Deslattes Mays
All Rights Reserved

Thesis Advisor: Anton Wellstein, M.D., Ph.D.

ABSTRACT

“Progress in science results from new technologies, new discoveries and new ideas, probably in that order.” Nobel Laureate Sydney Brenner (1927 - )

Sequencing the human genome was a critical first step in laying the groundwork for understanding the molecular programming involved in transforming a cell from a healthy to a cancerous state. The complexity of the cellular transcriptome has become increasingly apparent as technological advances have exposed its diversity. Full-length RNA sequencing is crucial for an unbiased analysis of this complexity, which arises from posttranscriptional processing of primary transcripts that generates a variety of isoforms from the same genomic locus. Distinct cell lineages are defined by their transcript isoform expression profiles, and cells can be annotated by their expression of transcript isoforms that can result in functionally different proteins. Alternative splice site utilization provides cells with a powerful mechanism for regulating gene expression that can alter the composition of the protein product and influence the rate of translation of transcripts from multi-exon genes.

The overall goal of this project was to delineate the hematopoietic transcriptome revealed by full-length sequencing and to assess the shortcomings of transcriptome reconstruction using fragmented-read sequencing.
The aims were (a) to evaluate the complexity of the hematopoietic transcriptome using full-length RNA sequencing, (b) to compare the full-length RNA-sequencing transcriptome with the transcriptome reconstructed from fragmented-read sequencing, and (c) to evaluate whether hematopoietic cell subpopulations show distinct transcriptome patterns.

Transcriptome reconstruction from fragmented-read sequencing has advanced our understanding of the transcriptome. Here we show that full-length transcriptome sequencing is necessary to faithfully expose the transcriptome and understand its complexities. Abundance information and pathway analysis support this. In addition, full-length sequencing reveals open reading frames that code for contiguous canonical or fusion proteins that can be validated with peptides. This transcriptome diversity is consistent with the distinct phenotypes of cell subpopulations present in tissues. Accurate transcriptome measurement builds a foundation that can be relied upon to ensure higher success rates for therapeutics and lower false discovery rates for biomarkers of disease.

The analysis of transcripts of a set of selected genes, together with the potential for posttranscriptional processing, predicts a highly complex transcriptome and an abundance of hitherto unknown protein isoforms. Classic approaches have not allowed full testing of this hypothesis due to limitations in sequencing lengths. Full-length sequencing technology provides an opportunity to uncover transcripts that cannot be obtained through traditional transcript reconstruction techniques.

DEDICATION

Life and time are often inconvenient partners that challenge us to keep moving forward while circumstances do their best to derail us from our chosen paths. While I was heading up software development for Craig Venter and his team at Celera (sequencing the human genome), my husband suffered a serious stroke.
A long and ongoing recovery period has followed. During this period my father was stricken with and subsequently died of lung cancer. Time passed all too quickly as I raised my daughter through adolescence while at the same time my mother began her slow drift into dementia. It was during this period between my father’s death and the early stages of my mother’s decline that I committed myself to pursuing my PhD. It was by no means an easy decision, and one that has tested the bounds of time, life and love for me and those around me.

I feel the need to complete the work that Craig Venter had asked us to do at Celera and to show my daughter what can be accomplished despite the adversity and circumstances that life and time present us. Craig challenged us first to sequence the human genome and then to cure cancer. We are not there yet, and I am committed to helping accomplish that task. I feel strongly that we must use the skills and resources we have, from biological assay, to mathematical algorithm, to complex computer infrastructure, to beat down the details in such a way that we can dissect the signals of disease and understand its origins. We must ask the right questions, use the right tools, and work to de-obfuscate the information we have.

To my daughter Katie, thank you for your love and support throughout these years. You inspire me and challenge me in unexpected and surprising ways. Thank you for revealing what you see with your eyes, hear with your ears and create with your unbelievable imagination. I am so grateful for your understanding and patience throughout this journey. This thesis is dedicated to you.

ACKNOWLEDGEMENTS

I would like to thank my mentor, Dr. Anton Wellstein. He gave me a project that he thought was right up my alley. The journey presented challenges unseen at the beginning, yet ultimately produced results well worth the effort. I would also like to thank Dr. Anna Tate Riegel for allowing me to enter the program.
Her belief that I could do this and her willingness to help me navigate the process inspired me to try despite the obstacles and challenges.

I am grateful to my Thesis Committee Members: Dr. Michael Johnson, Dr. Anatoly Dritschilo, Dr. Habtom Ressom, Dr. Yuri Gusev, and Dr. Christopher Loffredo. Their time, support and advice during the development of my project have been greatly appreciated. A special “thank you” to Dr. Terence Ryan for all your support throughout these years and for being my external committee member.

I wish to thank Dr. Eric Schadt; a brief encounter at Cold Spring Harbor started me on the path to completing my journey. Thank you for getting Dr. Robert Sebra to sequence that first sample for me; I wouldn't be writing these words had you not started that ball rolling. I wish to thank Dr. Mike Hunkapiller and Dr. Elizabeth Tseng and the Pacific Biosciences collaboration that made the completion of this work possible. Thank you, Liz, for your software, your friendship and your dedication. None of the work presented in this thesis would have been possible without it.

I would also like to thank present and past members of the Wellstein-Riegel lab, Drs. Sonia Rosenfeld and Elena Tassi, who shared different lab corners over the years and who witnessed my trials and tribulations, supporting me with love and kindness without which I would not have made it to this end. To Garrett Graham, thank you for being an unofficial member of the lab. I am appreciative of the late night and weekend discussions regarding GRanges, BioViz and all other things Bioconductor. The spirit in this lab is without match; the program is a special one of dedication and striving for proper and correct science.

I would like to thank current and past members of KeyGene, especially Dr. Arjen van Tunen, Dr. Leo Zwinkels, Dr. Mark van Haaren, Dr. An Michiels, Dr. Jan van Oeveren, Mike Cariaso and Matthew McCoy, and most recently Dr. Fayaz Khazi, for your support throughout the years.

To the future Dr. Rutger van Bergem, it was wonderful to have a partner in this final 800 meters of the race. I am too slow a runner to beat you and Dr. Eveline Vietsch in a running race, but I guess I got to finish just a hair ahead of you in this PhD race! Thank you for your encouragement and for pushing me along.

Finally, I would like to say a special thank you to Dr. Marcel Schmidt, who taught me all I know about working at the bench and has been a staunch supporter and friend throughout these years.

INDEX

CHAPTER 1 - INTRODUCTION
A. Genome, Genomic Loci of Genes, mRNA and mRNA Isoforms
B. Technological Advances Drive Discoveries
C. RNA Sequencing
D. Transcriptional Measurement Limitations
E. Cancer Discoveries and Landmarks
F. Hematopoietic Transcriptome
G. Transcript Expression, Structure and Mutational Landscape
H. Hypothesis, Goal and Specific Aims
