A Genomic Analysis Pipeline and Its Application to Pediatric Cancers

UC Irvine Electronic Theses and Dissertations

Title: A Genomic Analysis Pipeline and Its Application to Pediatric Cancers
Permalink: https://escholarship.org/uc/item/2vk0r9s3
Author: Zeller, Michael
Publication Date: 2014
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA, IRVINE

A Genomic Analysis Pipeline and Its Application to Pediatric Cancers

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY in Computer Science

by Michael Dylan Zeller

Dissertation Committee:
Professor Pierre Baldi, Chair
Associate Professor Xiaohui Xie
Professor Bogi Andersen

2014

Portions of Chapters 1-6 © 2014 IEEE
Chapter 3 Section 3.1.3 © 2013 Elsevier, Inc.
Chapter 3 Section 3.2.2 © 2010 Springer-Verlag
Chapter 3 Section 3.2.5 © 2013 Nature Publishing Group
Chapter 4 Section 4.1.1 © 2013 National Academy of Sciences
All other materials © 2014 Michael Dylan Zeller

DEDICATION

In memory of my father, Donald G. Zeller.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
CURRICULUM VITAE
ABSTRACT OF THE DISSERTATION
1 Introduction
  1.1 Heterogeneity of cancer
  1.2 Gene expression in cancer
  1.3 The rise of personalized medicine
  1.4 Pediatric cancer
  1.5 Genomic analysis pipeline
2 Sequence processing methods
  2.1 DNA-seq
    2.1.1 Genome assembly
    2.1.2 Identifying variations
    2.1.3 Variant location and effect
  2.2 RNA-seq
    2.2.1 Small variations in transcripts
    2.2.2 De novo assembly
  2.3 ChIP-seq
    2.3.1 TCF1E ChIP-seq in a colon cancer cell line
    2.3.2 Sequence features of GRHL3 ChIP-seq peaks
3 Expression analysis
  3.1 Transcription factor binding sites
    3.1.1 Searching for motifs
    3.1.2 Incorporating conservation
    3.1.3 Phosphorylation site prediction from motifs
    3.1.4 Alternative target recognition in the Wnt signaling pathway
    3.1.5 Enrichment analysis
    3.1.6 Defining the transcriptional regulatory network of skin wounding
  3.2 Differential expression analysis
    3.2.1 Microarray differential analysis
    3.2.2 Identifying differential antigens
    3.2.3 RNA-seq differential analysis
    3.2.4 RNA-seq differential analysis in pediatric patients
    3.2.5 Neuron-specific chromatin regulatory subunit BAF53b
  3.3 Hybrid approach
4 Integrative analysis
  4.1 Network analysis
    4.1.1 Circadian clock regulates the host response to Salmonella
  4.2 Cancer-specific integrative analysis
    4.2.1 Curating literature
    4.2.2 GEO datasets
    4.2.3 Gene list overlaps
    4.2.4 Normal tissue expression
    4.2.5 Mitelman fusions
    4.2.6 Prognosis
  4.3 Cancer pathways
  4.4 Identifying drug candidates
5 Reporting
  5.1 Framework
  5.2 Ranking pathways and genes in pediatric cancer
  5.3 Interesting findings in pediatric cancer
    5.3.1 Patient CHOC23 (AML)
    5.3.2 Patient CHOC33 (Neuroblastoma)
    5.3.3 Patients CHOC36 and CHOC03 (AML)
6 Conclusion
  6.1 Applications of pipeline
  6.2 Comparison to other work
  6.3 Importance of automation
  6.4 Value of integrative approach
  6.5 Future work
Bibliography
A Supplementary Information
  A.1 Hardware requirements
  A.2 Software requirements
  A.3 Pediatric cancer
B Collaborative interface

LIST OF FIGURES

1.1 Heterogeneity of cancer
2.1 Pipeline overview
2.2 RNA-seq coverage for Grhl3 transcript in knock-out
2.3 k-medoid clustering of time-course RNA-seq reads
2.4 dnTCF1E ChIP-seq reads in Colon cancer cells
2.5 Identifying enriched features in ChIP-seq peaks
3.1 Transcriptional regulation overview
3.2 Visual representation of position-weight matrix
3.3 Predicted phosphorylation sites using PWM search
3.4 Motifs identified in TCF1 ChIP-seq peaks
3.5 Histogram of frequency of C-clamp motif in ChIP-seq peaks
3.6 Identifying functional ChIP-seq peaks based on PWMs
3.7 Enriched transcription factors in human skin wounding response
3.8 Example microarray image for two samples
3.9 Identifying differential antigens
3.10 Performance of predicting antigens using protein microarrays
3.11 Expression differences between patients
3.12 RNA-seq differential genes in learning and memory
3.13 Regression of rat RNA-seq with microarray for the same samples
3.14 Yeast hybrid differential analysis results
4.1 Transcription factor network integrated with microarray data
4.2 Quantile normalized microarray datasets for each cancer type
4.3 Diagram of the processes used to prioritize expression variations
4.4 Distribution of genetic variants among RPKM values
4.5 Detecting false positive differential genes
4.6 Prognostic markers and survival curves
4.7 Example of a network with drug, interaction, and transcription databases
4.8 Portion of KEGG AML pathway for patient CHOC36
4.9 CHOC36 drug-target edges in AML pathway
5.1 Example org-mode markup for reports
5.2 UCSC Genome Browser displaying genetic variations in PARK2 transcripts
5.3 PARK2 gene expression in patient tumors
5.4 Enriched transcription factor LMO2 in AML
6.1 Technology biases in different sequence technologies

LIST OF TABLES

2.1 Typical distribution of the small variations
2.2 Distribution of extracted gene lists per patient
3.1 Contingency table
3.2 Enriched transcription factors in skin wounding
4.1 Distribution of curated gene lists
4.2 Number of gene symbols in curated gene lists
4.3 Number of microarray samples per type of cancer
4.4 Significance of gene list overlaps with missense mutations
4.5 Hits in Mitelman fusion database for sarcoma patient
5.1 Top 10 ranked pathways for CHOC23 (AML)
5.2 Top 10 curated gene ranking for CHOC23 (AML)

ACKNOWLEDGMENTS

I would like to acknowledge all of our past and present collaborators. Without their data, none of this work would have been possible: Olivier Cinquin Lab, Marcelo Wood Lab, Bogi Andersen Lab, Philip Felgner Lab, Lee Bardwell Lab, Paolo Corsi Lab, Alan Barbour Lab, Susan Sandmeyer Lab, Peter Kaiser Lab, and Dr. Sender and his colleagues at CHOC.

In particular I would like to acknowledge Amy Hopkin, Liz and Will Gordon, Ramzi Nasr, Marina Bellet, Arlo Randall, Christophe Magnan, Matthew Kayala, Paul Rigor, Kenny Daily, Pietro di Lena, Vishal Patel, Thomas Kristensen, Annie Vogel, James Yu, Nate Hoverter, Yu Liu, Nick Ceglia, Peter Sadowski, and last but certainly not least, Stacey Borrego.

I would also like to thank those entities whose funding made this work possible:

1. Bioinformatics Training Grant (BIT). NIH Grant 5T15LM007743. 2009-2013
2. UC Irvine Center for Complex Biological Systems Opportunity Award. Defining the Transcriptional Regulatory Network of Acute Human Epidermal Wound Response in Time and Space. Collaboration with William Gordon, Bogi Andersen Lab. 2011-2012

This work has been made possible due to support by a grant from the Hyundai Foundation to Dr. Leonard Sender and grants NIH LM01, NIH NLM T15 LM07, and NSF IIS-0513376 to Pierre Baldi.

We acknowledge also the support of the CHOC, the UCI Institute for Genomics and Bioinformatics, and the UCI Genomics High-Throughput Facility. Additional support of our computational infrastructure has been provided by Jordan Hayes and Yuzo Kanomata.

Professors Rick H. Lathrop and Charless Fowlkes are acknowledged for providing guidance while serving on my advancement committee. Professor Xiaohui Xie for providing exceptional training for the BIT program coursework. Professor Bogi Andersen for providing insightful discussion during the course of my PhD through regular research meetings.

I thank Professor Pierre Baldi for his mentorship and guidance during my graduate career and for his valuable advice on all aspects of my thesis.

In addition, the inclusion of our previous articles discussed in this document was made possible by the generous policies of IEEE, Springer-Verlag, the National Academy of Sciences, Elsevier, Inc., and the Nature Publishing Group.

CURRICULUM VITAE

Michael Dylan Zeller

EDUCATION

Doctor of Philosophy in Computer Science, 2014
University of California, Irvine; Irvine, California

Bachelor of Science in Computer Science, 2008
Washington University; St. Louis, Missouri

Bachelor of Science in Biomedical Engineering, 2008
Washington University; St. Louis, Missouri

RESEARCH EXPERIENCE

Graduate Research Assistant, 2008–2014
University of California, Irvine; Irvine, California

TEACHING EXPERIENCE
