Microarray Bioinformatics and Applications in Oncology
Justine Kate Peeters

The work in this thesis was performed in the Department of Bioinformatics, Erasmus University Medical Center, Rotterdam, The Netherlands.

The printing of this thesis was financially supported by:
J.E. Juriaanse Stichting, Rotterdam, The Netherlands
The Netherlands Bioinformatics Center (NBIC)
Affymetrix Europe

Cover: painted by Anne Karin Pettersen Arvola, with compliments of Therese Sorlie
Print: Printpartners Ipskamp, Enschede, www.ppi.nl
Lay-out: Legatron Electronic Publishing, Rotterdam
ISBN: 978-90-9023048-1

Copyright © J.K. Peeters. All rights reserved. No part of this thesis may be reproduced or transmitted in any form, by any means, electronic or mechanical, without the prior written permission of the author, or where appropriate, of the publisher of the articles.

Microarray Bioinformatics and Applications in Oncology
Toepassingen van bioinformatica en microarray’s in oncologie

Thesis to obtain the degree of Doctor at Erasmus Universiteit Rotterdam, by authority of Prof. dr. S.W.J. Lamberts and in accordance with the decision of the Doctorate Board.
The public defence will take place on Wednesday 11 June 2008 at 13.45 hours
by Justine Kate Peeters, born in Melbourne, Australia

Doctoral committee
Promotor: Prof.dr. P.J. van der Spek
Co-promotor: Dr. A.E.M. Schutte
Other members: Prof.dr. F.G. Grosveld, Prof.dr. P.A.E. Sillevis Smitt, Prof.dr. L.H.J. Looijenga

This thesis is dedicated to my Oma ‘Wilhelmena Johanna Peeters’ (1918-2004).
Table of Contents

1 Introduction to Microarray Bioinformatics
1.0 Introduction to microarray bioinformatics
1.1 The age of science and the evolution into the ‘Omics’ era
1.2 Microarray technology
1.3 Microarray experiment design
1.3.1 Biological variation
1.3.2 Technical and system variation
1.4 Gene Chip technology
1.5 Labeling and hybridization procedure
1.6 Scanning expression microarrays: converting probe sets to signal intensity
1.6.1 Data output from the scanner
1.6.2 Normalization and summarization
1.6.2.1 MAS normalization
1.6.2.2 Quantile normalization
1.6.2.3 RMA/RMAexpress
1.6.2.4 Normalization by VSN
1.6.3 Other Transformations
1.6.4 Choice of normalization
1.7 Clustering: Unsupervised analysis
1.7.1 Hierarchical clustering
1.7.2 Partitioning clustering
1.7.3 Multi-dimensional clustering
1.7.4 Choice of clustering method
1.8 Visualization of gene/sample similarity: Pearson correlation matrix
1.9 Supervised analysis
1.9.1 Class comparison
1.9.2 Problem of multiple testing: p-values and false discovery rates
1.9.3 Class prediction
1.9.4 Cross-validation
1.10 Validation of results
1.11 Pattern discovery: ontological classification and pathway analysis
1.12 Various types of microarray
1.12.1 Exon arrays
1.13 Bioinformatics bibliography

2 Introduction to Cancer
2.0 Introduction to Cancer
2.1 Cancer
2.2 Breast cancer
2.2.1 Normal breast histology
2.2.2 Malignant breast histology
2.2.3 Incidence and risk factors
2.2.4 Prognosis and therapy
2.3 Brain tumors
2.3.1 Brain tumor pathology
2.3.2 Incidence and risk factors
2.3.3 Prognosis and therapy
2.4 The genetics of cancer
2.4.1 Accumulation of mutations in several genes
2.4.2 Somatic and germline mutations
2.4.3 Cancer genes
2.4.3.1 Oncogenes
2.4.3.2 Tumor Suppressor genes
2.4.3.3 Stability genes
2.4.3.4 Epigenetic regulation
2.4.4 Cancer genes in breast cancer and brain tumors
2.4.4.1 EGFR and ERBB2
2.4.4.2 TP53
2.4.4.3 E-cadherin
2.4.4.4 PTEN
2.4.4.5 BRCA1, BRCA2 and CHEK2
2.5 Gene expression in cancer
2.5.1 Gene expression, gene mutations and cell biology
2.5.2 Breast cancer gene expression profiles
2.5.3 Brain tumor gene expression profiles
2.6 Cancer Bibliography

3 Growing Applications and Advancements in Microarray Technology and Analysis Tools
4 Epigenetic silencing and mutational inactivation of E-cadherin associate with distinct breast cancer subtypes
5 Gene expression profiling assigns CHEK2 1100delC breast cancers to the luminal intrinsic subtypes
6 Identification of differentially regulated splice-variants and novel exons in glial brain tumors using exon expression arrays
7 Exon expression arrays as a tool to identify new cancer genes

8.0 Discussion
8.1 Microarray applications to oncology
8.2 Considerations on microarray technology
8.2.1 Sample variability
8.2.2 Technical variability: the probes
8.2.3 Reproducibility: different platforms and multiple array comparison
8.2.4 Analytical variability
8.3 Focused microarray analysis
8.4 Considerations on microarray applications in oncology
8.4.1 Epigenetic inactivation of E-cadherin by methylation is distinct from genetic inactivation by mutation
8.4.2 A gene signature is associated with CHEK2 1100delC mutations in breast cancer
8.4.3 Exon arrays identify differentially expressed splice variants in brain tumors
8.4.4 Exon arrays identify exon-skipping mutations in breast cancer cell lines, breast and brain tumors
8.5 The future of microarrays applications in oncology and final conclusions
8.6 Discussion bibliography

Samenvatting
Summary
Dankwoord / Acknowledgements
Curriculum vitae
List of publications
Abbreviations

Appendices
1 Natural population dynamics and expansion of pathogenic clones of Staphylococcus aureus
2 http://www-bioinf.erasmusmc.nl/thesis_peeters
3 Published interview: Discovery of Novel Splice Variations Improves Glial Tumor Classification
4 Published interview: Global view of gene expression analysis

*For all large supplementary data files, tables and colour versions of figures and PDFs, please refer to the website http://www-bioinf.erasmusmc.nl/thesis_peeters.

Chapter 1 Introduction to Microarray Bioinformatics

1.0 Introduction to microarray bioinformatics

‘Bioinformatics’ is one of the newest fields of biological research, and should be viewed broadly as the use of mathematical, statistical, and computational methods for the processing and analysis of biological data [1]. Over the last decade, the rapid growth of information and technology in both the ‘genomics’¹ and ‘omics’² eras has made it overwhelming for laboratory scientists to process experimental results. Traditional gene-by-gene approaches are insufficient to meet the growth and demands of biological research aimed at understanding the underlying biology. The massive amounts of data generated by new technologies such as genomic sequencing and microarray chips make data management and the integration of multiple platforms highly important; these are followed by data analysis and interpretation to achieve biological understanding and therapeutic progress. A global view is needed to analyze information of this magnitude, and traditional approaches to lab work have steadily been shifting towards a bioinformatics era. Research is moving from being restricted to a laboratory environment to working with computers in a ‘virtual lab’ environment.

1.1 The age of science and the evolution into the ‘Omics’ era

The invention of the Polymerase Chain Reaction (PCR) technique was a major milestone in molecular research that transformed and revolutionized current research and diagnostics [3].
Since the introduction of PCR, linkage analysis and mutation screening became easier and the rate at which disease genes were identified increased dramatically. Another milestone that influenced the rate of discovery was the Human Genome Project [4,5]. The main aim of this project was to create a detailed physical map of the human genome. Having genomic sequences available made the identification of disease-related mutations in Mendelian single-gene disorders an easier task; however, the mapping of complex diseases such as diabetes and cancer remains a challenge. The successful completion of the Human Genome Project, as well as the sequencing of the genomes of many other species, has generated a large amount of freely available information, opening the door to a post-genome era where an ‘e-science’ approach can allow in silico research and further mining of available data. One of the most challenging objectives of the post-genome era is to understand the complex genome and its interacting products.

¹ The study of all of the nucleotide sequences, including structural genes, regulatory sequences, and non-coding DNA segments, in the chromosomes of an organism. Also seen as a branch of biotechnology concerned with applying the techniques of genetics and molecular biology to the genetic mapping and DNA sequencing of sets of genes or the complete genomes of selected organisms using high-speed methods, with organizing the results in databases, and with applications of the data (as in medicine or biology) [2] (http://www.dictionary.com).
² ‘Omics’ refers to a field of study in biology ending in the suffix -omics, such as genomics or proteomics. The related neologism ‘omes’ addresses the objects of study of such fields, such as the genome or proteome respectively [2] (ibid.).

Various techniques to study gene expression have been developed, although it is important to remember that RNA (ribonucleic acid) expression levels do not directly reflect protein levels. Approaches such as subtractive hybridization of cDNA libraries and differential display have been used with some success; however, these techniques are laborious and not suited to global gene analysis.