Data Mining the Genetics of Leukemia


by Geoff Morton

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
January 2010

Copyright © Geoff Morton, 2010

Library and Archives Canada, Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada
ISBN: 978-0-494-65049-3

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

Abstract

Acute Lymphoblastic Leukemia (ALL) is the most common cancer in children under the age of 15. At present, diagnosis, prognosis and treatment decisions are made based upon blood and bone marrow laboratory testing. With advances in microarray technology it is becoming more feasible to perform genetic assessment of individual patients as well. We used Singular Value Decomposition (SVD) on Illumina SNP, Affymetrix and cDNA gene-expression data and performed aggressive attribute selection using random forests to reduce the number of attributes to a manageable size. We then explored clustering and prediction of patient-specific properties such as disease sub-classification, and especially clinical outcome. We determined that integrating multiple types of data can provide more meaningful information than individual datasets, if combined properly. This method is able to capture the correlation between the attributes. The most striking result is an apparent connection between genetic background and patient mortality under existing treatment regimes. We find that we can cluster well using the mortality label of the patients. Also, using a Support Vector Machine (SVM) we can predict clinical outcome with high accuracy. This thesis will discuss the data-mining methods used and their application to biomedical research, as well as our results and how this will affect the diagnosis and treatment of ALL in the future.

Acknowledgments

I would like to thank my supervisor Prof. David Skillicorn for the opportunity to work on this project, for all of the guidance he has given me along the way and for the chance to continue my work in Australia.
The School of Computing at Queen's University has provided me not only with a wonderful education but also the funding that made this work possible and for that I am grateful. Thanks also to Dr. Daniel Catchpoole for providing a different view into my work and making sure all of our results were practical and applicable, as well as for his hospitality during my time in Australia. To my friends and colleagues at the Children's Cancer Research Unit at The Children's Hospital at Westmead, I thank you for making my transition so easy and making it fun to come into work every day. And finally I would like to thank my friends and family for their support throughout this whole process. Although my stories may sound like gibberish to you, you were always there to listen.

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
  1.1 Problem
  1.2 My Contribution
Chapter 2: Background
  2.1 Acute Lymphoblastic Leukemia (ALL)
  2.2 The Datasets
  2.3 Random Forests
  2.4 Singular Value Decomposition (SVD)
  2.5 Support Vector Machine (SVM)
  2.6 Related Research
Chapter 3: Experiments
  3.1 Pre-processing
  3.2 Normalization
  3.3 Experimental Model
  3.4 Combination of Datasets
  3.5 Attribute Selection
  3.6 Analysis of Selected Attributes
  3.7 Validation of Results
  3.8 Further SNP Analysis
Chapter 4: Results
  4.1 SVD Results
  4.2 Combination of Datasets
  4.3 Analysis of Selected Data
  4.4 Attribute Selection
  4.5 Validation of Results
  4.6 Extended SNP Analysis
  4.7 Discussion of the Nature of the Data
Chapter 5: Conclusion
  5.1 Future Work
Bibliography

List of Tables

  2.1 Symptoms of ALL
  3.1 Random Forests Setup
  4.1 Combination Method 1
  4.2 Combination Method 2
  4.3 SNP Subset SVM Results
  4.4 cDNA Subset SVM Results
  4.5 Affy Subset SVM Results
  4.6 SNP and cDNA Subset SVM Results
  4.7 SNP and Affy Subset SVM Results
  4.8 cDNA and Affy Subset SVM Results
  4.9 SNP, cDNA and Affy Subset SVM Results
  4.10 Top 100 SNP Attributes
  4.11 Top 100 cDNA Attributes
  4.12 Top 100 Affy Attributes
  4.13 Top 100 SNP-cDNA Attributes
  4.14 Top 100 SNP-Affy Attributes
  4.15 Top 100 Affy-cDNA Attributes
  4.16 Top 100 SNP-cDNA-Affy Attributes
  4.17 Label Shuffling SVM Results
  4.18 SNP Relapse SVM Results
  4.19 Comparison of Top Attributes

List of Figures

  4.1 All SNP SVD
  4.2 All cDNA SVD
  4.3 All Affy SVD
  4.4 All Clinical SVD
  4.5 All SNP-cDNA SVD
  4.6 All SNP-Affy SVD
  4.7 All Affy-cDNA SVD
  4.8 SNP-cDNA-Affy SVD
  4.9 SNP Subset SVD
  4.10 SNP Subset SVD
  4.11 cDNA Subset SVD
  4.12 Affy Subset SVD
  4.13 SNP-cDNA Subset SVD
  4.14 SNP-Affy Subset SVD
  4.15 cDNA-Affy Subset SVD
  4.16 All Combined Subset SVD
  4.17 SNP Relapse SVD
  4.18 SNP Graph Analysis SVD
  4.19 Reformatted SNP Analysis SVD
  4.20 250 SNP SVD for old and updated labels
  4.21 SNP SVD for updated labels
  4.22 Intersecting SNP SVD

Chapter 1

Introduction

1.1 Problem

Cancer, in all of its forms, is the second leading cause of death in the United States [15] and accounts for 13% of all deaths worldwide [25]. It is estimated that in the United States in 2009 approximately 1.5 million people will have been diagnosed with cancer and that, of these, approximately 560,000 will die from their disease [22]. It is also estimated that approximately 30% of these cancer deaths are preventable [25]. The National Cancer Institute in Washington spends approximately $4.8 billion per year on cancer research, with most of the funding going towards breast, prostate, lung, colorectal and leukemia research [21]. Leukemia is the most common malignancy affecting children under the age of 15, but it also affects many adults.
There are four main subtypes of leukemia: acute lymphoblastic leukemia, chronic lymphocytic leukemia, acute myelogenous leukemia and chronic myelogenous leukemia. It is estimated that approximately 45,000 new cases of leukemia will have been diagnosed in the United States in 2009 [16]. The survival rate for persons with leukemia has increased dramatically over the past four decades. In the 1960s the five-year event-free survival rate was a mere 14%. In more recent years these figures have been quoted as being as high as 80% [12, 17, 31]. Although the treatment of this disease has improved significantly, roughly 20% of all leukemia cases still result in death.

With the completion of the Human Genome Project, the understanding of genetics has increased significantly. As a result, many new technologies have been developed to study the genome in many different forms. One of these technologies is the microarray, a high-throughput device that allows the expression levels of thousands of genes to be analysed simultaneously. Consequently, a wealth of data is being generated every day for many different purposes. The microarray has become a useful research tool and has allowed researchers to begin looking at problems on a much larger scale.
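
To give a sense of the shape of such data, the following minimal sketch applies SVD to a synthetic patients-by-attributes matrix, where each row stands for a patient and each column for one measured attribute such as a gene-expression level. It is an illustration only and not the code used in this thesis: the matrix dimensions, the two synthetic patient groups, the number of retained components and all variable names are assumptions, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 patients x 500 attributes (e.g. expression levels).
# Two synthetic groups differ in the first 50 attributes only.
n_patients, n_attributes = 20, 500
X = rng.normal(size=(n_patients, n_attributes))
labels = np.array([0] * 10 + [1] * 10)
X[labels == 1, :50] += 3.0

# Centre each attribute so the SVD describes variation around the mean.
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of each patient along the first k singular directions;
# these are the low-dimensional points one would plot or cluster.
k = 3
coords = U[:, :k] * s[:k]

# Fraction of the total variance captured by the first k components.
explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
print(coords.shape, round(explained, 3))
```

Each row of `coords` places one patient in a three-dimensional space, so a dataset with far more attributes than patients can be inspected visually or passed on to a clustering or classification step.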