Challenges Associated with Clinical Studies and the Integration of Gene Expression Data

UNIVERSITY OF SYDNEY

Challenges associated with clinical studies and the integration of gene expression data

by Anna Elizabeth Campain

A thesis submitted in fulfillment for the degree of Doctor of Philosophy in the Faculty of Science, Department of Mathematics and Statistics

October 2012

'What's one and one and one and one and one and one and one and one and one and one and one and one?' 'I don't know', said Alice, 'I lost count.' 'She can't do addition!' said the Red Queen.
Lewis Carroll, Through the Looking Glass (1872)

Dedication

This thesis is dedicated to my family. My mother, for raising me to believe anything is possible, and all things will work out in the end. My father, for teaching me that the reason that all things are possible is through planning and preparation. My sisters, who showed me that being the youngest, dimmest, clumsiest and ugliest is not vexing when you are dearly loved.

I dedicate this thesis also to my beloved husband, Stephen. This journey was not a burden, as I was loved and cared for throughout by you, with your belief in me, your gentleness, patience and prayers. Thank you for encouraging me to continue with my studies and for all our plans that you put on hold while I was working.

And finally this thesis is dedicated to our little darling, who through God's own timing, we still eagerly await. Your beautiful kicking, prodding, hiccups and fidgets have kept me company these last eight and a half months.

Acknowledgements

I would like to thank and acknowledge my wonderful, inspiring and generous PhD supervisor, Dr Jean Yang. I am completely indebted to her for everything that she has done for me during this post-graduate marathon. I have 'highs' to celebrate because of her, and have pushed through the 'lows' believing in her wisdom. At the beginning of my PhD I was excited to work with such a fantastic statistician and mirror at least some of her skill. Now I see her as a brilliantly gifted friend.
I have also had a generous and proficient associate supervisor, Dr Samuel Müller. He has graciously shared his wisdom with me, especially regarding the clinical data side of my work. His wonderful, timely and accurate comments have helped shape this work into something special, which I could not have done on my own. He has spent copious amounts of time shaping me into a better and more careful statistician.

This journey has been made more special with the work and dedication of Professor Terry Speed. Working with such a distinguished statistician was a great privilege. It has been a wonderful and fun experience collaborating closely with him. He spent so much time and energy on me, encouraging me to become a better statistician and bioinformatician.

I have had the great privilege of completing my PhD alongside many talented students in a range of scientific disciplines. To this end I would like to acknowledge and thank Ms Francine Marques and Ms Sarah-Jane Schramm. Francine has humbled me with her dedication to research, as she has delved and explored the genetic recesses of hypertension. As her PhD comes to a close too, I wish her enthusiasm and passion in her post-doc research. I would also like to thank Francine's supervisor, Professor Brian Morris, whose passion for biology infuses his laboratory. The beautiful Sarah-Jane has been a pleasure to work with; she motivates and encourages me with her scientific integrity and thoroughness as well as her deep understanding of what is truly important in life.

Professor Graham Mann and Clinical Professor Richard Scolyer have been incredible. To watch both straddling the clinical world and research arena with such passion has been amazing. It has been a magnificent opportunity to collaborate with these two doctors, while working with the melanoma data.
I would like to thank Associate Professor George Condous and Dr Jennifer Oats for their collaboration on the Early Pregnancy data; this large clinical data set has allowed me to explore many statistical challenges. Their collaboration has allowed me to grow extensively throughout my candidature.

These past few years have been a terrific joy, filled with fun and laughter along with hard work. My fellow statisticians, undergraduates, post-graduates and lecturers have helped to make this time very special. In particular I would like to acknowledge Ellis Patrick; we have become great friends as we have competed for Jean's attention and affection. He is a gifted statistician who will be a great asset to the research community and I look forward to working with him in the future. I would also like to thank Ellis and John Ormerod for their time and kind words while reading, editing and helping me in the final stages of this mammoth project.

Abstract

Doctor of Philosophy

by Anna Elizabeth Campain

The progress of high-throughput biotechnologies has generated a myriad of large and complex data sets in many areas of medical research. The development of such data has had a terrific impact on statistics, producing new challenges and encouraging methodological development. In recent years, medical research has been able to generate both clinical and genomic level data for the same patients. The integration of such data may enable scientists to glean a greater understanding of complex diseases. Before such data types can be combined effectively, there are still many statistical problems associated with the analysis of these two data types separately. This thesis addresses some of these challenges and contributes to the advancement of their solutions.

The very common challenge of missing observations within clinical data is addressed using multiple imputation. However, multiple imputation algorithms are not routinely compared.
To this end, a novel framework for the comparison of multiple imputation methods is developed within this thesis. Three popular multiple imputation methods are compared through a simulation study, highlighting strengths and weaknesses within them all.

Model stability is a statistical problem that is especially prevalent when regression classes are unbalanced or observed clinical variables are rare. An original solution to model instability in a regression context is provided, with stability gained through stratified bootstrap sampling.

Prior to the integration of clinical and gene expression data, further development of statistical methods for the integration of multiple expression studies is required. This thesis examines approaches used to combine expression data sets. Current expression integration approaches are compared with a newly developed method, 'meta Differential Expression via Distance Synthesis'. This thesis also highlights the two main ways data can be combined. The first is the integration of statistics obtained from individual expression data analyses. The second is the integration of the unanalysed expression data, allowing for a united data set in downstream analysis. Both paradigms are explored and advocated in different contexts.

Finally, a melanoma case study is used to highlight the importance of careful analysis of the individual data types, clinical and expression, prior to data integration. The solutions to several of the problems addressed are implemented within this study in combination with some rudimentary integration methods for combining the two types of data. This case study highlights the potential benefits of careful individual analyses of clinical and genomic data as well as the integration of this information when making survival time predictions.

Publications

• A.E. Campain and Y.H. Yang (2010) Comparison study of microarray meta-analysis methods, 11:408, BMC Bioinformatics.
• A.E. Campain, F.Z. Marques, Y.H. Yang and B.J. Morris (2010) Meta-analysis of genome-wide gene expression differences in onset and maintenance phase of genetic hypertension, 56:319-324, Hypertension.
• F.Z. Marques, A.E. Campain, P.J. Davern, Y.H. Yang, G.A. Head and B.J. Morris (2011) Genes influencing circadian differences in blood pressure in hypertensive mice, 6(4):e19203, PLoS One.
• F.Z. Marques, A.E. Campain, P.J. Davern, Y.H. Yang, G.A. Head and B.J. Morris (2011) Global identification of the genes and pathways differentially expressed in hypothalamus in early and established neurogenic hypertension, 43:766-771, Physiological Genomics.
• F.Z. Marques, A.E. Campain, E. Zukowska-Szczechowska, M. Tomaszewski, Y.H. Yang, F.J. Charchar and B.J. Morris (2011) Gene expression profiling reveals renin mRNA overexpression in human hypertensive kidneys and a role for microRNAs, 58:1093-8, Hypertension.
• S.-J. Schramm, A.E. Campain, R. Scolyer, Y.H. Yang and G.J. Mann (2011) Review and cross-validation of gene expression signatures and melanoma prognosis, 132:274-283, Journal of Investigative Dermatology.

Manuscripts under review and in preparation

• A.E. Campain, S. Müller, G. Condous and Y.H. Yang (2011) Stable logistic regression models in the presence of missing values and class imbalances. Under review, Biostatistics.
• G.J. Mann, G.M. Pupo, A.E. Campain, C.A. Carter, S.-J. Schramm, A. Pianova, S. Gerega, C. De Silva, K. Lai, J. Wilmott, M. Synott, P. Hersey, R.F. Kelford, J.F. Thompson, Y.H. Yang and R.A. Scolyer (2011) BRAF mutation, NRAS mutation and absence of an immune-related expressed gene profile predict poor outcome in stage III melanoma. Under review, Journal of Clinical Oncology.
• Y.H. Yang, A.E. Campain and T.P. Speed (2011) Finding differentially expressed genes in microarray data. Preprint.
• J. Riemke, A.E. Campain, T. Bignardi, I. Casikar, D. Alhamdan, D. Fanchon, R. Benzie, S. Müller, Y.H. Yang, M. Mongelli and G. Condous (2011) Development of a new model to predict viability at the end of the 1st trimester after a single visit to an Early Pregnancy Unit.
