Genetic Association Analysis of Complex Diseases Through Information Theoretic Metrics and Linear Pleiotropy
Total Page:16
File Type:pdf, Size:1020Kb
GENETIC ASSOCIATION ANALYSIS OF COMPLEX DISEASES THROUGH INFORMATION THEORETIC METRICS AND LINEAR PLEIOTROPY Dissertation submitted for the degree of Doctor of Philosophy in Biomedical Engineering PhD Student: Helena Brunel Montaner PhD Advisor: Dr. Alexandre Perera i Lluna Universitat Polit`ecnicade Catalunya Programa de Doctorat en Enginyeria Biom`edica Barcelona, 2013 Abstract The main goal of this thesis was to help in the identification of genetic variants that are responsible for complex traits, combining both linear and nonlinear approaches. First, two one-locus approaches were proposed. The first one defined and characterized a novel nonlinear test of genetic association, based on the mutual information measure. This test takes into account the genetic structure of the population. It was applied to the GAW17 dataset and compared to the standard linear test of association. Since the solution of the GAW17 simulation model was known, this study served to characterize the performance of the proposed nonlinear methods in comparison to the linear one. The proposed nonlinear test was able to recover the results obtained with linear methods but also detected an additional SNP in a gene related with the phenotype. In addition, the performance of both tests in terms of their accuracy in classification (AUC) was similar. In contrast, the second approach was an exploratory study on the relationship between SNP variability among species and SNP association with disease, at different genetic regions. Two sets of SNPs were compared, one containing deleterious SNPs and the other defined by neutral SNPs. Both sets were stratified depending on the region where the polymorphisms were located, a feature that may have influenced their conservation across species. It was observed that, for most functional regions, SNPs associated to diseases tend to be significantly less variable across species than neutral SNPs. Second, a novel nonlinear methodology for multiloci genetic association was proposed with the goal of detecting association between combinations of SNPs and a phenotype. The proposed method was based on the mutual information of statistical significance, called MISS. This approach was com- pared with MLR, the standard linear method used for genetic association based on multiple linear regressions. Both were applied as a relevance crite- rion of a new multi-solution floating feature selection algorithm (MSSFFS), proposed in the context of multi-loci genetic association for complex disea- ses. Both were also compared with MECPM, an algorithm for searching predictive multi-loci interactions with a criterion of maximum entropy. The three methods were tested on the SNPs of the F7 gene, and the FVII levels in blood, with the data from the GAIT project. The proposed nonlinear i ii method (MISS) improved the results of traditional genetic association me- thods, detecting new SNP-SNP interactions. Most of the obtained sets of SNPs were in concordance with the functional results found in the litera- ture where the obtained SNPs have been described as functional elements correlated with the phenotype. Third, a linear methodological framework for the simultaneous study of several phenotypes was proposed. The methodology consisted in building new phenotypic variables, named metaphenotypes, that capture the joint activity of sets of phenotypes involved in a metabolic pathway. These new variables were used in further association tests with the aim of identifying ge- netic elements related with the underlying biological process as a whole. As a practical implementation, the methodology was applied to the GAIT project dataset with the aim of identifying genetic markers that could be related to the coagulation process as a whole and thus to thrombosis. Three mathema- tical models were used for the definition of metaphenotypes, corresponding to one PCA and two ICA models. Using this novel approach, already known associations were retrieved but also new candidates were proposed as regu- latory genes with a global effect on the coagulation pathway as a whole. Acknowledgements Writing these words is one of the most emotional moments in my life, because it closes an intense and meaningful period. Such is the gratitude that I feel that I would never have enough space to express it, but I'll try to convey it as concise as possible. I hope everyone will feel appreciated as it deserves and if there are some feelings that I have not had time to organize, I hope to get them somehow. I feel extremely lucky that I enjoyed the constant support of many people during these years so that this thesis is a prize that deserves to be shared with all who have been part of it. Before expressing my gratitude to all those who have offered me help, support and assistance throughout the development of this thesis, let me skip the protocol for a more personal appreciation, without which I do not understand this section. For me, to finish this manuscript is also a proof that I am taking the second chance given to me. And I owe it to three main groups of people. First of all, I do not even know how to express my gratitude and admiration to the infinite generosity of an anonymous family which in a painful situation were able to donate the organs of a deceased relative. I also want to thank and acknowledge the work of the several medical teams that took care of me during the last ten years, In particular, thanks to the cardiologists that treated me before the surgery for keeping me able to carry a normal life for nine years, and to the Cardiac Transplant Unit of the Hospital de Sant Pau which with their excellence have made my recovery as easy and quick as possible. Finally I want to recognize the courage and bravery of my family who have been with me at all times. Their constant support and affection have been essential for me to overcome this process and without them I would never have been able to face it. Having said that, I can now proceed to thank all the people who have contributed directly or indirectly to the development of this thesis. First of all I would like to express my gratitude to my supervisor, Dr. Alexandre Perera Lluna, for his guidance and support. These last six years working under his supervision have made me grow as a researcher and as a person. It has been a privilege to enjoy his scientific contributions and advices and to translate his exceptional ideas into my work. For me, it has been a great honor but also a big responsibility to be his first PhD student and I sincerely hope not having let him down. I am aware that the circumstances iii iv have often been unexpected, unpredictable and even awkward so that I am very grateful to him for his patience and his comprehension at all times. I would also like to thank him for his conviction on the value of this work and for encouraging me and helping me to get over my existential crises and to believe in the culmination of this manuscript. I would also acknowledge him for his concern with my employment and with my future as a researcher. So, thanks to Alex for being my supervisor and, not less important, for being a friend at the same time. I want to express my gratitude to all the people in the ESAII depart- ment at UPC, from the administrative staff to the directors, for their warm welcome and support during this years. I feel proud to have become part of the SISBIO work team, composed of people who stand out for their hu- manity and kindness. A special thank to Prof. Josep Amat for introducing me to the department and to biomedical engineering with the final project of my degree in Mathematics. I also thank Dr. Montserrat Vallverd´ufor introducing me into the world of biomedical signals, and for teaching me in my first research work. It was a very instructive period for me and I have tried to retain her teachings in my work in this thesis. Finally, I would like to thank Prof. Pere Caminal for his constant support and interest in my work and in my health but above all for conferring me the opportunity of carrying out this thesis at his research group with a grant from the IBEC Institute. The financial support of the IBEC (Institut de Bioenginyeria de Catalunya) has been essential for me to be able to devote time and energy to this thesis. I would like to thank Dr. Jos´eManuel Soria and all his team at the Unitat de Gen`omica de Malalties Complexes at Hospital de Sant Pau for his warm welcome at his research group and for his priceless comprehension and pa- tience. I am very grateful to him for waiting for me during my convalescence and I hope I will have time to compensate him with a productive work at his research group. I would also like to thank him and Angel Mart´ınez-P´erez for their straight collaboration in the studies leading to chapter 7 and 8. I would also thank the remaining members of the team for their support in the last months. I specially thank Yorgos for his guidance and support and for correcting parts of this my manuscript. Thanks also to Dr. Joan-Josep Gallardo-Chac´onfor his contribution in chapter 6. I also want to thank him for his guidance in biology. I would like to thank all the remaining colleagues in the g-sisbio group. It has been a pleasure to work with all of them and to receive their valuable comments and viewpoints about my work which in many occasions have helped to my work. I specially thank Raimon and Joan for their collabo- ration in both chapters 6 and 8.