Biomarker Discovery Using Statistical and Machine Learning Approaches on Gene Expression Data
Total Page:16
File Type:pdf, Size:1020Kb
Biomarker Discovery Using Statistical and Machine Learning Approaches on Gene Expression Data Xiaokang Zhang Thesis for the degree of Philosophiae Doctor (PhD) University of Bergen, Norway 2020 Biomarker Discovery Using Statistical and Machine Learning Approaches on Gene Expression Data Xiaokang Zhang ThesisAvhandling for the for degree graden of philosophiaePhilosophiae doctorDoctor (ph.d (PhD). ) atved the Universitetet University of i BergenBergen Date of defense:2017 30.10.2020 Dato for disputas: 1111 © Copyright Xiaokang Zhang The material in this publication is covered by the provisions of the Copyright Act. Year: 2020 Title: Biomarker Discovery Using Statistical and Machine Learning Approaches on Gene Expression Data Name: Xiaokang Zhang Print: Skipnes Kommunikasjon / University of Bergen Scientific environment The work presented in the thesis was funded by University of Bergen and carried out at the Computational Biology Unit (CBU) at the Department of Informatics of University of Bergen. Half a year of my PhD was carried out in the Systems Biology Research Group at the Department of Bioengineering of University of California, San Diego. This thesis was supervised by Professor Inge Jonassen employed at the CBU and Professor Anders Goksøyr employed at the Department of Biological Sciences of Uni- versity of Bergen. I have been affiliated with the National Research School in Bioinformatics, Bio- statistics and Systems Biology (NORBIS), the Digital Life Norway Research School (DLNRS), and the Molecular and Computational Biology Research School (MCB). ii Scientific environment Acknowledgements I am the type of person who does not always have a plan and it is very easy for me to change my mind even if I got one. This PhD was totally out of my plan. Just as Forrest Gump’s mom said, "life was like a box of chocolates. You never know what you’re gonna get". Now I feel very thankful for my decision four years ago. It has been a wonderful time in my life! My PhD is affiliated with project dCod 1.0: decoding the systems toxicology of At- lantic cod (Gadus morhua), which is funded by the Research Council of Norway (grant number 248840). dCod is a big project which involves people from different depart- ments, universities and countries. Collaboration with people from diverse backgrounds was difficult at the beginning but more and more fun came after. I want to thank all the dCoders in Bergen, and special thanks go to Marta Eide, Eileen M. Hanna, and Fekadu Yadetie. Most of my PhD courses are actually from the research schools, NORBIS, DLNRS and MCB. Without those courses (the knowledge and the credits), I could not finish my PhD. Especially NORBIS of which I am also the student representative, it provides me with enormous resources that I have learnt a lot and made lots of friends. I feel so lucky to be part of the CBU, with a friendly atmosphere, helpful colleagues, and broad research topics. With experts in different fields, I can always find help what- ever research or technical problems I am faced with. Special thanks go to Fatemeh Z. Ghavidel. The Department of Informatics and University of Bergen are also very important to my PhD in providing me such a perfect research environment. I would like to shout to my supervisor, Inge Jonassen, thank you so much! You are the best! With many titles on the shoulder, he is very busy with many different projects. But he would always ensure enough time for me, following and helping on my research, caring and solving my problems whether research related or not. And my thanks also go to my co-supervisor, Anders Gokøøyr, who is also the leader of dCod iv Acknowledgements 1.0 project. Thanks to him, the project went well and so did my PhD. I am grateful to get to know so many amazing friends in the past four years in Norway and San Diego, US. You made my life wonderful and joyful. And I also want to thank the old friends who have accompanied me through different stages of my life. Especially those few best friends, you make me believe that no matter where I am, no matter what difficulty I am facing, you are always there for me. Only with that feeling of safety, can I explore my life more bravely. Last but not least, I would like to thank my family for their generous love. Thanks to my sister taking care of our mom, so that I can study and work far away from home without worry. Special thanks to my dad, who had always wanted me to do a PhD, but also respected my choice when I told him that I did not want to. Even though it is impossible, but I wish he would know that I later on changed my mind and actually did the PhD, part of the reason for him. (最后我要感谢我家人给予我的无私的爱。因 为有姐姐照看母亲,我才能在千里之外安心学习工作。特殊的感谢致于我的父 亲。他一直以来都想让我读博士,但是当我提出并不想继续读博士时,他完全 尊重我的个人选择。虽已成空,但我多希望他能知道我后来改变了主意,如他 所愿读了博士,其中部分原因来自于对他的爱。) Xiaokang Zhang Bergen, July 2020 Summary My PhD is affiliated with the dCod 1.0 project (https://www.uib.no/en/dcod): decoding the systems toxicology of Atlantic cod (Gadus morhua), which aims to better under- stand how cods adapt and react to the stressors in the environment. One of the research topics is to discover the biomarkers which discriminate the fish under normal biological status and the ones that are exposed to toxicants. A biomarker, or biological marker, is an indicator of a biological state in response to an intervention, which can be for example toxic exposure (in toxicology), disease (for example cancer), or drug response (in precision medicine). Biomarker discovery is a very important research topic in toxicology, cancer research, and so on. A good set of biomarkers can give insight into the disease / toxicant response mechanisms and be useful to find if the person has the disease / the fish has been exposed to the toxicant. On the molecular level, a biomarker could be "genotype" - for instance a single nu- cleotide variant linked with a particular disease or susceptibility; another biomarker could be the level of expression of a gene or a set of genes. In this thesis we focus on the latter one, aiming to find out the informative genes that can help to distinguish sam- ples from different groups from the gene expression profiling. Several transcriptomics technologies can be used to generate the necessary data, and among them, DNA mi- croarray and RNA sequencing (RNA-Seq) have become the most useful methods for whole transcriptome gene expression profiling. Especially RNA-Seq has become an attractive alternative to microarrays since it was introduced. Prior to analysis of gene expression, the RNA-Seq data needs to go through a series of processing steps, so a workflow which can automate the process is highly required. Even though many workflows have been proposed to facilitate this process, their ap- plication is usually limited to such as model organisms, high-performance computers, computer fluent users, and so on. To fill these gaps, we developed a maximally general RNA-Seq analysis workflow: RNA-Seq Analysis Snakemake Workflow (RASflow), which is applicable to a wide range of applications and requires little programming skills. It takes the sequencing data as input, and maps them to either transcriptome or genome for quantification, and after that the gene expression profile can be achieved vi Summary which afterwards goes through normalization and statistical tests to find out the differ- entially expressed genes. This work was presented in Paper I and Paper II. Differential expression analysis used in RASflow, together with other univariate methods are widely used in biomarker discovery for their simplicity and interpretabil- ity. But they rely on a hypothesis that variables are independent, so they can only iden- tify variables that possess significant information by themselves. However, biological processes usually involve many variables that have complex interactions. Multivariate methods which take the interactions between variables into consideration are therefore also popular for biomarker discovery. To study whether there is a significant advantage of one over the other, we conducted a comparative study of various methods from these two categories and evaluated these methods on two aspects: stability and prediction ac- curacy, we found that a method’s performance is quite data-dependent. This work was presented in Paper III. Since the biomarker discovery methods perform quite differently on different datasets, then how to choose the most appropriate one for a particular dataset? One solution is to use the function perturbation strategy to combine the outputs from multi- ple methods. Function perturbation is capable of maintaining prediction accuracy com- pared with the original individual methods, but its stability is not satisfactory enough. On the other hand, data perturbation uses a similar ensemble learning logic: it firstly generates multiple datasets by resampling the original dataset and then combines the re- sults from those datasets. Data perturbation has been proven to improve the stability of the biomarker discovery method. We therefore proposed a framework which combines function perturbation with data perturbation: Ensemble Feature Selection Integrating Stability (EFSIS) which achieves both high prediction accuracy and stability. This work was presented in Paper IV. List of Publications Publications included in the thesis (I) Yadetie, F., Zhang, X., Hanna, E. M., Aranguren-Abadía, L., Eide, M., Blaser, N., Brun, M., Jonassen, I., Goksøyr, A., & Karlsen, O. A. (2018). RNA-Seq analysis of transcriptome responses in Atlantic cod (Gadus morhua) precision- cut liver slices exposed to benzo[a]pyrene and 17a-ethynylestradiol. Aquatic Toxicology, 201, 174-186. (II) Zhang, X., & Jonassen, I. (2020). RASflow: an RNA-Seq analysis workflow with Snakemake. BMC Bioinformatics, 21(1), 1-9. (III) Zhang, X., & Jonassen, I. (2019). A Comparative Analysis of Feature Selection Methods for Biomarker Discovery in Study of Toxicant-Treated Atlantic Cod (Gadus morhua) Liver.