
Mälardalen University Doctoral Dissertation 286

Graph theory based approaches for gene prioritization in biological networks
Application to cancer gene detection in medulloblastoma

Hrafn Holger Weishaupt

2019

ISBN 978-91-7485-420-6
ISSN 1651-4238

Address: P.O. Box 883, SE-721 23 Västerås. Sweden
Address: P.O. Box 325, SE-631 05 Eskilstuna. Sweden
E-mail: [email protected]
Web: www.mdh.se

For Margarete & Harald

So much universe, and so little time.
- Sir Terry Pratchett

Acknowledgments

I wish to express my sincere gratitude to all who have supported me during this venture; without you this work would not have been possible.

First and foremost, I would like to convey my special appreciation and thanks to my supervisors: To my main supervisor Sergei Silvestrov, this work would not have been possible without you. Thank you for accepting me as a PhD student and for providing me with the fantastic opportunity to pursue our shared research and to grow as a scientist. I am very grateful for all your support, guidance, and patience; it was certainly not easy with me being located in Uppsala and preoccupied with my regular work duties. I am looking forward to a continued collaboration and interesting interdisciplinary research projects. To my co-supervisor Anatoliy Malyarenko, we did not have much contact during my PhD studies, but I am very thankful that you supported my education and for the security of knowing that you would have been there in case of any complications. A very special appreciation goes to my co-supervisor Fredrik Swartling - you have been a tremendous mentor for me. Thank you for welcoming me into your lab, for allowing me to pursue the PhD studies in mathematics, for encouraging my diverse research interests and for helping me to grow as a research scientist. I am immensely grateful for all that you have taught me about medulloblastoma, brain biology, and research in general; your advice and guidance during these past years have been priceless. Last but not least, Christopher Engström, thank you for your unceasing help on all the small and big issues during my studies, for all the valuable scientific collaborations, for sharing so many adventures during conferences, and for being a wonderful colleague.

Special thanks also go to all collaborators who allowed me to participate in their research: Olle Sangfelt, Aldwin Suryo Rahmanto, and Andrä Brunner, thank you for many fruitful discussions and shared work on SOX9 and FBW7. Karin Forsberg Nilsson and Anqi Xiong, thank you for including me in an interesting collaboration about candidate gene screening in glioma. Margareta Wilhelm, thank you for an interesting collaboration on Gorlin syndrome and medulloblastoma. Lars-Gunnar Larsson and Wesam Bazzar, I am grateful for our many discussions about MYC proteins and cancer and that you allowed me to participate in your research; I am looking forward to a continued and prosperous collaboration. Elena Tchougounova and Ananya Roy, thank you for including me in your research and for the interesting collaboration on mast cells. William Weiss and Miller Huang, thank you for entrusting me with your RNA-seq analyses; I am looking forward to a continued collaboration and scientific exchange. Helena Jernberg Wiklund and Antonia Kalushkova Nair, thank you so much for all your advice and support on ChIP-seq. Last but not least, Lene Uhrbom, Smitha Sreedharan, Naga Pratyusha Maturi, and Yuan Xie, thank you for an interesting collaboration on cells of origin in glioma.

To all members of the Mathematics and Applied Mathematics (MAM) research environment at Mälardalen University: thank you for creating such a fantastic platform for science and education. If not for my work duties in Uppsala, I would have loved to be more involved in your research, seminars and conferences. I would like to express my appreciation to all the teachers that have guided me through the postgraduate studies at Mälardalen University and who have facilitated such a wonderful learning environment. Last but not least, thank you Karl Lundengård and Jonas Österberg for all the support and shared experiences during our PhD studies; you have been great colleagues.

I would like to express my deep gratitude to all past and present members of the Swartling group: Vasil, Sara, Sanna, Matko, Gabriela, Anna, Anders, Sonja, Oliver, Géraldine, Tobias, and Karl. You have been wonderful colleagues and friends and I have greatly appreciated all the shared time in the lab, at retreats, dinners, meetings, and conferences. You have been the best colleagues that I could have wished for and have become dear friends. Sara, I think we still had some fishing trips planned, right? Matko, we joined the group almost at the same time and you have been a dear friend ever since; thank you for all the fun we have shared; I am going to miss you in the lab, but we will definitely spend more time outside of work. Anders, thank you for being a wonderful friend and colleague and for always lending me some support whenever I was drowning in workload.

Special thanks also go to Sven Nelander’s group, and particularly to Patrik. This endeavor would not have been possible without our fantastic collaboration. I am immensely grateful for the continuous support and advice that you have lent me during these past years. Anders and Patrik, it has been a great time sharing an office (djungelrummet) with you. Thank you for the wonderful working environment and for always being open to discussing questions and providing aid. Ida, we have not known each other for a long time, but it was really pleasant sharing an office with you. I wish you all the best for your PhD studies. Satishkumar, thank you for all the great times when sharing an office and for teaching me some Indian food recipes.

I would also like to express my gratitude to all the past and present members of the neurooncology section at IGP. Thank you for creating such a welcoming work environment, for all the valuable feedback during our NO seminars, and simply for being wonderful colleagues and friends.

A special note of appreciation goes to my friends outside of work. Daniel Sorobetea, I dearly miss our times inside and outside the lab; I really hope we can meet more frequently again, once you are back from your postdoc. Erik Cederberg, it has once again been far too long since our last meeting and our trip to Scotland. I hope that I can visit you in Kalmar again soon. Johannes Toelke, thank you for the many years of loyal friendship; it is about time that you come to visit me in Sweden again. Stephan Menze, we got along so well from the very beginning and have already gotten up to so much mischief together; I hope we will soon have more time for that again.

Last but not least, I would like to thank Wera and my family. Wera, thank you for always being there for me, for supporting me and believing in me; thank you for your tireless patience whenever I am once again at risk of drowning in work and stress; thank you for being my anchor when I need an anchor; thank you for being just the way you are. Mom and Dad, words cannot express how grateful I am for your inexhaustible love and your unshakable faith in me. Without you, this work and the long road leading here would not have been possible. You were always forbearing when I was once again buried up to my ears in work for weeks on end, you have always supported me in all my decisions, you gave me nothing but the very best from the start, and you were always there for me when I needed you. No matter how far apart we live, you are always in my heart. Thank you for everything! It will soon be time for our tour of Scandinavia. Thorsten, thank you for always believing in me and for always wishing me only the best; thank you for all the wonderful trips and adventures that we have already shared. A long list of places still awaits us, waiting to be explored and fished. Thank you also to my grandparents, aunts, uncles, and cousins for your faith in me and all your good wishes.


Popular science summary

Networks make it possible to model relationships between objects in an intuitive and adaptable way. When translated into mathematical graphs, they become amenable to a multitude of mathematical operations that enable a detailed study of the underlying data patterns. It is therefore not surprising that networks have developed into the foremost method for data analysis in a wide range of research areas. However, with increasing problem complexity, the application of network modeling also becomes more challenging and several questions arise. Specifically, depending on the process to be studied: (i) which interactions are important and how can they be modeled, (ii) how can relationships be inferred from complex and potentially noisy data, and (iii) which methods should be used to test hypotheses or answer relevant questions? This thesis examines concepts and challenges in network analysis within the scope of a well-defined field of application, namely the prediction of cancer genes from biological networks, applied to medulloblastoma research.

Medulloblastoma is the most common malignant brain tumor in children. Currently, 70% of treated patients survive, but the treatment is often followed by permanent cognitive impairment. Medulloblastoma has previously been shown to comprise at least four distinct molecular subgroups. Furthermore, studies of these subgroups have greatly advanced our understanding of the genetic aberrations present in the tumor cells. Translating this understanding into new and improved treatment options requires further insight into how the aberrant genes interact with the rest of the cellular system, how such interactions can drive tumor development, and how the resulting tumorigenic processes can be affected by different drugs. Building such a knowledge base requires mapping biological processes at the systems level. A popular method for studying this is network analysis of molecular interactions.

This thesis deals with the application of biological network analysis to the identification of cancer genes in medulloblastoma and cancer in general, with a specific focus on so-called gene regulatory networks. These are networks that model relationships between genes with respect to how they are expressed in the cell. The thesis discusses how biological and mathematical network concepts can be connected, and more specifically addresses the computational problems involved in inferring such networks from molecular data. Mathematical methods for analyzing these networks are outlined, and it is investigated how such methods can be affected by network inference. The main focus lies on handling the statistical challenges of creating a gene expression dataset suitable for network inference in medulloblastoma (MB). The thesis concludes with an application of different network methods in a hypothesis-generating study for MB, in which new candidate genes were prioritized.


List of papers

The thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I. Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2016). “Graph centrality based prediction of cancer genes”. In: Engineering Mathematics II: Algebraic, Stochastic and Analysis Structures for Networks, Data Classification and Optimization. Ed. by Sergei Silvestrov and Milica Rancic, pp. 275-311.

II. Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2017). “Loss of conservation of graph centralities in reverse-engineered transcriptional regulatory networks”. Methodology and Computing in Applied Probability, 19(4), 1089-1105.

III. Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2016). “Prediction of high centrality nodes from reverse-engineered transcriptional regulator networks”. In: SMTDA 2016 Proceedings: 4th Stochastic Modeling Techniques and Data Analysis International Conference. Ed. by Christos H. Skiadas. ISAST: International Society for the Advancement of Science and Technology, pp. 517-531.

IV. Holger Weishaupt, Patrik Johansson, Anders Sundström, Zelmina Lubovac-Pilav, Björn Olsson, Sven Nelander, Fredrik J. Swartling. (2019). “Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes”. Bioinformatics, epub ahead of print.

V. Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2019). “Prioritization of candidate cancer genes on chromosome 17q through reverse-engineered transcriptional regulatory networks in medulloblastoma groups 3 and 4”. Manuscript.

Permission from the respective publishers was obtained for the reuse of articles in this thesis.


Other papers by the author

The following papers were also published during the course of the author’s PhD education, but are not discussed in this thesis.

2018:

Bolin S., Borgenvik A., Persson C.U., Sundström A., Qi J., Bradner J.E., Weiss W.A., Cho Y.-J., Weishaupt H., Swartling F.J. (2018). “Combined BET bromodomain and CDK2 inhibition in MYC-driven medulloblastoma”. Oncogene, 37(21), 2850.

2017:

Roy, A., Attarha, S., Weishaupt, H., Edqvist, P.-H., Swartling, F. J., Bergqvist, M., Siebzehnrubl, F. A., Smits, A., Pontén, F., and Tchougounova, E. (2017). “Serglycin as a potential biomarker for glioma: association of serglycin expression, extent of mast cell recruitment and glioblastoma progression”. Oncotarget 8(15): 24815-24827.

Roy, A., Libard, S., Weishaupt, H., Gustavsson, I., Uhrbom, L., Hesselager, G., Swartling, F. J., Pontén, F., Alafuzoff, I., and Tchougounova, E. (2017). “Mast Cell Infiltration in Human Brain Metastases Modulates the Microenvironment and Contributes to the Metastatic Potential”. Frontiers in Oncology 7: 115-115.

Weishaupt, H., Čančer, M., Engström, C., Silvestrov, S., Swartling, F.J. (2017). “Comparing the Landscapes of Common Retroviral Insertion Sites across Tumor Models”. In: AIP Conference Proceedings, Vol. 1798, No. 1, p. 020173.

Sreedharan, S., Maturi, N.P., Xie, Y., Sundström, A., Jarvius, M., Libard, S., Alafuzoff, I., Weishaupt, H., Fryknäs, M., Larsson, R., Swartling, F.J., Uhrbom, L. (2017). “Mouse models of pediatric supratentorial high-grade glioma reveal how cell-of-origin influences tumor development and phenotype”. Cancer Research, 77(3), 802.

2016:

Suryo Rahmanto, A., Savov, V., Brunner, A.§, Bolin, S.§, Weishaupt, H.§, Malyukova, A., Rosen, G., Cancer, M., Hutter, S., Sundstrom, A., Kawauchi, D., Jones, D.T., Spruck, C., Taylor, M.D., Cho, Y.J., Pfister, S.M., Kool, M., Korshunov, A., Swartling, F.J.#, Sangfelt, O.# (2016). “FBW7 suppression leads to SOX9 stabilization and increased malignancy in medulloblastoma”. EMBO J. 35, 2192-2212.

Truve, K., Dickinson, P., Xiong, A., York, D., Jayashankar, K., Pielberg, G., Koltookian, M., Muren, E., Fuxelius, H.H., Weishaupt, H., Swartling, F.J., Andersson, G., Hedhammar, A., Bongcam-Rudloff, E., Forsberg-Nilsson, K., Bannasch, D., Lindblad-Toh, K. (2016). “Utilizing the Dog Genome in the Search for Novel Candidate Genes Involved in Glioma Development - Genome Wide Association Mapping followed by Targeted Massive Parallel Sequencing Identifies a Strongly Associated Locus”. PLoS Genet. 12, e1006000.

Sitnik, K.M., Wendland, K., Weishaupt, H., Uronen-Hansson, H., White, A.J., Anderson, G., Kotarsky, K., Agace, W.W. (2016). “Context-Dependent Development of Lymphoid Stroma from Adult CD34(+) Adventitial Progenitors”. Cell Rep 14, 2375-2388.

2015:

Swartling, F.J., Cancer, M., Frantz, A., Weishaupt, H., Persson, A.I. (2015). “Deregulated proliferation and differentiation in brain tumors”. Cell Tissue Res. 359, 225-254.

2014:

Hede, S.M., Savov, V., Weishaupt, H., Sangfelt, O., Swartling, F.J. (2014). “Oncoprotein stabilization in brain tumors”. Oncogene 33, 4709-4721.

§, #: Authors contributed equally to the work.


List of abbreviations

ACC      ACCuracy
ANOVA    ANalysis Of VAriance
ARACNE   Algorithm for the Reverse engineering of Accurate Cellular NEtworks
auROC    area under the Receiver-Operator-Characteristic curve
auPR     area under the Precision-Recall curve
BC       Betweenness Centrality
CC       Clustering Coefficient
ChIP     Chromatin ImmunoPrecipitation
CLR      Context Likelihood of Relatedness
CNV      Copy Number Variation
DC       Degree Centrality
DE       Differential Equation
DEG      Differentially Expressed Gene
DK       Diffusion Kernel
DN       Direct Neighbor
FN       False-Negative
FP       False-Positive
GBA      Guilt-By-Association
GENIE3   GEne Network Inference with Ensemble of Trees
GRN      Gene Regulatory Network
GWAS     Genome Wide Association Studies
HC       Hierarchical Clustering
MDS      Multi-Dimensional Scaling
MI       Mutual Information
ncRNA    Non-Coding RNA
NIMEFI   Network Inference using Multiple Ensemble Feature Importance algorithms
ODE      Ordinary Differential Equation
PCA      Principal Component Analysis
PDE      Partial Differential Equation
PR       PRecision
RF       Representation Factor
RLE      Relative Log Expression
RMD      Relative Mean absolute Deviation
ROC      Receiver-Operator-Characteristic
RUV      Removal of Unwanted Variation
RWR      Random Walks with Restart
SDE      Stochastic Differential Equation
SHH      Sonic HedgeHog
SNP      Single Nucleotide Polymorphism
SP       Shortest Path
TF       Transcription Factor
TIGRESS  Trustful Inference of Gene REgulation using Stability Selection
TN       True-Negative
TNR      True-Negative Rate
TP       True-Positive
TPR      True-Positive Rate
TRN      Transcriptional Regulatory Network
WGCNA    Weighted Gene Co-expression Network Analysis
WHO      World Health Organization
WNT      Wingless/Integrated


Contents

Acknowledgments
Populärvetenskaplig sammanfattning
List of papers
List of abbreviations

1 Introduction
  1.1 Networks and graph theory
    1.1.1 Mathematical graphs
    1.1.2 Analyzing networks
      1.1.2.1 Global topology/connectivity of networks
      1.1.2.2 Local topology/connectivity of networks
      1.1.2.3 Network clustering
      1.1.2.4 Random walks and diffusion
    1.1.3 Challenges and problems
  1.2 Cancer
    1.2.1 Medulloblastoma
      1.2.1.1 Classification and molecular subgroups
      1.2.1.2 Driver genes and pathways
      1.2.1.3 Current problems and future perspectives
    1.2.2 Cancer genes
      1.2.2.1 Oncogenes
      1.2.2.2 Tumor suppressor genes
      1.2.2.3 Stability genes
    1.2.3 Discovery and prioritization of candidate cancer genes
      1.2.3.1 Candidate gene discovery
      1.2.3.2 Cancer gene prioritization
  1.3 The network approach
    1.3.1 Gene regulatory networks
      1.3.1.1 Modeling and inference of GRNs
      1.3.1.2 Simulation of gene expression from networks
      1.3.1.3 Validation of network inference methods
    1.3.2 Network-based prediction of cancer genes
      1.3.2.1 Proximity-based methods
      1.3.2.2 Clustering-based methods
      1.3.2.3 Centrality-based methods

2 The present investigation
  2.1 Paper I
    2.1.1 Summary
  2.2 Paper II
    2.2.1 Background and aims
    2.2.2 Material and methods
    2.2.3 Results and discussions
  2.3 Paper III
    2.3.1 Context
    2.3.2 Background and aims
    2.3.3 Material and methods
    2.3.4 Results and discussions
  2.4 Paper IV
    2.4.1 Context
    2.4.2 Background and aims
    2.4.3 Material and methods
    2.4.4 Results and discussion
  2.5 Paper V
    2.5.1 Context
    2.5.2 Background and aims
    2.5.3 Material and methods
    2.5.4 Results and discussion
    2.5.5 Future perspectives

References


1. Introduction

Networks are an integral component of the world surrounding us. They can be found as physical entities, e.g. in the form of power grids, transportation networks, communication networks, or the brain as a natural information processor. Furthermore, recent decades have also seen an explosion of abstract networks used to model a wide range of phenomena, which at first glance might not always appear to resemble actual, physical networks.

For instance, in the discipline of sociology, abstract network models are frequently employed to study various aspects of social structures on the level of e.g. social interactions or relationships [1]. In biological sciences, networks have been established to model a multitude of different molecular processes or relationships [2]. Inspired by the architecture of the biological nervous system, the field of theoretical neuroscience has developed a variety of artificial neural networks in order to model or perform brain-like information processing tasks [3]. Numerous other examples can be found, e.g. in ecological networks, citation networks, or networks for modeling road traffic, a detailed listing of which would however exceed the scope of this introduction.

The allure of networks as one of the dominant choices of modeling systems stems from at least three factors, i.e. (i) their adaptability, which makes them applicable to a wide range of problems, (ii) their capability to visualize systems of relationships, and (iii) the plethora of established methodology for their analysis. However, as more scientific fields explore networks as a tool to study data and processes, it has also become clear that more research is crucial to bridge the gap between theory and application. Specifically, as research shifts towards ever more complex problems, network modeling becomes more challenging in terms of data acquisition, understanding how such data should best be modeled, or which types of analyses need to be performed in order to further our understanding of such data.

This thesis will (i) review briefly what networks are and how they can be utilized, (ii) introduce cancer as an example of an application area which could benefit from network analyses, (iii) discuss one particular line of research, i.e. cancer gene prioritization, including considerations about potential problems and promises associated with network modeling, and (iv) conclude with a presentation of the present work, which addresses the application of network analysis in the outlined context.


1.1 Networks and graph theory

Depending on the scientific field or subject, definitions of what a network is and how it is applied might differ vastly. However, at the core of the majority, if not all, of such designs, networks can be understood as a collection of some objects and links connecting them.

Mathematics provides a more thorough approach to defining networks and characterizing their properties. Specifically, in mathematics and in particular the field of graph theory, networks are usually referred to as graphs (from the Greek “-graphos”, meaning something that is “drawn” or “written”). Graph theory as a whole then denotes the mathematical discipline that is concerned with the study of such structures and the modeling of relationships between objects.

The current section will start off by reviewing the definition of mathematical graphs and general avenues for their analysis, and conclude with an outline of some of the challenges associated with network modeling.

1.1.1 Mathematical graphs

A mathematical graph is defined as a pair $G(V, E)$ with a set of vertices (objects)

$$V = \{v_i\}, \quad i = 1, 2, 3, \dots, N,$$

which are joined by a set of edges (links/relationships)

$$E = \{e(i, j)\}, \quad i, j \in \{1, 2, 3, \dots, N\},$$

where $e(i, j)$ represents an edge between vertex $v_i$ and vertex $v_j$.

Such a graph is often represented in matrix form as an unweighted or weighted $N \times N$ adjacency matrix $A = (a_{ij})$. In an unweighted adjacency matrix $a_{ij} = 1$ if $e(i, j) \in E$, i.e. if there is an edge from vertex $v_i$ to vertex $v_j$, and $a_{ij} = 0$ otherwise. In a weighted adjacency matrix, the entries $a_{ij}$ can take on other values representing for instance the strength, relevance, or confidence of an edge between vertices $v_i$ and $v_j$.

Furthermore, graphs can be undirected (Fig. 1.1A), in which case $e(i, j)$ is directionless and means the same as $e(j, i)$ (i.e. that vertices $v_i$ and $v_j$ share a connection), or directed (Fig. 1.1B), in which case the edges have a direction such that $e(i, j)$ signifies a connection from source $v_i$ to target $v_j$. For an undirected graph the adjacency matrix is symmetric, $a_{ij} = a_{ji}$, while the adjacency matrix of a directed graph can be asymmetric. Assuming that there are no self-loops, i.e. edges connecting a vertex to itself, the maximum number of edges in directed and undirected graphs is thus

$$\max|E| = \begin{cases} \dfrac{N(N-1)}{2}, & \text{if } G \text{ is undirected,} \\ N(N-1), & \text{if } G \text{ is directed.} \end{cases}$$

Figure 1.1: Illustration of different graph architectures. A) Undirected, connected graph. B) Directed, weakly connected graph. C) Undirected, disconnected graph. D) Undirected, complete graph.
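The definitions of this section can be illustrated with a small sketch. It is not taken from the thesis: the function names (`adjacency_matrix`, `max_edges`, `is_symmetric`) and the four-vertex example are hypothetical, and vertices are indexed from 0 rather than from 1 as in the text.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an unweighted n x n adjacency matrix from (i, j) index pairs."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1          # a_ij = 1 if there is an edge from v_i to v_j
        if not directed:
            a[j][i] = 1      # undirected: e(i, j) means the same as e(j, i)
    return a


def max_edges(n, directed=False):
    """Maximum number of edges in a graph on n vertices without self-loops."""
    return n * (n - 1) if directed else n * (n - 1) // 2


def is_symmetric(a):
    """Check whether a_ij == a_ji holds for all index pairs."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))


# Hypothetical example: the same three edges read as undirected and as directed.
edges = [(0, 1), (1, 2), (2, 3)]
undirected = adjacency_matrix(4, edges)
directed = adjacency_matrix(4, edges, directed=True)
```

For this example `is_symmetric(undirected)` holds while `is_symmetric(directed)` does not, mirroring the statement that only undirected graphs are guaranteed a symmetric adjacency matrix.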

In a graph, a sequence $v^0, v^1, v^2, \dots, v^T$ of vertices, where for every consecutive pair of vertices $v^t = v_i$ and $v^{t+1} = v_j$ with $0 \le t < T$ there is a corresponding edge $e(i, j) \in E$, is referred to as a walk of length $T$ and describes a connection between a source vertex $v_a = v^0$ and a target vertex $v_b = v^T$ over connecting edges in the network. If no vertex or edge occurs more than once in such a walk, it is referred to as a path. The length of a path is equal to the number of edges traversed, and the distance $d(i, j)$ from vertex $v_i$ to vertex $v_j$ is then simply the length of the shortest path connecting the vertices. If there is no path between two vertices then one often sets the corresponding distance $d(i, j) = \infty$.

An undirected graph that includes pairs of vertices without a path between them is referred to as disconnected (Fig. 1.1C). If on the other hand there is a path from every vertex $v_i$ to every other vertex $v_j$, an undirected graph is said to be connected (Fig. 1.1A). A directed network is said to be weakly connected, if there is an undirected path between each pair of vertices (Fig. 1.1B), strongly connected, if there is a directed path from each vertex to each other vertex, and disconnected otherwise. Finally, a graph with an edge from each vertex to each other vertex is said to be complete (fully connected) (Fig. 1.1D).
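As a minimal sketch of how distances and connectedness could be computed in an unweighted graph, the following uses breadth-first search. The adjacency-list representation and the names `distances_from` and `is_connected` are assumptions made for illustration only; the thesis does not prescribe an algorithm here.

```python
from collections import deque
import math


def distances_from(adj, source):
    """Breadth-first search: d(source, j) for every vertex j.

    adj maps each vertex to the list of vertices reachable over one edge.
    Vertices without a path from the source keep the distance math.inf.
    """
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == math.inf:      # first visit yields the shortest path
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_connected(adj):
    """True if every vertex can reach every other vertex.

    For an undirected graph this is connectedness; for a directed
    adjacency list it corresponds to strong connectedness.
    """
    return all(math.inf not in distances_from(adj, v).values() for v in adj)


# Hypothetical undirected examples: a path on four vertices and two separate edges.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
split_graph = {0: [1], 1: [0], 2: [3], 3: [2]}
```

Here `distances_from(split_graph, 0)[2]` is `math.inf`, which matches the convention of setting $d(i, j) = \infty$ when no path exists.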

1.1.2 Analyzing networks

With graphs as the underlying model of relationships between objects, mathematics provides a wealth of approaches for studying such data. Specifically, to give a superficial overview, some of the categories of established methods and metrics allow the study of the global or local topology of networks, the clustering of networks, or the propagation of information in networks.

1.1.2.1 Global topology/connectivity of networks

Studies of the global topology of graphs shed light on their overall organization, such as (i) the overall connectivity of the network, (ii) the distribution of edges across vertices, (iii) the degree of clustering in the network, or (iv) the distribution of path lengths in the network. For instance, the overall connectivity of the network can be represented by the edge density

$$\rho = \frac{|E|}{\max|E|},$$

where $\rho = 1$ implies a complete graph, while a graph with $\rho \ll 1$ can be considered sparse. Furthermore, we can define the diameter of the graph as the length of the longest shortest path, i.e.

$$D = \max_{i,j} d(i, j),$$

and the mean path length

$$L = \sum_{i,j} \frac{d(i, j)}{\max|E|}.$$
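The three global metrics can be sketched for a small example as follows. This is an illustration under stated assumptions rather than the thesis's implementation: the graph is taken to be undirected and connected, the sums run over unordered vertex pairs so that the number of terms equals $\max|E|$, and the names `bfs_distances` and `global_metrics` are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def global_metrics(adj):
    """Edge density, diameter and mean path length.

    Assumes an undirected, connected graph given as an adjacency list
    in which every edge appears in the lists of both of its endpoints.
    """
    n = len(adj)
    max_e = n * (n - 1) // 2
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    dist = {v: bfs_distances(adj, v) for v in adj}
    # Assumption: one distance per unordered pair {i, j}.
    pair_d = [dist[i][j] for i, j in combinations(adj, 2)]
    return {
        "density": n_edges / max_e,
        "diameter": max(pair_d),
        "mean_path_length": sum(pair_d) / max_e,
    }


# Hypothetical example: a path on four vertices.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
metrics = global_metrics(path_graph)
```

For the path on four vertices, three of the six possible edges are present, the longest shortest path has three edges, and the six pairwise distances sum to ten.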


Figure 1.2: Illustration of different graph topologies. A) Small-world graph. B) Scale-free graph. C) Random $G(n, p)$ graph.

If $L$ grows sufficiently slowly, i.e. if $L \propto \ln(N)$, the graph is said to represent a small-world network [4, 5] (Fig. 1.2A). The small-world property implies that any target vertex $v_b$ can be reached from a source vertex $v_a$ by traversing only a small number of edges.

Another global property of the graph is described by its degree distribution. Specifically, vertex $v_j$ is a neighbor of vertex $v_i$, if $e(i, j) \in E$, and the number of neighbors that a vertex has is then referred to as the degree or degree centrality (DC) of that vertex. Assuming an unweighted graph (although a generalization to weighted networks is also possible), the DC of vertex $v_i$ in an undirected network is computed as

$$DC(v_i) = \sum_{j=1}^{n} a_{ij} = \sum_{j=1}^{n} a_{ji},$$

while in a directed network there is a distinction between in-degree ($DC_{in}$; only incoming edges are counted) and out-degree ($DC_{out}$; only outgoing edges are counted), respectively defined as

$$DC_{in}(v_i) = \sum_{j=1}^{n} a_{ji}, \qquad DC_{out}(v_i) = \sum_{j=1}^{n} a_{ij}.$$

Now, if the probability $P(DC(v_i) = k)$, i.e. the probability that a vertex $v_i$ in the graph exhibits a degree centrality $DC = k$, can be modeled by a power-law distribution

$$P(DC = k) \sim k^{-\gamma},$$

the graph is said to be scale free [6] (Fig. 1.2B), which implies that the majority of vertices in the network have very few incident edges, while few vertices have a large number of incident edges. In random networks on the other hand, vertices tend to have similar degree values distributed around a mean degree $\langle k \rangle$. For instance, a popular type of random network, proposed by Gilbert [7] and denoted as $G(n, p)$, is constructed with $n$ vertices $V = \{v_1, v_2, \dots, v_n\}$, where each possible edge is included with probability $p$ (Fig. 1.2C). Following the description provided by Barabási [8], in such a random graph the distribution of degree centralities can instead be modeled by a binomial distribution [8]

$$P(DC = k) = \binom{n-1}{k} p^{k} (1 - p)^{n-1-k},$$

or in the typical case $\langle k \rangle \ll n$ by the Poisson distribution [8]

$$P(DC = k) = e^{-\langle k \rangle} \frac{\langle k \rangle^{k}}{k!}.$$
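The degree formulas and the $G(n, p)$ construction can be sketched in a few lines. The code below is only an illustration: the helper names, the fixed random seed and the parameter values $n = 200$, $p = 0.05$ are arbitrary assumptions, and the sampled graph is undirected.

```python
import math
import random


def degree_centralities(a):
    """Return (in-degree, out-degree) lists for an adjacency matrix a."""
    n = len(a)
    dc_out = [sum(a[i][j] for j in range(n)) for i in range(n)]  # row sums
    dc_in = [sum(a[j][i] for j in range(n)) for i in range(n)]   # column sums
    return dc_in, dc_out


def gilbert_graph(n, p, rng):
    """Sample an undirected G(n, p) adjacency matrix without self-loops."""
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:         # include each possible edge with probability p
                a[i][j] = a[j][i] = 1
    return a


def binomial_pmf(k, n, p):
    """P(DC = k) for G(n, p): each vertex has n - 1 potential neighbors."""
    return math.comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k)


def poisson_pmf(k, mean_degree):
    """Poisson approximation of P(DC = k) for mean degree <k> much smaller than n."""
    return math.exp(-mean_degree) * mean_degree**k / math.factorial(k)


rng = random.Random(0)                   # arbitrary fixed seed for reproducibility
n, p = 200, 0.05                         # illustrative values only
a = gilbert_graph(n, p, rng)
dc_in, dc_out = degree_centralities(a)
mean_degree = sum(dc_out) / n            # empirical estimate of <k>
```

Because the sampled matrix is symmetric, `dc_in` and `dc_out` coincide, in line with the undirected formula. The empirical mean degree should lie close to $p(n - 1)$, and a histogram of `dc_out` could be compared against `binomial_pmf` or `poisson_pmf`.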

1.1.2.2 Local topology/connectivity of networks

The investigation of local topological properties of the network allows the identification of substructures or vertices with particular characteristics. For instance, a vast number of metrics have been developed to prioritize vertices in terms of their connectivity pattern or other related measures of centrality within networks. In addition to degree centrality, such local topological measures allow, for instance, the identification of bottlenecks, e.g. vertices that connect different network modules [9], due to a high betweenness centrality (BC) of these vertices (Fig. 1.3A). Specifically, the BC for a vertex $v_i$ is formally defined as [10]

$$BC(v_i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(v_i)}{\sigma_{st}},$$

where $\sigma_{st}$ counts the number of shortest paths from vertex $v_s$ to vertex $v_t$ and $\sigma_{st}(v_i)$ represents the number of shortest paths from vertex $v_s$ to vertex $v_t$ that also include vertex $v_i$ [10].

Figure 1.3: Illustration of vertex centralities in a graph. A) Colors and sizes of vertices represent the betweenness centrality score of the respective vertices. B) Colors and sizes of vertices represent the clustering coefficient score of the respective vertices. [Color scale: Centrality]

As another example of a local topological metric, the local clustering coefficient (CC) can be employed to identify vertices that are linked to highly connected clusters in the network (Fig. 1.3B). Particularly, let $N(v_i)$ be the set of vertices that are neighbors of vertex $v_i$. Then $CC(v_i)$ can simply be defined as the fraction of actual versus possible connections between all the pairs of such neighbors [5]. Specifically, in a directed network a total of $|N(v_i)|(|N(v_i)| - 1)$ connections can exist between the neighbors. Let $m_i$ denote the number of observed connections between the neighbors of vertex $v_i$. Then the CC of vertex $v_i$ in a directed network equals

$$CC(v_i) = \frac{m_i}{|N(v_i)|(|N(v_i)| - 1)}.$$
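The two definitions above can be made concrete with a small sketch that computes BC and CC directly from the formulas on a toy directed graph. Everything here is an illustrative assumption rather than the thesis' implementation: the graph, the function names, the choice to treat $N(v_i)$ as the union of in- and out-neighbors, and the convention of assigning $CC = 0$ to vertices with fewer than two neighbors. Shortest paths are counted by breadth-first search, and $\sigma_{st}(v_i)$ is obtained as $\sigma_{si} \cdot \sigma_{it}$ whenever $v_i$ lies on a shortest path from $v_s$ to $v_t$.

```python
from collections import deque

# Toy directed graph (assumed example): a triangle a-b-c with a pendant vertex d
# attached to c. Every link is stored in both directions.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def bfs_counts(adj, s):
    """Return shortest-path lengths and shortest-path counts from source s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """BC(v) = sum over ordered pairs s != v != t of sigma_st(v) / sigma_st."""
    info = {s: bfs_counts(adj, s) for s in adj}
    bc = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue
            for v in adj:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    bc[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return bc


def clustering(adj):
    """CC(v) = m / (|N(v)| (|N(v)| - 1)) with m directed edges among the neighbors."""
    cc = {}
    for v in adj:
        # Assumption: neighbors are the union of out- and in-neighbors.
        nbrs = set(adj[v]) | {u for u in adj if v in adj[u]}
        nbrs.discard(v)
        k = len(nbrs)
        m = sum(1 for a in nbrs for b in adj[a] if b in nbrs and b != a)
        cc[v] = m / (k * (k - 1)) if k > 1 else 0.0  # convention for k < 2
    return cc


bc = betweenness(adj)
cc = clustering(adj)
print("BC:", bc)
print("CC:", cc)
```

In this toy graph, vertex c is the only bottleneck (all paths to or from d pass through it), whereas a and b sit inside a fully connected neighborhood. The brute-force summation over all vertex triples is chosen for readability; for larger networks a more efficient scheme would be needed.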

1.1.2.3 Network clustering

Depending on the nature of the modeled data, edges are often not evenly distributed in the graph, resulting in an irregular, nodular network topology (Fig. 1.4A). For instance, the graph might exhibit groups of vertices, also referred to as communities or modules, that display a higher degree of connectivity to each other than to the rest of the network [11]. Graph clustering can then be understood as a discipline, which is concerned with the development and application of a variety of different approaches in order to identify substructures in a graph or partition an entire graph into subgraphs [11, 12].

Figure 1.4: Illustration of hierarchical, agglomerative clustering. A) A network with some expected modularity or existence of more strongly connected subgraphs. B) Dendrogram produced by hierarchical clustering on the neighborhood-derived vertex similarities and using average linkage. C) Network from (A) with vertices and edges colored according to cluster affiliation. [Dendrogram axis: Height, 1.0 to 0.0]

For instance, many clustering approaches exist that can group data points based on some measure of similarity, and such methods are also applicable to graphs after defining similarities between vertices [12]. As an example, in an undirected, unweighted graph similarities between vertices can be computed as the Jaccard index, i.e. the relative overlap, of their neighborhoods as [12]

$$w_{ij} = \begin{cases} \dfrac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}, & \text{if } i \neq j, \\ 0, & \text{otherwise.} \end{cases}$$

Translating such similarities to a corresponding distance measure

$$\mathrm{dist}(v_i, v_j) = 1 - w_{ij},$$

it is then possible to employ a very popular clustering strategy, referred to as hierarchical agglomerative clustering [13]. Specifically, this clustering method generally operates according to the following steps

Initialization: In the initial step, each vertex $v_i$ is assigned to its own cluster $c_i$, i.e. there are $|V| = N$ clusters, and a distance matrix $D_{N \times N}$ is created to store all pairwise distances between the clusters.

Iteration: Subsequently the following operations are repeated until all vertices belong to a single cluster

1. Identify the pair $(c_i, c_j)$ of clusters with the smallest distance, i.e. satisfying

   $$D_{ij} = \min_{v,w} D_{vw},$$

2. Merge clusters $c_i$ and $c_j$ into a new cluster and compute the distance of all other clusters to this new cluster using an adequate distance metric. Particularly, in hierarchical clustering, common distance measures encompass single linkage

   $$\mathrm{dist}_{min}(c_s, c_u) = \min_{v \in c_s, w \in c_u} \mathrm{dist}(v, w),$$

   complete linkage

   $$\mathrm{dist}_{max}(c_s, c_u) = \max_{v \in c_s, w \in c_u} \mathrm{dist}(v, w),$$

   and average linkage

   $$\mathrm{dist}_{avg}(c_s, c_u) = \frac{1}{|c_s||c_u|} \sum_{v \in c_s} \sum_{w \in c_u} \mathrm{dist}(v, w),$$

3. Remove rows and columns $i$ and $j$ from the distance matrix and add a new row and column with the computed distances for the new cluster.

4. If all vertices belong to a single cluster, finish. Otherwise, return to step 1.
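As a small worked illustration of the quantities used in these steps, the sketch below computes the neighborhood-based Jaccard distance and the three linkage criteria on a toy undirected graph. The graph (two triangles joined by a single edge), the vertex labels, and all function names are assumptions chosen for illustration only.

```python
# Toy undirected graph (assumed example): triangles a-b-c and d-e-f joined by edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}


def jaccard_similarity(adj, i, j):
    """w_ij: relative overlap of the neighborhoods of i and j (0 if i == j, as defined above)."""
    if i == j:
        return 0.0
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0


def distance(adj, i, j):
    """dist(v_i, v_j) = 1 - w_ij; only meaningful for pairs of distinct vertices."""
    return 1.0 - jaccard_similarity(adj, i, j)


def single_linkage(adj, cs, cu):
    """Smallest vertex-to-vertex distance between clusters cs and cu."""
    return min(distance(adj, v, w) for v in cs for w in cu)


def complete_linkage(adj, cs, cu):
    """Largest vertex-to-vertex distance between clusters cs and cu."""
    return max(distance(adj, v, w) for v in cs for w in cu)


def average_linkage(adj, cs, cu):
    """Mean vertex-to-vertex distance between clusters cs and cu."""
    total = sum(distance(adj, v, w) for v in cs for w in cu)
    return total / (len(cs) * len(cu))


left, right = ["a", "b", "c"], ["d", "e", "f"]
d_ab = distance(adj, "a", "b")
d_single = single_linkage(adj, left, right)
d_complete = complete_linkage(adj, left, right)
d_average = average_linkage(adj, left, right)

print("dist(a, b)       =", round(d_ab, 4))
print("single linkage   =", d_single)
print("complete linkage =", d_complete)
print("average linkage  =", round(d_average, 4))
```

Note that the two directly connected vertices c and d share no neighbors and are therefore maximally distant under this measure, which shows that the neighborhood-overlap distance captures shared context rather than adjacency itself.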

Since vertices are added to clusters in a bottom-up fashion, the approach results in a nested, i.e. hierarchical, clustering structure, which can be represented as a dendrogram (Fig. 1.4B), where the height at which subclusters are joined indicates the respective value of the distance metric between these subclusters. Accordingly, by selecting a cut-off height, the dendrogram can be split into a number of clusters, which in turn represent communities of vertices (Fig. 1.4C).
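The complete procedure, including the cut-off, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the distance matrix is a made-up example, only average linkage is implemented, and cluster distances are recomputed from the original pairwise distances in every iteration instead of updating rows and columns of a matrix. Because merge heights never decrease under average linkage, stopping as soon as the smallest cluster distance exceeds the cut-off gives the same clusters as cutting the full dendrogram at that height.

```python
from itertools import combinations

# Assumed pairwise distances between five vertices (symmetric, zero diagonal).
D = [
    [0.00, 0.20, 0.30, 0.90, 0.90],
    [0.20, 0.00, 0.25, 0.90, 0.90],
    [0.30, 0.25, 0.00, 0.90, 0.90],
    [0.90, 0.90, 0.90, 0.00, 0.10],
    [0.90, 0.90, 0.90, 0.10, 0.00],
]


def average_linkage(D, cs, cu):
    """Mean pairwise distance between the members of clusters cs and cu."""
    return sum(D[v][w] for v in cs for w in cu) / (len(cs) * len(cu))


def agglomerate(D, cutoff):
    """Merge the closest pair of clusters until the merge height would exceed cutoff.

    Returns the final clusters and the list of merges (cluster, cluster, height),
    i.e. the information a dendrogram would display.
    """
    clusters = [(i,) for i in range(len(D))]  # initialization: one cluster per vertex
    merges = []
    while len(clusters) > 1:
        best_pair, best_height = None, None
        for a, b in combinations(range(len(clusters)), 2):
            height = average_linkage(D, clusters[a], clusters[b])
            if best_height is None or height < best_height:
                best_pair, best_height = (a, b), height
        if best_height > cutoff:
            break  # all remaining joins lie above the cut-off height
        a, b = best_pair
        merges.append((clusters[a], clusters[b], best_height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges


clusters, merges = agglomerate(D, cutoff=0.5)
heights = [h for _, _, h in merges]
print("clusters at cut-off 0.5:", sorted(sorted(c) for c in clusters))
print("merge heights:", [round(h, 3) for h in heights])
```

Raising the cut-off to 1.0 in this example merges everything into a single cluster, while lowering it below 0.1 leaves every vertex in its own cluster.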


1.1.2.4 Random walks and diffusion

Finally, network propagation describes a category of methods employed to investigate and simulate information flow within a network. Representative methods encompass variations of diffusion processes or random walks on networks, which allow numerous modes of analyses [14, 15]. As an example, a random walk on a graph $G(V, E)$ represents a sequence of vertices $v^{0}, v^{1}, v^{2}, \dots$, where $v^{t}$ denotes the vertex visited at time step $t \geq 0$, vertex $v^{0}$ can either be fixed or be randomly chosen from some initial probability distribution $p(0)$, and at any time step $t$ the next vertex $v^{t+1}$ is randomly chosen among the neighbors of the current vertex $v^{t}$ (Fig. 1.5A). Assuming an undirected, unweighted graph, the probability for any such transition from vertex $v_i$ to $v_j$ is typically given by a row-stochastic transition matrix $P_{|V| \times |V|}$ with elements

$$P_{ij} = \begin{cases} \dfrac{1}{DC(v_i)}, & \text{if } e(i, j) \in E, \\ 0, & \text{otherwise.} \end{cases}$$

Using the transition matrix, it is then further possible to determine the probability $p_i(t)$ that the random walker is present in vertex $v_i$ at time step $t$ (Fig. 1.5B). Specifically, the associated probabilities for all vertices at time step $t$ are given by the vector

$$p(t) = p(0) P^{t},$$

which for large $t$ will converge to a stationary probability distribution (provided the graph is connected and not bipartite),

$$\lim_{t \to \infty} p(t) = p_{\infty}.$$

Thus, random walks and related propagation and diffusion processes make it possible to study how information from one or more vertices spreads through the network.
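The following sketch illustrates both views of a random walk on a toy undirected graph: sampling an individual walk, and propagating the probability vector $p(t) = p(0) P^{t}$. The graph, the vertex labels, the seed, and all function names are assumptions made for illustration. The comparison with a degree-proportional stationary distribution relies on a standard result for connected, non-bipartite, undirected graphs that is not derived in the text above.

```python
import random

# Toy undirected graph (assumed example): triangle a-b-c with pendant vertex d on c.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


def build_adjacency(vertices, edges):
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    return adj


def transition_matrix(vertices, adj):
    """P[i][j] = 1 / DC(v_i) if v_i and v_j are adjacent, else 0 (no isolated vertices assumed)."""
    return [[1.0 / len(adj[u]) if w in adj[u] else 0.0 for w in vertices] for u in vertices]


def propagate(p0, P, t):
    """Return p(t) = p(0) P^t by repeated vector-matrix multiplication."""
    p = list(p0)
    for _ in range(t):
        p = [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(p))]
    return p


def random_walk(adj, start, length, seed=0):
    """Sample a walk with `length` steps; each step picks a uniformly random neighbor."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk


adj = build_adjacency(vertices, edges)
P = transition_matrix(vertices, adj)

p0 = [0.0, 0.0, 0.0, 1.0]  # the walker starts in vertex d with certainty
p100 = propagate(p0, P, 100)

# Standard result (assumption, see lead-in): stationary probabilities are proportional to degree.
total_degree = sum(len(adj[v]) for v in vertices)
stationary = [len(adj[v]) / total_degree for v in vertices]

walk = random_walk(adj, "d", 5)
print("sampled walk:", walk)
print("p(100):     ", [round(x, 4) for x in p100])
print("degree / 2|E|:", stationary)
```

After 100 steps the starting vertex no longer matters in this example: the probabilities depend only on the degrees, so the hub-like vertex c is visited most often and the pendant vertex d least often.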

Figure 1.5: Illustration of random walks on graphs. A) A random walk of length 5 starting from a random vertex. B) Probabilities $p_i(t = 100)$ that a random walker beginning in the same starting vertex as in A (highlighted by black border) can be found on vertex $v_i$ at iteration $t = 100$. [Color scale: $p_j(t = 100)$]

1.1.3 Challenges and problems

The brief outline given above might paint the picture of networks as a readily utilizable tool for modeling and analyzing data. However, especially as the systems and processes to be studied become more complex, application of network analysis is in fact often hampered by a variety of issues of different natures.

For instance, selecting a suitable network model for a complex process might not be straightforward. Specifically, given a real-world phenomenon, how does one define interactions? If the process can be described using multiple types of interactions, can a network model capture all of them, or is it necessary to select a subset of interactions to study? Given a selected type of interaction, which data are available to infer such interactions and which network types are suitable to represent the interaction with respect to both the available data and the hypothesis to be addressed?

High quality data acquisition might be difficult for many research subjects, implying further constraints on network modeling. Specifically, datasets are often burdened by inherent noise and display small sample sizes coupled to a high-dimensional feature space. Thus, considerations about signal-to-noise ratios and the so-called ‘curse-of-dimensionality’ become integral to the process of inferring networks. In order to deal with such concerns, establishing a robust network model often requires a certain level of abstraction based on a set of assumptions. Given a certain process and hypothesis of interest, the challenge then presents itself in the choice of an appropriate type of network simplification, such that the modeling and analysis of the relevant interactions still yield results meaningful in the real-world context.

A third type of issue might arise due to a gap between theory, in the form of mathematically defined concepts and methodology, and praxis, with respect to the particular questions and goals to be addressed in an actual application. Specifically, depending on the quality or nature of the underlying data, the results obtained from investigating a certain question in the network might often be highly method-dependent. Yet, given the potential complexity of the data, it is often difficult to readily judge how the choice of methods might influence results or which approach might be best suited to address a given question. Furthermore, given the particular properties of a process of interest and the related network model, generic methods might not always be adequate. Instead, to advance such areas, it is then required to develop more bespoke tools.

1.2 Cancer

All known living organisms can be conceptualized as systems, the smallest independent units of which are biological cells. Such cells can exhibit an astonishing variety of shapes and features not only between organisms but also within a multicellular organism, thus facilitating the myriad of different anatomical and functional requirements of the individual organs. As a consequence, developing such a complex multi-organ structure and maintaining its structural and physiological integrity requires biological cells to be highly plastic and adaptable. Specifically, in order to coordinate the growth of a multi-organ structure and to later on be able to replenish parts of such tissues, biological cells have to be able to proliferate, differentiate, interact, and migrate.

Considering the complexity of the human body, it is not surprising that under healthy conditions such cellular processes are subjected to a tightly orchestrated control system, embodied by various checkpoints and a developmental hierarchy at both spatial and temporal levels. Nevertheless, while such an intricate regulation is needed in order to ensure proper bodily physiology, if disturbed it also leaves the organism susceptible to abnormal and potentially fatal cellular behaviors. Specifically, it is now well understood that genomic alterations can enable cells to circumvent typical control systems, thus allowing them to replicate more uncontrollably. Depending on the nature of such an abnormal growth behavior, a normal tissue could thus be transformed into a benign or malignant tumor, the latter of which is also referred to as cancer and might eventually lead to the death of the organism.

The last decades have substantially increased our understanding of cancer biology (compare for instance [16, 17]). Specifically, the study of cancer cells has identified a number of key mechanisms underlying the malignant behavior of these cells [18, 19]. According to the 2011 definition by Hanahan and Weinberg [19] there are a total of 8 such defining signatures, referred to as “hallmarks of cancer” (sustaining proliferative signaling, evading growth suppressors, avoiding immune destruction, enabling replicative immortality, activating invasion & metastasis, inducing angiogenesis, resisting cell death, and deregulating cellular energetics), and two “enabling characteristics” (tumor promoting inflammation and genome instability & mutation). Furthermore, the continuous development of ever more advanced transcriptional, genomic, proteomic, and epigenetic profiling techniques has also allowed researchers to explore the molecular bases through which such cancer defining features might arise. For instance, we have witnessed tremendous progress in the mapping of genome landscapes and the identification of individual cancer-related aberrations in individual genes or pathways [20–25]. In addition, it has been demonstrated that various tumor types harbor more delimited subclasses with distinct molecular and clinical properties [26–29], making it possible to narrow down the putative genomic drivers of smaller groups of cancer entities. Together, such efforts have established more insights than ever before into the origins and mechanisms behind cancer development, have greatly improved the stratification and diagnosis of cancers, and have substantially aided the design of therapeutic strategies.

However, despite the recent progress made towards identifying the various genetic alterations that can occur in a cancer, the actual driving mechanisms in many of such cancers still remain largely unknown. In fact, the mapped landscapes of cancer genomes are typically very complex, making it difficult to distinguish true driver events from a wealth of other genetic alterations present in many of such tumors [21, 22]. While it is certainly possible to experimentally validate individual or combinations of the discovered genomic aberrations in order to determine their role for tumor development, such efforts are typically very time consuming and costly. Instead there is a clear need for computational methods that can prioritize putative cancer genes or pathways from a list of candidates.

Focusing on potential applications in the context of a childhood brain tumor referred to as medulloblastoma (MB), this thesis explores various concepts and challenges of a network-centered theme of candidate cancer gene prioritization. Specifically, the following sections will first introduce a molecular and clinical description of MB. Subsequently a brief overview of the nature of cancer genes in general and some of the most prominent methods for their detection will be given. Finally, the last part of the introduction will then be dedicated to a more in-depth review of the theoretical background and application schemes of network-based cancer gene prioritization methods.

1.2.1 Medulloblastoma

Among the various forms of tumors, brain cancer takes a special place with respect to at least two criteria. First, it might be argued that these cancers and their treatment are especially frightening, because they do not only threaten the body, but might also entail changes to the mind and personality of a patient. Second, the development of less invasive, more targeted drug therapeutics is hampered by the blood-brain barrier [30, 31].

According to the World Health Organization (WHO), a large number of different types of brain tumors have now been delineated, including for instance astrocytic, oligodendroglial, and neuronal-glial tumors, MBs, ependymomas, and meningiomas [32]. Among these, MB is the most common form of malignant brain tumor in children, with an overall yearly incidence rate of 1.5-1.8 occurrences per million and a substantially higher yearly rate of up to 6 occurrences per million in children [32, 33].

Current treatment strategies include combinations of surgery, radiotherapy and chemotherapy, achieving 5-year patient survival rates of around 70% [34], but survivors often suffer from neurocognitive sequelae [35, 36]. In addition, MBs present with a high rate of metastasis [34, 37], and patients often relapse or sometimes develop secondary neoplasms potentially as a consequence of the treatment [38].

In order to overcome such persisting therapeutic challenges, a lot of focus has been directed towards risk stratification, identification of targetable driver events as well as personalized therapy, as briefly discussed below.


1.2.1.1 Classification and molecular subgroups

A traditional classification scheme for MBs referred to a histopathological subtyping into tumors with (i) classic, (ii) desmoplastic/nodular, (iii) extensive nodular, or (iv) large cell/anaplastic appearance [39].

Figure 1.6: Molecular subgroups of medulloblastoma: clinical and molecular characteristics. Driver genes are in this context considered to be genes with a high frequency of genomic alterations in the respective subgroup. Reprinted by permission from Springer Nature: Nature Reviews Cancer, Medulloblastomics: the end of the beginning, Northcott, P.A. et al. 2012 [34].

In addition, several unsupervised classification efforts of transcriptional data from various cohorts have also introduced molecular classifications of MBs [40–43]. These individual classifications have subsequently been integrated to form a consensus classification with four MB subgroups termed (i) Wingless/Integrated (WNT), (ii) Sonic hedgehog (SHH), (iii) Group 3, and (iv) Group 4 [27]. These groups have been found to be recapitulated by DNA methylation profiling [44, 45], and have been shown to associate with distinct clinical features, such as occurrence rates, age and sex distributions, survival prognoses, presence of metastases, and molecular characteristics [27, 34, 37, 46] (Fig. 1.6).

Recently, it has also been decided to integrate histopathological classifications, molecular classifications and additional signature mutations in order to obtain a more complete stratification of MB patients [32].

Not included in this official classification scheme are yet more recent findings, which suggested that the molecularly defined subgroups might be further subdivided into more delineated subsets or subtypes [25, 47, 48].

1.2.1.2 Driver genes and pathways

As mentioned above and illustrated in figure 1.6, the molecular subgroups of MB have been linked to specific genomic and transcriptional landscapes, which include for instance transcriptional signatures, somatic mutations, and structural copy number alterations. For instance, the WNT and SHH subgroups received their nomenclature from a readily distinguishable activation of the WNT or SHH signaling pathways, which are also thought to drive the development of these tumors, respectively [27, 34, 49].

Specifically, beyond a general transcriptional upregulation of the WNT pathway [40–43], WNT patients are further characterized by highly recurrent activating somatic mutations in the CTNNB1 gene and a monosomy of chromosome 6 [27, 34, 49]. Of note, CTNNB1 has been recognized as a distinct driver gene of this subgroup, as also supported by the generation of MB mouse models with stabilized CTNNB1 that recapitulate features of the WNT subgroup [50, 51]. Germline loss-of-function mutations in the WNT inhibitor gene APC, albeit less frequently observed, might constitute another driving event [27, 52]. A number of other gene alterations have also been detected in this subgroup, associated for instance with the genes DDX3X, SMARCA4, CSNK2B, TP53, KMT2D, and PIK3CA [25].

Tumors of the SHH subgroup exhibit a transcriptional profile associated with an upregulation of SHH signaling [40–43]. The subgroup is further characterized by recurrent mutations in PTCH1, which encodes a negative regulator of SHH signaling, and the loss of chromosome 9q, on which PTCH1 is located [27]. In addition, genomic alterations have also been found in the SHH-associated genes SUFU, SMO, MYCN, and GLI2 [25, 50, 53–55]. Indeed, SHH signaling activating events such as the deletion/inactivation of PTCH1 or SUFU, or activating mutation/overexpression of SMO have shown great promise as drivers, as supported by a number of mouse models that develop SHH-affiliated MB tumors from the respective genetic backgrounds (compare [34, 56, 57] and references therein). Other genomic alterations in SHH patients affect for instance TERT, DDX3X, TP53, KMT2D, and CREBBP [25].

As indicated by their generic names, less is known about potential driver genes or pathways in Group 3 and Group 4 MBs. In a recent study probing the genomic landscape of MBs, Northcott et al. [25] have been able to associate roughly 80% of Group 3 and Group 4 patients with one or more putative drivers (recurrently altered genes). Nevertheless, despite these findings, little is known about which events actually act as drivers of Group 4 and a large fraction of Group 3 tumors.

Group 3 tumors present with frequent chromosomal aberrations such as losses of chromosomes 8, 10q, 16q or 17p and gains of chromosomes 1q, 7, or 17q, including the formation of isochromosome 17 [25, 37, 50, 53–55]. The most frequently altered gene in Group 3 MBs is MYC [25], which has also been established as a driver of a subset of these tumors. Specifically, two research laboratories have already established mouse models with Group 3 MB-like characteristics by ectopically expressing MYC in a Trp53-deficient setting [58, 59]. Additionally, albeit less frequent than MYC amplifications, Group 3 MBs also show recurrent amplifications of MYCN [25, 50, 53, 54], and it has been demonstrated that the over-expression of MYCN from the Glt1 promoter in mice generated a model of MB [60], which has subsequently also been classified as a Group 3 tumor [56, 61]. However, these models can only recapitulate a fraction of all the Group 3 MBs and more insight into the driver events of the remaining patients is needed to be able to more fully model this subgroup. Additional alterations in Group 3 MBs have for instance been discovered in GFI1B, SMARCA4, KBTBD4, CTDNEP1, and KMT2D [25].

Group 4 MBs have been found to harbor a variety of broad copy number alterations, including for instance the loss of chromosomes 8, 11p, 17p, or the gains of chromosomes 7 or 17q, including the formation of isochromosome 17 [25, 37, 50, 53–55]. Comparisons of the transcriptional profile of this group with the other subgroups have revealed an up-regulation of genes involved in neuronal differentiation and glutamatergic receptor signaling [40, 42]. Genes frequently targeted by genomic alterations in Group 4 include, for instance, PRDM6, GFI1B, KDM6A, OTX2, ZMYM3, KMT2C, KBTBD4, and MYCN [25]. While some of these might be argued to be putative drivers of this subgroup, no related mouse models of Group 4 have yet been established from alterations in these genes [56]. Nevertheless, a recent phosphoproteomic/proteomic based screen has further identified the aberrant activity of ERBB4-SRC signaling in Group 4 tumors and the authors have reported the generation of a SRC-driven mouse model that recapitulates human Group 4 tumors [62].

1.2.1.3 Current problems and future perspectives

Due to the extensive profiling efforts during the last years, the genomic landscape of MB has now been largely unveiled [25, 50, 53–55]. However, further efforts are still required to fully understand the pathogenesis of the disease. Specifically, despite the abundance of genetic alterations detected in the various MB subgroups and subsets, driver mechanisms and cells of origin for Group 3 and Group 4 MBs are still poorly explored, mouse models recapturing a large fraction of Group 4 and Group 3 tumors are still lacking, and drug testing against many potential targets has yet to be conducted.

Thus, the next quest for MB research can be envisioned to encompass a further characterization of the identified genomic events in an attempt to uncover the biological mechanisms underlying tumor development and progression. Considering the resources and time required to conduct experimental validations of individual gene candidates, the first step in this endeavor will likely require additional filtering of candidates in order to predict promising genes for follow-up functional analyses. Specifically, given a set of genetic aberrations, how can genes be evaluated computationally in order to (i) distinguish driver events from passenger events, (ii) gauge the effect of a mutation on the entire cellular system, and (iii) detect cross-talk between genes?

The latter two questions are particularly relevant when considering cancer as a multi-step disease relying on the alteration and collaboration of multiple genes, as discussed in the following section. For instance, while several mouse models have demonstrated the role of individual genes in brain tumor genesis, these models might not always be clinically relevant, e.g. if they only use one driving gene or a given driving oncogene in combination with an arbitrary tumor suppressor gene that does not reflect observed mutational co-occurrences. Thus, in addition to identifying driver genes, more insight is also needed into the potential co-occurrences and implied cross-talk between them.

1.2.2 Cancer genes

Cancer has for a long time been considered a “genetic disease”, i.e. it is proposed to arise due to genetic defects leading to aberrant cellular functions that in turn promote different cancer driving properties [23, 63]. While recent insights into cancer genomes have revealed that genetic events may take a variety of different forms such as genomic relocations, copy number aberrations, or point mutations, a persisting view suggests that it is eventually the altered activity of specific genes which gives rise to cancer progression [23, 64]. Moreover, since biological cells exhibit a wide variety of control mechanisms to protect them from aberrant growth behavior, it has further been reasoned that a number of defects in genes with different biological functions have to act in concert in order for the cell to acquire all relevant tumorigenic properties such as proliferative, invasive and apoptosis-evading capabilities [18, 23, 64]. Since the responsible genetic defects typically do not arise all at once, but instead accumulate over time, thus successively bestowing the cell with the functional aberrations needed for the tumorigenic phenotype to establish, cancer development has also been recognized as a multistep process [63, 64].

The last decades have witnessed the identification of a multitude of different genes implicated in tumorigenesis, including also various MB-related genes as reviewed above. The characterization of numerous such cancer genes has led to the insight that they can be broadly divided into three different categories, which are termed (i) oncogenes, (ii) tumor suppressor genes, and (iii) stability genes [23], and which can be described as follows.

1.2.2.1 Oncogenes

One type of genomic alteration involved in cancer development affects so-called proto-oncogenes, i.e. normal genes which are thought to positively regulate cell growth and proliferation. When such genes are altered in a fashion that constitutes a gain of function, e.g. in terms of upregulated expression or stabilized/constitutively active protein, they are transformed from their proto-oncogene state to an oncogene state [64], in which they can contribute to tumorigenesis through an aberrant induction of cell proliferation.

1.2.2.2 Tumor suppressor genes

Tumor suppressor genes, sometimes also referred to as ‘antioncogenes’ or ‘growth suppressors’ [64], represent genes that instantiate control mechanisms capable of inhibiting cell proliferation, e.g. through the regulation of cell-cycle progression or cell death. Thus, when these genes are affected by genetic alterations leading to a loss of function, the cell might lose a protection from uncontrolled growth behavior.

1.2.2.3 Stability genes

Finally, genomic alterations contributing to cancer development have also been suggested to target a third group, which encompasses DNA stability and caretaker genes, mainly responsible for the regulation of proper DNA repair mechanisms and ensuring chromatin integrity [23]. As such, these genes do not directly contribute to the actual tumor driving processes such as proliferation, cell growth, or evasion from apoptosis. Instead, their alteration causes genomic instability, thus allowing cells to acquire the mutations needed to initiate or progress tumorigenesis [23, 65].


1.2.3 Discovery and prioritization of candidate cancer genes

1.2.3.1 Candidate gene discovery

Recent years have seen a number of techniques employed for discovering putative cancer genes, some popular examples of which will be discussed below.

GWAS. Genome-wide association studies (GWAS) represent one widely used type of association technique, which attempts to link genetic variants, typically single-nucleotide polymorphisms (SNPs), to specific disease phenotypes [66]. This method has already produced numerous insights into associations between genetic polymorphisms and cancer [67]. However, GWAS also presents with a number of limitations [68], which affect the usefulness of this technique for cancer gene discovery. Specifically, such problems may relate to the large number of patients needed to establish confident associations [66], coupled to the issue that cancer is a multigenic disease caused by different combinations of alterations, not all of which necessarily reflect common events [68].

High-throughput omics screens. In addition to traditional GWAS approaches, the emergence of high-throughput microarray- or sequencing-based genomic screening techniques has also unlocked other avenues for the discovery of cancer genes [69, 70]. Specifically, a common approach for identifying candidate cancer genes is to utilize various molecular profiling methods to identify loci that are recurrently affected by somatic mutations, copy number variations (CNVs), or chromosomal rearrangements, or that show dysregulated epigenetic or transcriptional properties. However, similar to the GWAS approach, inferring the whole genome landscape of a tumor requires large cohorts of patients and the approach might ultimately only produce candidate genes, with further effort being needed to separate driver and passenger events.
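As a minimal sketch of the recurrence idea described above, the following Python snippet counts, for a purely hypothetical cohort, in how many patients each gene is altered and keeps the genes whose alteration frequency reaches a chosen threshold. The function name, patient identifiers, gene assignments, and the 50% cutoff are illustrative assumptions and not data or parameters from any of the cited studies. Note that recurrence alone only yields candidates and does not separate drivers from passengers.

```python
from collections import Counter

def recurrently_altered(alterations, min_fraction):
    """Return {gene: fraction of patients altered} for genes at or above min_fraction.

    alterations maps a patient identifier to the list of genes altered in that patient.
    """
    n_patients = len(alterations)
    counts = Counter()
    for genes in alterations.values():
        counts.update(set(genes))  # count each gene at most once per patient
    return {
        gene: count / n_patients
        for gene, count in counts.items()
        if count / n_patients >= min_fraction
    }

# Hypothetical toy cohort (illustrative only, not real patient data).
cohort = {
    "patient_1": ["MYC", "SMARCA4"],
    "patient_2": ["MYC", "KMT2D"],
    "patient_3": ["MYC", "SMARCA4", "GFI1B"],
    "patient_4": ["KBTBD4"],
}

candidates = recurrently_altered(cohort, min_fraction=0.5)
```

In practice, the threshold would have to be chosen relative to cohort size and background mutation rates, which this sketch deliberately ignores.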

Forward genetic screens. Forward genetics presents another powerful technique to screen for potential cancer genes in model organisms. Specifically, in forward genetics, one uses different mutagenic agents to induce almost randomly distributed genomic alterations that can affect the transcriptional regulation and output of genes [71]. Mutagenic agents used in this context include for instance different retroviruses [72] and transposable elements (TEs), such as the Sleeping Beauty (SB) [73] or piggyBac (PB) [74] systems. When genomic events caused by these agents lead to tumor development, their position and function in the genome can be determined retrospectively to identify the potential cancer gene that drove or contributed to tumorigenesis [71].


1.2.3.2 Cancer gene prioritization 1.2.3.2 Cancer gene prioritization The description of cancer genes above seems to draw a clear picture of cancer de- The description of cancer genes above seems to draw a clear picture of cancer de- velopment as an accumulation of genomic alterations in specific genes that drive velopment as an accumulation of genomic alterations in specific genes that drive or enable certain cancer beneficial phenotypes. However, cancers present with or enable certain cancer beneficial phenotypes. However, cancers present with genomic landscapes that can often harbor an astounding number of different al- genomic landscapes that can often harbor an astounding number of different al- terations [20, 21, 25]. Out of such an abundance of genomic events, only a small terations [20, 21, 25]. Out of such an abundance of genomic events, only a small fraction, termed ‘driver mutations’, is considered to be actively involved in tum- fraction, termed ‘driver mutations’, is considered to be actively involved in tum- origenesis, while the larger proportion reflects ‘passenger mutations’ that do not origenesis, while the larger proportion reflects ‘passenger mutations’ that do not contribute to the development of cancer [21, 22, 75, 76]. Hence, the identification contribute to the development of cancer [21, 22, 75, 76]. Hence, the identification of the genes, which drive cancer, does not only require a detection of genomic al- of the genes, which drive cancer, does not only require a detection of genomic al- terations, but furthermore also entails the task of distinguishing between the real terations, but furthermore also entails the task of distinguishing between the real cancer genes and passenger events [75, 76]. cancer genes and passenger events [75, 76]. 
In the ideal case, all putative cancer genes could simply be subjected to a panel of experimental validations in order to identify their involvement in cancer development. Yet, even with state-of-the-art experimental technologies, the validation of just a single gene is both costly and time-consuming. Thus, there is a clear need for a computational framework by which a list of genes can be ranked or prioritized, so that subsequent experimental follow-up validations can be focused first on the most promising candidates. Expanding on a similar description published by Zampieri et al. [77], such a framework for cancer gene prioritization might be generally described as follows. Let $G = \{g_1, g_2, \dots, g_n\}$ denote the set of genes in a cell/organism of interest. Out of these genes, a subset $P_D \subset G$, referred to as positives, is involved in or driving the development of a disease/cancer phenotype $D$, while the other genes, $N_D = G \setminus P_D$, referred to as negatives, are not related to disease $D$. Typically, a limited number of positives $K_D \subseteq P_D$ is known a priori, while the involvement of the remaining genes, $U_D = G \setminus K_D$, in disease $D$ is unknown. The task of cancer gene prioritization is then to compute a scoring

$$S_D(g_i) = \psi(X, g_i), \quad g_i \in U_D,$$

for all genes $g_i \in U_D$ (or a subset of candidates among these genes), where $\psi$ describes a mathematical operation that scores a gene with respect to some indication, based on some available data $X$, of its involvement in disease $D$. Often, the known positive genes ($K_D$) are used as a reference to derive the criteria based on which a gene’s involvement in a disease should be rated. In such cases, the scoring function may also read as

$$S_D(g_i) = \psi(X, g_i, K_D), \quad g_i \in U_D.$$

Ideally, a scoring $S_D(g_i)$ would provide some information about the probability $p(g_i \in P_D)$, but such a relationship often cannot be derived.
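The prioritization framework above can be made concrete with a minimal sketch. Everything named here is an assumption introduced for illustration: the function `prioritize`, the toy scoring function `toy_psi` (which simply rewards genes whose data value lies close to the mean of the known positives), and the toy data are hypothetical and do not come from Zampieri et al. [77] or from this thesis.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Data = Dict[str, float]  # toy stand-in for the available data X


def prioritize(
    genes: Iterable[str],
    known_positives: Set[str],
    data: Data,
    psi: Callable[[Data, str, Set[str]], float],
) -> List[Tuple[str, float]]:
    """Score every gene in U_D (all genes minus K_D) and rank by decreasing score."""
    unknown = [g for g in genes if g not in known_positives]  # U_D
    scores = {g: psi(data, g, known_positives) for g in unknown}  # S_D(g_i)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def toy_psi(data: Data, gene: str, known_positives: Set[str]) -> float:
    """Hypothetical psi: negative distance to the mean value of the known positives."""
    reference = sum(data[k] for k in known_positives) / len(known_positives)
    return -abs(data[gene] - reference)


if __name__ == "__main__":
    G = ["g1", "g2", "g3", "g4", "g5"]
    K_D = {"g1", "g2"}
    X = {"g1": 2.0, "g2": 2.4, "g3": 2.1, "g4": 0.3, "g5": 1.5}
    for gene, score in prioritize(G, K_D, X, toy_psi):
        print(gene, round(score, 2))
```

The ranking only orders the candidates; as noted above, such scores should not be read as probabilities $p(g_i \in P_D)$.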


Network based methods: Among the various types of gene prioritization strategies employed in cancer/disease research, one particularly popular alternative is based on the analysis of biological networks. Modeling relevant processes or dependencies in the biological system as networks, these methods seek to score individual genes or proteins, for instance based on some measure of importance in the network or association with network components that are known to be related to the disease.

A more extensive discussion of network based methods for cancer gene prioritization is given in the next section.

1.3 The network approach

Networks play an important role in the representation and analysis of relational data throughout many different scientific fields. During recent decades they have also gained increasing attention in biological disciplines, including systems biology and computational biology. Specifically, during the course of exploring the biological foundation of an organism’s physiology, two realizations have become clear: (i) many biological processes are established through complex molecular interactions and dependencies, which can be visualized as pathways or networks [2], and (ii) the majority of related biological data gathered will exhibit certain types of measurable intrinsic correlations or associations reflecting such relationships. Networks represent an intuitive choice for modeling and illustrating the diverse types of relationships that constitute these biological systems. Given the myriad of molecular and biochemical processes that take place in a cell, it is not surprising that a wealth of different network concepts has been developed in the field of biology, with some examples being outlined in Paper I. Yet, the common feature describing virtually all such networks is their construction from vertices and edges. Accordingly, biological networks can in most cases be directly translated to a respective mathematical graph-based framework, allowing the analysis of the underlying relational data through the use of a large body of well-defined mathematical methodology established from matrix algebra and graph theory [78].

In essence, the network approach for cancer gene prioritization then comprises two key stages, i.e. (1) the establishment/selection of a network that captures some type of biological relationship informative for the study of tumorigenic mechanisms, and (2) the identification and application of adequate network-based computational and mathematical methods to rank candidate genes.

We can illustrate the related concepts of biological network construction, interpretation and application using the example of gene regulatory networks (GRNs),


also referred to as transcriptional regulatory networks (TRNs) or gene transcriptional regulatory networks, which are one of the predominant types of networks utilized in systems biology and which have been the focus of a large body of research. The next sections will first introduce GRNs as a concrete example of network biology, subsequently outline some frameworks for their mathematical modeling and statistical inference, and finally discuss several types of graph theory based approaches for candidate cancer gene prioritization in such networks.

1.3.1 Gene regulatory networks

In a GRN, vertices are most commonly considered to be genes, while edges contain some type of information about the regulatory interaction or transcriptional relationship between two genes. In a simplified and idealized view, such regulatory interactions could resemble the action of transcription factors (TFs), which bind to target genes and influence their mRNA transcription. In such networks, vertices should thus be separated into genes that encode a TF and genes that do not. An interaction from a gene A to a gene B would then imply that the protein encoded by gene A acts as a TF and binds to a regulatory element, e.g. the promoter, of gene B (which might itself be a TF-encoding gene), thus influencing its transcriptional activity (Fig. 1.7A). The regulatory mechanisms and interactions that involve transcription, translation and protein-DNA binding can then be abstracted as a simple network structure composed of TFs, target genes, and edges between them (Fig. 1.7B). Yet, it is not always clear a priori which of the genes actually encode TFs, and a clear separation of vertices into one of the two classes is thus not always feasible. Furthermore, while this type of network model might represent the most intuitive illustration of the transcriptional regulatory system, other network architectures with varying interpretations of edges between genes exist. The most important of these models and the concept of their inference will be discussed in the following subsection.

While the definition of a GRN as a model of TF-mediated transcriptional regulation of target genes appears to be commonly employed [79], transcription is influenced by a myriad of other factors as well, including for instance non-coding RNAs (ncRNAs), epigenetic factors, chromatin modifiers, posttranslational modifications, and transcriptional co-regulators. Thus, a more general definition of GRNs might consider edges to correspond to any type of gene-gene based regulatory interaction resulting in a measurable transcriptional change, regardless of the regulatory molecule. In fact, since GRNs are typically inferred from gene expression data (see below), Emmert-Streib et al. [80] suggested defining a GRN as “a network that has been inferred from gene expression data”.

Following this latter, more general definition, GRNs can thus be represented


as mathematical graphs, $GRN = G(V, E)$, where the vertices ($V$) correspond to genes, or some variables that reflect a gene’s activity or expression, and the edges ($E$) indicate some type of transcriptional relationship between genes.

The key questions with respect to the establishment of such GRNs are then: (1) how should transcriptional relationships be encoded by the graph, and (2) how can these relationships be inferred from the underlying biological systems? Some aspects related to these two concerns will be discussed further in the following section.

Figure 1.7: Basic regulatory framework of GRNs. A) Under the general dogma of biology, the mRNA transcribed from TF-encoding genes is further translated to TFs, which can then regulate the transcription of target genes. B) Simplifying the regulatory mechanisms in (A), it is possible to arrive at a simplified network architecture, in which only the TFs can act to repress or activate genes. [Panel A shows the DNA, RNA, and protein levels for the genes TF1, G1, G2, TF2, and G3. Legend for panel B: TFi, TF-encoding gene; Gi, non-TF-encoding gene; edges denote activating or inhibiting interactions.]

1.3.1.1 Modeling and inference of GRNs

For a long time, the exploration of transcriptional regulation was largely based on small-scale experimental validations. More recent high-throughput experimental solutions are for instance presented by integrating chromatin immunoprecipitation (ChIP) assays for a certain TF (TF-ChIP) with transcription profiling assays to determine direct TF target genes and the transcriptional effect induced by


the corresponding interactions [81, 82]. Another alternative is the gene-centered method presented by the yeast one-hybrid system, which operates in the other direction by identifying all TFs binding to a given DNA fragment [83, 84]. However, performing such techniques at a high-throughput level for all TFs or gene promoters in a given cell type can be cumbersome and expensive.

On the other hand, given the technological advancements made during the last two decades, the generation of gene expression data, using either microarray or RNA-sequencing platforms, has become fast and inexpensive. Consequently, it is not surprising that the development of methods for the reconstruction of transcriptional regulations from such data has become a prominent field of computational biology. Specifically, consider gene expression data to be available as a matrix $X_{n \times m}$, where $m$ corresponds to the number of samples (observations; columns), $n$ denotes the number of genes (variables; rows), and $x_i^j \in \mathbb{R}$ represents a measure of the expression of gene $g_i$ in sample $j$. Then the strategy of such methods, rather than approaching the problem experimentally, is to attempt to infer or reverse-engineer transcriptional networks from such a gene expression matrix using statistical techniques, often in a purely data-driven fashion.

Given the vast variety of ways to approach the problem of how to model the underlying transcriptional relationships in a network and how to infer such relationships from the expression data, network inference can today be regarded as an umbrella term covering a vast number of approaches with different concepts and goals. For instance, the individual network models employed for this purpose can be categorized on several levels, including e.g. the choice of (i) a static versus dynamic nature of the model, (ii) a discrete or continuous representation of expression values, (iii) the use of linear, non-linear, or Boolean functions to represent gene interactions, and (iv) deterministic or stochastic approaches to model relationships between genes [85].

The remainder of this subsection briefly outlines some popular examples of related approaches for inferring and modeling GRNs.

Co-expression networks: This model represents probably the most intuitive and straightforward approach to network inference. Specifically, in co-expression networks, vertices represent genes and an edge $e(i, j)$ measures some form of similarity or correlation between the expression profiles of genes $g_i$ and $g_j$ as calculated from an underlying expression matrix. For instance, a commonly employed metric for inferring such co-expression relationships is Pearson’s correlation coefficient, which for a pair of genes $g_i$ and $g_j$ can be estimated from the expression


matrix via the sample correlation coefficient as

$$r(x_i, x_j) = \frac{\sum_{s=1}^{m} (x_i^s - \bar{x}_i)(x_j^s - \bar{x}_j)}{\sqrt{\sum_{s=1}^{m} (x_i^s - \bar{x}_i)^2} \, \sqrt{\sum_{s=1}^{m} (x_j^s - \bar{x}_j)^2}},$$

where $m$ is the number of samples, $x_i$ is the vector holding all $m$ expression values of gene $g_i$, $x_i^s$ denotes the expression of gene $g_i$ in sample $s$, and $\bar{x}_i$ is the mean expression of gene $g_i$ across all samples. However, similarity scores can also be derived through other correlation measures such as Spearman’s rank correlation coefficient or Kendall’s tau coefficient, or yet other gene association metrics [86].

After the initial computation of association values, it is then possible to create a sparse network by choosing a ‘hard threshold’ [87, 88] and setting edges between any pair of genes whose correlation value exceeds this threshold, or to generate a weighted network by the use of ‘soft thresholding’, e.g. in the form of raising the absolute correlation value to a power $\beta$ [88, 89]. One prominent example of such a weighted correlation based co-expression method is WGCNA (WeiGhted Correlation Network Analysis / Weighted Gene Co-expression Network Analysis) [89].
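The two thresholding schemes just described can be sketched with NumPy. The function name `coexpression_network`, its parameters, and the toy matrix are hypothetical, and the sketch is not the WGCNA implementation. It also assumes that the hard threshold is applied to the absolute correlation, so that strongly anti-correlated genes are connected too; the text above does not fix this choice.

```python
import numpy as np


def coexpression_network(X, hard_threshold=None, beta=None):
    """Build a co-expression adjacency matrix from an n x m expression matrix.

    Rows of X are genes, columns are samples. Returns an n x n matrix:
    binary if hard_threshold is given, weighted |r|**beta if beta is given,
    otherwise the plain absolute Pearson correlations.
    """
    X = np.asarray(X, dtype=float)
    r = np.corrcoef(X)            # Pearson correlation between rows (genes)
    a = np.abs(r)
    np.fill_diagonal(a, 0.0)      # no self-edges
    if hard_threshold is not None:
        return (a > hard_threshold).astype(float)   # hard thresholding
    if beta is not None:
        return a ** beta                            # soft thresholding
    return a


if __name__ == "__main__":
    # Toy data: 4 genes x 4 samples (hypothetical values).
    X = np.array([
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.5],
        [4.0, 3.0, 2.0, 1.0],
        [1.0, -1.0, 1.0, -1.0],
    ])
    print(coexpression_network(X, hard_threshold=0.8))
    print(np.round(coexpression_network(X, beta=6), 3))
```

Genes with constant expression have zero variance and yield undefined (NaN) correlations, so in practice they would need to be filtered out beforehand.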

Information-theoretic networks: This type of modeling is very similar to the above approach, but instead of using a correlation based measure to determine associations between genes, it relies on the calculation of information-based metrics such as the mutual information (MI), which measures the amount of shared information and hence dependence between two variables. Specifically, considering two discrete random variables $X$, which can take the unique values $x_1, x_2, \dots, x_k$, and $Y$, which can take on the unique values $y_1, y_2, \dots, y_l$, the mutual information between $X$ and $Y$ is defined as [90]

$$I(X; Y) = \sum_{x_i \in X} \sum_{y_j \in Y} p_{XY}(x_i, y_j) \log \left( \frac{p_{XY}(x_i, y_j)}{p_X(x_i) \, p_Y(y_j)} \right),$$

where $p_X(x_i)$ represents the marginal probability that $X = x_i$, $p_Y(y_j)$ represents the marginal probability that $Y = y_j$, and $p_{XY}(x_i, y_j)$ is the joint probability of $x_i$ in $X$ and $y_j$ in $Y$.

Having computed a matrix of estimated MI values between pairs of genes, a simple pruned network can then again be obtained, for instance by applying a hard threshold [91].

Examples of network inference algorithms that employ mutual information include ARACNE (Algorithm for the Reverse engineering of Accurate Cellular


NEtworks) [92], CLR (Context Likelihood of Relatedness) [93], and MRNET [94].
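As an illustration of the mutual information definition given above, the following sketch estimates MI from two already discretized observation vectors by plugging empirical frequencies into the formula. The function name `mutual_information` is hypothetical, the natural logarithm is an assumed choice of base, and this plug-in estimator is not the estimator used by ARACNE, CLR, or MRNET. Continuous expression values would first have to be discretized or handled by a different estimator.

```python
import numpy as np


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete observation vectors."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()

    # Empirical joint distribution p_XY from co-occurrence counts.
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()

    # Marginals p_X and p_Y.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)

    # Sum only over cells with non-zero joint probability (0 * log 0 := 0).
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


if __name__ == "__main__":
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # identical variables
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent variables
```

For two identical binary variables with equal class frequencies the estimate equals $\log 2$, and for empirically independent variables it is zero.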

Regression-based approaches: Rather than computing every gene-gene association via a measure of correlation or mutual information, regression-based methods follow a feature selection strategy to determine the likely regulators, i.e. transcription factors, of a target gene and use this information to reconstruct the network topology. In order to identify a small set of probable regulators for a given target gene, regression analyses are performed to measure the importance of the expression levels of putative transcription factors (independent variables) for predicting the expression level of the target gene (dependent variable). Following the description and notation in [95, 96], the concept of regression-based network inference can be described as follows.

Let the expression of a target gene $g_i$ in sample $s$ be denoted by $x_i^s$, let $\mathbf{x}_{-i}^s$ be the vector of gene expression values (excluding $x_i^s$) of a set of putative regulators (transcription factors) of $g_i$, and let $\epsilon^s$ represent some random noise for sample $s$. Then the regression problem for gene $g_i$ can be stated as [96]

$$x_i^s = f_i(\mathbf{x}_{-i}^s) + \epsilon^s,$$

where the task of network inference is to identify a set of transcription factors that solves this equation adequately well for all $m$ samples in the expression matrix, given a certain implementation of the function $f_i$ [95, 96].

Examples of related network inference methods, using various strategies to approach the feature selection problem, are TIGRESS (Trustful Inference of Gene REgulation using Stability Selection) [95], GENIE3 (GEne Network Inference with Ensemble of trees) [96], ENNET [97], NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) [99], and PLSNET (Partial Least Squares based network inference) [98].
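As a minimal sketch of the regression idea, and not the procedure of any of the methods listed above, the following Python snippet fits a linear model for one target gene from the expression of candidate regulators and ranks the regulators by the magnitude of their fitted weights. The linear form of $f_i$, the omission of an intercept, the function names, and the toy data are all assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes that the matrix a is square and non-singular.
    """
    n = len(b)
    m = [row[:] + [b_k] for row, b_k in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def rank_regulators(tf_expr, target_expr):
    """Fit x_i^s = sum_j w_j * x_j^s + eps^s by least squares (no intercept).

    tf_expr: dict mapping regulator name -> expression values over m samples.
    target_expr: expression values of the target gene over the same samples.
    Returns (name, weight) pairs sorted by decreasing absolute weight.
    """
    names = sorted(tf_expr)
    # Normal equations (X^T X) w = X^T y.
    xtx = [[sum(u * v for u, v in zip(tf_expr[a], tf_expr[b])) for b in names]
           for a in names]
    xty = [sum(u * y for u, y in zip(tf_expr[a], target_expr)) for a in names]
    weights = solve_linear_system(xtx, xty)
    return sorted(zip(names, weights), key=lambda nw: -abs(nw[1]))


# Hypothetical toy data: the target gene is driven by TF1 only.
tf_expr = {
    "TF1": [1.0, -1.0, 2.0, 0.0, -2.0],
    "TF2": [0.5, 1.5, -1.0, 2.0, 0.0],
}
target = [2.0 * v for v in tf_expr["TF1"]]
ranking = rank_regulators(tf_expr, target)
```

In this toy setting the top-ranked regulator would be taken as the most probable parent of the target gene. Repeating the fit with every gene as target yields one ranked regulator list per gene, from which network edges can be selected.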

Bayesian networks: Bayesian networks represent a class of probabilistic models, i.e. in the context of gene regulatory networks they allow a description of probabilistic relationships between the transcriptional states of genes. Specifically, the transcriptional activity of each gene $g_i$ is modeled as a random variable $x_i$ and depicted as a vertex in a directed acyclic graph. Each vertex (gene) $x_i$ has a set of parents denoted as $\mathrm{Parents}(x_i) \subseteq V \setminus \{x_i\}$, where $V$ is the set of all vertices in the network, and the directed edges in this network depict a regulatory relationship from a parent (regulator) to a target gene. In particular, the distribution of transcriptional states of a gene is modeled via a conditional probability given the transcriptional states of its parents.
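Modeling each gene conditionally on its parents implies that the joint distribution factorizes over the graph as $P(x_1, \ldots, x_n) = \prod_i P(x_i \mid \mathrm{Parents}(x_i))$. The following Python sketch evaluates this product for a hypothetical three-gene network with binary states. The network structure, the gene names, and all probabilities are invented for illustration.

```python
from itertools import product

# Hypothetical DAG: gene A regulates B and C; gene B also regulates C.
parents = {"A": (), "B": ("A",), "C": ("A", "B")}

# Conditional probability tables: P(gene = 1 | states of its parents).
# All numbers are invented for illustration.
p_on = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}


def joint_probability(state):
    """P(state) as the product over genes of P(x_i | Parents(x_i))."""
    prob = 1.0
    for gene, gene_parents in parents.items():
        parent_state = tuple(state[p] for p in gene_parents)
        p1 = p_on[gene][parent_state]
        prob *= p1 if state[gene] == 1 else 1.0 - p1
    return prob


# Probability that all three genes are active: 0.3 * 0.8 * 0.9.
all_on = joint_probability({"A": 1, "B": 1, "C": 1})

# Sanity check: the probabilities of all 2^3 joint states sum to one.
total = sum(joint_probability(dict(zip("ABC", bits)))
            for bits in product((0, 1), repeat=3))
```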


Boolean networks: Boolean networks represent a class of discrete, dynamic networks. According to [100], a Boolean network for modeling transcriptional regulation can be defined as follows. Each vertex in such a network represents a binary variable $x_i \in \{0,1\}$, which models the expression of the corresponding gene $g_i$ as either ‘on’ ($x_i = 1$; expressed) or ‘off’ ($x_i = 0$; not expressed). Furthermore, the activity of each vertex is not static; instead, the value $x_i(t+1)$ at time iteration $t+1$ is computed via a Boolean function that takes as input the current activities, i.e. activities at time $t$, of a subset $\mathrm{Parents}(x_i) \subseteq V \setminus \{x_i\}$, referred to as parents of $x_i$, where $V$ is the set of all vertices in the network [100]. Each vertex can have its own Boolean function, composed of Boolean operators such as AND, OR, NOT, or EQ (equality). For example, given a vertex $x_i$ with two parents $x_j$ and $x_k$, and a vertex $x_u$ with one parent $x_v$, the values of $x_i$ and $x_u$ at time $t+1$ could be

$$x_i(t+1) = x_j(t) \wedge x_k(t),$$

$$x_u(t+1) = x_v(t).$$

Such dependencies can be depicted in a directed graph and form a dynamic network, which is typically governed by a synchronous update rule, such that at any iteration the states of all vertices in the network are updated in synchrony.
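A synchronous update can be sketched in a few lines of Python. The example below uses one AND rule and one copy rule in the spirit of the two example functions above. The rules for the parent vertices, which simply hold their value, and all variable names are assumptions added so that the toy network is complete.

```python
def step(state, rules):
    """Synchronous update: every new value is computed from the old state."""
    return {gene: int(rule(state)) for gene, rule in rules.items()}


rules = {
    "xi": lambda s: s["xj"] and s["xk"],  # x_i(t+1) = x_j(t) AND x_k(t)
    "xu": lambda s: s["xv"],              # x_u(t+1) = x_v(t)
    # Assumption: the parent vertices simply keep their current value.
    "xj": lambda s: s["xj"],
    "xk": lambda s: s["xk"],
    "xv": lambda s: s["xv"],
}

state = {"xi": 0, "xu": 0, "xj": 1, "xk": 1, "xv": 1}
next_state = step(state, rules)

# With one parent switched off, the AND rule keeps x_i off.
other_state = step({"xi": 0, "xu": 0, "xj": 1, "xk": 0, "xv": 1}, rules)
```

Because `step` builds the new dictionary entirely from the old one, no vertex sees a value that was already updated in the same iteration, which is what distinguishes a synchronous from an asynchronous scheme.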

Differential equation methods: Differential equation based methods describe a group of dynamic network paradigms in which vertices are continuous variables representing transcript (mRNA) concentrations of the corresponding genes, and the network models the rate of change of those concentrations using differential equations (DEs). Specifically, the underlying framework is based on the kinetic nature of transcription and models gene expression as time-dependent changes of transcript concentration as a consequence of the transcriptional regulatory system and external perturbations. The most commonly employed of such approaches are based on ordinary differential equations (ODEs), such that the rate of change of the mRNA concentration $x_i$ of gene $g_i$ can be written as [101, 102]

$$\frac{dx_i(t)}{dt} = F_i(\mathbf{x}, \theta, u, t),$$

where the function $F_i$ measures the rate of change of $x_i$ given the mRNA concentrations of all $n$ genes at time $t$ stored in the vector $\mathbf{x}$, the directed regulatory interactions in the network stored in $\theta$, and any external perturbation represented by $u$ [101, 102].


The function $F_i$ can take on different linear or nonlinear forms, e.g. one of the simplest implementations is a linear differential equation of the form [101]

$$\frac{dx_i(t)}{dt} = \sum_{j=1}^{n} a_{ji} x_j(t) + \beta_i u(t),$$

where $a_{ji}$ are the weights of the adjacency matrix signifying the strength of the transcriptional regulatory effect from gene $g_j$ to gene $g_i$, and the parameter $\beta_i$ measures how strongly the external perturbation affects the transcription of gene $g_i$ [101].

In addition, other DE frameworks such as partial differential equations (PDEs) or stochastic differential equations (SDEs) can be considered, creating a plethora of different implementation possibilities.
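To make the linear model concrete, the following Python sketch integrates it with a simple forward-Euler scheme. The integration scheme, the step size, the use of negative diagonal weights to represent decay, and the two-gene example are assumptions chosen for illustration and are not taken from [101].

```python
def simulate_linear_ode(a, beta, u, x0, dt=0.01, steps=2000):
    """Forward-Euler integration of dx_i/dt = sum_j a[j][i]*x_j + beta[i]*u(t).

    a[j][i] is the regulatory weight from gene j to gene i, beta[i] the
    sensitivity of gene i to the external perturbation u(t).
    """
    n = len(x0)
    x = list(x0)
    for k in range(steps):
        t = k * dt
        dx = [sum(a[j][i] * x[j] for j in range(n)) + beta[i] * u(t)
              for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x


# Hypothetical two-gene system: gene 0 activates gene 1 (weight 0.5),
# both genes decay (diagonal weights of -1), and only gene 0 responds
# to a constant external perturbation.
a = [[-1.0, 0.5],
     [0.0, -1.0]]
beta = [1.0, 0.0]
x_final = simulate_linear_ode(a, beta, u=lambda t: 1.0, x0=[0.0, 0.0])
```

For this toy system the equations reduce to $dx_0/dt = -x_0 + 1$ and $dx_1/dt = 0.5\,x_0 - x_1$, so the concentrations approach $x_0 = 1$ and $x_1 = 0.5$.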

1.3.1.2 Simulation of gene expression from networks

Importantly, while it has been shown to some degree that GRNs can be inferred from gene expression data, the reverse direction is also possible. For instance, considering a dynamic network model such as the ODE-based framework, in which edges represent the strength and direction of transcriptional regulatory interactions, it is possible to use a system of differential equations, e.g. ODEs or SDEs, to simulate the gene expression profiles generated by the underlying regulatory interactions.

Adhering to the frameworks demonstrated in [103, 104], a corresponding model based on a system of linear ODEs can be described as follows. Rooted in the central dogma of biology but using a simplified view of the transcriptional and translational processes, the underlying assumptions of such a model are usually that (i) transcription (production of mRNA) of a gene is regulated by proteins (regulators/TFs), (ii) such regulator proteins are translated at some rate from the mRNA of the TF-encoding gene, and (iii) both mRNAs and proteins are degraded at some rate. Thus, in order to simulate expression data, two processes have to be considered: the rate of change of protein concentrations is determined by the synthesis of new proteins from the existing mRNA and the degradation of already existing proteins, while the rate of change of mRNA concentrations is determined by the synthesis of new mRNA due to some transcriptional activation and the degradation of already existing mRNA molecules. These two processes can be modeled using a system of two coupled differential equations [103, 104]

$$\frac{dp_i(t)}{dt} = \alpha_i x_i(t) - \beta_i p_i(t),$$


$$\frac{dx_i(t)}{dt} = \gamma_i f_i(\mathbf{p}, \theta, t) - \delta_i x_i(t),$$

where $p_i(t)$ and $x_i(t)$ are the protein concentration and mRNA concentration of gene $g_i$ at time $t$, respectively, $\mathbf{p}$ is a vector holding all protein concentrations, $\alpha_i$ is the translation rate, $\beta_i$ the protein degradation rate, $\gamma_i$ the transcription rate, and $\delta_i$ the mRNA degradation rate of gene $g_i$, and $f_i(\mathbf{p}, \theta, t)$ is a function that describes the transcriptional activity of gene $g_i$ given the concentrations of all proteins $\mathbf{p}$ at time $t$ and their transcriptional regulatory effects on gene $g_i$, which are stored in $\theta$ [103, 104].

A steady-state expression dataset can then be generated by starting at a (random) initial state and iteratively updating mRNA and protein concentrations until some equilibrium has been reached, in which the system no longer changes substantially.
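The iteration towards a steady state can be sketched as follows. Since the form of $f_i$ is left open here, the sketch assumes a sigmoid of the weighted protein concentrations. This choice, as well as the Euler scheme, the convergence tolerance, the fixed initial state, and the two-gene example, are illustrative assumptions and not the implementations of [103, 104].

```python
import math


def simulate_steady_state(theta, alpha, beta, gamma, delta,
                          x0, p0, dt=0.01, tol=1e-9, max_steps=1_000_000):
    """Euler-integrate the coupled protein/mRNA ODEs until they stop changing.

    Assumption: f_i(p, theta) = sigmoid(sum_j theta[j][i] * p_j), where
    theta[j][i] is the regulatory effect of protein j on gene i.
    """
    n = len(x0)
    x, p = list(x0), list(p0)
    for _ in range(max_steps):
        f = [1.0 / (1.0 + math.exp(-sum(theta[j][i] * p[j] for j in range(n))))
             for i in range(n)]
        dp = [alpha[i] * x[i] - beta[i] * p[i] for i in range(n)]  # proteins
        dx = [gamma[i] * f[i] - delta[i] * x[i] for i in range(n)]  # mRNAs
        x = [x[i] + dt * dx[i] for i in range(n)]
        p = [p[i] + dt * dp[i] for i in range(n)]
        if max(abs(d) for d in dx + dp) < tol:  # equilibrium reached
            break
    return x, p


# Hypothetical two-gene network: protein 0 activates gene 1, all rates are 1.
theta = [[0.0, 2.0],
         [0.0, 0.0]]
ones = [1.0, 1.0]
x_ss, p_ss = simulate_steady_state(theta, ones, ones, ones, ones,
                                   x0=[0.1, 0.1], p0=[0.0, 0.0])
```

Repeating such a run from different initial states or with perturbed parameters would give one simulated steady-state sample per run.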
Additional flavors of such models might, for instance, distinguish between the concentrations of molecules in the nucleus and cytoplasm and include transport rates to simulate the flow of molecules between the two locations [105, 106]. Furthermore, since the biological processes underlying gene expression can be assumed to be neither deterministic nor instantaneous, the DEs as presented above can further be supplemented with noise to obtain an SDE formulation [103], and/or adjusted to include time delays [104].

One obvious challenge in the establishment of such models lies in the inference of accurate parameters describing the kinetics of the biological processes. However, once a feasible model has been established, it can have many applications. For instance, being able to model steady-state expression in a cell would make it possible to investigate perturbations and deregulation of system-wide transcription or pathway-specific signaling under cancer or disease states [105]. Furthermore, the ability to model gene expression data from a known regulatory network has opened up a widely used avenue for the validation of reverse-engineered transcriptional networks [103, 107], a concept that is further discussed in the following section.

1.3.1.3 Validation of network inference methods

In order to evaluate the performance of network inference methods, such as those discussed in section 1.3.1.1, researchers commonly compare the regulatory interactions in the predicted networks with some form of reference network. The choice of reference network is usually between two predominant alternatives. Traditionally, references consisted of knowledge-based networks. However, such networks have the disadvantage that they might be incomplete


and represent a generic view of the interactome, rather than any tissue- or disease-specific interactions. More recently, the simulation of gene expression data from a given network structure has allowed more sophisticated comparisons between paired predicted and reference networks [79, 103, 108, 109]. Specifically, in such an approach, one simulates gene expression data from a predefined network structure serving as reference and then infers a predicted network from this synthetic data, thus obtaining a matched pair of inferred and reference networks that can be compared in order to evaluate network inference [103, 109]. This approach might, however, suffer from an over-simplification of transcriptional regulation. Biological transcriptional processes likely involve a plethora of factors, such as epigenetics, posttranslational modifications, or micro-RNAs, which are not included in the described models for simulating gene expression, and thus any validation process based on such simulated data might overestimate the performance of a network inference method [79].

Regardless of the choice of reference network, the performance of a network inference algorithm is commonly determined with respect to the edge prediction accuracy in the corresponding reconstructed networks. A number of measures have been used to address this type of validation [79, 103, 108, 109], most of which depend on the evaluation of the confusion matrix depicted in figure 1.8.
Having established the number of positive edges (P) and negative edges (N), as well as the number of true-positive (TP), false-negative (FN), false-positive (FP), and true-negative (TN) edges in the inferred network, one can then determine a number of different statistics to measure the performance of the inferred network. These statistics include, for instance, the true positive rate (TPR) and true negative rate (TNR), which are given as

$$\mathrm{TPR} = \frac{TP}{P}, \qquad \mathrm{TNR} = \frac{TN}{N},$$

the precision (PR) given as

$$\mathrm{PR} = \frac{TP}{TP + FP},$$

and the accuracy (ACC) defined as

$$\mathrm{ACC} = \frac{TP + TN}{P + N}.$$
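These statistics follow directly from set operations on the edge lists of the two networks. The Python sketch below assumes directed edges without self-loops, so that every ordered pair of distinct vertices is a candidate edge. The function name and the three-gene example are hypothetical.

```python
from itertools import permutations


def edge_metrics(reference_edges, predicted_edges, vertices):
    """Confusion-matrix statistics over all ordered pairs of distinct vertices.

    Assumes at least one reference edge, one non-edge, and one predicted
    edge, so that none of the denominators is zero.
    """
    reference, predicted = set(reference_edges), set(predicted_edges)
    pairs = set(permutations(vertices, 2))  # all candidate directed edges
    tp = len(reference & predicted)
    fn = len(reference - predicted)
    fp = len(predicted - reference)
    tn = len(pairs - reference - predicted)
    p, n = tp + fn, fp + tn
    return {
        "TPR": tp / p,
        "TNR": tn / n,
        "PR": tp / (tp + fp),
        "ACC": (tp + tn) / (p + n),
    }


vertices = ["g1", "g2", "g3"]
reference = [("g1", "g2"), ("g2", "g3")]
predicted = [("g1", "g2"), ("g1", "g3")]
metrics = edge_metrics(reference, predicted, vertices)
```

In this example there are six candidate edges with TP = 1, FN = 1, FP = 1, and TN = 3, which gives TPR = 0.5, TNR = 0.75, PR = 0.5, and ACC = 4/6.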


                                   Predicted network edges
                                   True      False
Reference network edges   True     TP        FN        P = TP + FN
                          False    FP        TN        N = FP + TN

Figure 1.8: Confusion matrix for comparing inferred and reference networks. In the context of network inference, the confusion matrix is established based on the correctly and incorrectly predicted edges. For any pair of vertices in the two networks there can be an edge (True) or no edge (False) connecting them. Edges that exist in the reference network and are correctly predicted in the inferred network are true-positives (TP), while those missed in the inferred network are false-negatives (FN). Conversely, edges that do not exist in the reference network but are predicted in the inferred network are referred to as false-positives (FP), while those edges that do not exist in either network are referred to as true-negatives (TN). The total number of positive edges (P) with respect to the reference network is then the sum $P = TP + FN$, while the number of negative edges (N) with respect to the reference network is the sum $N = FP + TN$.

A typical performance evaluation encompasses the inspection of such metrics over a range of different inference parameters. For instance, through a sequential adjustment of the confidence or strength threshold when selecting predicted network edges, one obtains networks with varying values of TP, FN, FP, and TN, and can then illustrate the relationship between two metrics. Two curves widely used in this fashion are the receiver operating characteristic (ROC) curve, which displays the TPR as a function of the false positive rate ($\mathrm{FPR} = FP/N$), and the precision-recall curve, which depicts the PR as a function of the TPR, which is also referred to as recall. In order to quantify and compare the information provided by such curves, it is then common practice to calculate the area under these curves, i.e. the area under the ROC curve (auROC) or the area under the precision-recall curve (auPR), as the decisive performance metric for a method.
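The threshold sweep can be sketched in Python by ranking candidate edges by confidence and lowering the threshold one edge at a time. The trapezoidal rule for the auROC and the step-wise sum for the auPR (often called average precision) are common conventions that are assumed here rather than prescribed by the text. The function name and the four-edge example are hypothetical.

```python
def roc_and_pr_auc(scores, labels):
    """Sweep the edge-confidence threshold and integrate both curves.

    scores: confidence per candidate edge; labels: 1 if the edge exists in
    the reference network, else 0. Assumes distinct scores and at least one
    positive and one negative edge.
    """
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    p = sum(labels)
    n = len(labels) - p
    tp = fp = 0
    roc = [(0.0, 0.0)]  # (FPR, TPR) points
    pr = []             # (recall, precision) points
    for _, label in ranked:
        tp += label
        fp += 1 - label
        roc.append((fp / n, tp / p))
        pr.append((tp / p, tp / (tp + fp)))
    # Trapezoidal rule for the area under the ROC curve.
    au_roc = sum((x1 - x0) * (y0 + y1) / 2
                 for (x0, y0), (x1, y1) in zip(roc, roc[1:]))
    # Step-wise sum for the PR curve: precision weighted by the recall gained.
    au_pr = sum((r1 - r0) * prec
                for (r0, _), (r1, prec) in zip([(0.0, 1.0)] + pr, pr))
    return au_roc, au_pr


# Hypothetical confidences for four candidate edges, two of them true.
au_roc, au_pr = roc_and_pr_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

For this ranking, three of the four (true edge, false edge) pairs are ordered correctly, so the auROC is 0.75. The precision values at the two recall steps are 1 and 2/3, giving an auPR of 5/6.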


1.3.2 Network-based prediction of cancer genes

With the increasing amount of biological data gathered over the last decades and the need for improved integrative analyses of such data, the field has witnessed the development of a large number of network-based strategies for the prioritization of candidate disease/cancer genes, for the prediction of disease/cancer modules and pathways, and for distinguishing driver from passenger genes.

The following sections discuss three exemplary categories of related cancer gene prediction/prioritization methods, namely approaches based on (i) gene proximity in the network (Fig. 1.9A), (ii) disease network modules (Fig. 1.9B), and (iii) graph centrality (Fig. 1.9C). In general, these methods consider biological networks as mathematical graphs and operate on the basis of well-defined graph-theoretic concepts.

1.3.2.1 Proximity-based methods

One widely used category of methods is based on the frequently reviewed framework of ‘guilt-by-association’ (GBA), which has been particularly highlighted by the proteomics and functional genomics fields and assumes that proteins with related functions are likely to exhibit direct or indirect physical interactions [110]. It has subsequently been shown that such a framework is also feasible for gene expression derived interaction data [111].

Based on the premise that genes/proteins associated with the same disease phenotype are likely to have similar biological functions and will thus display a high tendency to interact physically and/or to be part of the same biological module, GBA represents a promising method to identify novel disease-related genes [112]. Since biological networks make it possible to investigate molecular relationships, numerous network-based GBA methods have been proposed (reviewed for instance in [15, 113–116]) to prioritize individual genes or to identify network modules (see also next section). Since networks also allow an inspection of indirect interactions in terms of network distances or proximity between proteins, the related methods are sometimes also referred to as ‘guilt-by-proximity’ approaches [115].

According to Wu and Li [115], Wang et al. [114], and Navlakha and Kingsford [117], approaches for prioritizing individual genes based on proximity or association to known disease genes can be broadly categorized into local methods, global methods, or graph partitioning related techniques. Following the extensive reviews provided in these publications, a few concepts and examples of such strategies are explained below.


[Figure panels A, B, and C. Legend: score (proximity/centrality) from low to high; Cluster 1, Cluster 2, Cluster 3, Cluster 4; known cancer gene; predicted cancer gene.]

Figure 1.9: Three approaches for predicting cancer genes using network analysis. A) Vertices can be ranked with respect to their proximity to known cancer genes, and the top candidates are selected as putative cancer genes. B) The network can be partitioned into clusters, and novel cancer genes can be predicted based on cluster affiliation with known cancer genes. C) Assuming that known cancer genes take highly central positions in a biological graph, novel cancer genes may be predicted by scoring genes according to the respective centrality measure.

Popular local methods rely on investigating the ‘direct neighbors’ / ‘direct neighborhood’ (DN) of disease genes in the network, i.e. predicting novel disease genes based on direct interactions to known disease genes, or prioritize genes based on the ‘shortest path length’ (SP) to disease genes [114, 115]. For instance, in the former category of DN methods, candidate vertices may be selected as those that physically interact with known disease genes/proteins [118]. Alternatively, vertices can be scored by calculating the number of interaction partners that are disease genes/proteins [119], or by calculating the sum of weights of edges connecting the vertex to disease proteins/genes [120]. On the adjacency matrix $A$ (either weighted or unweighted) of an undirected network, the respective score for a vertex $v_i$ can then simply be computed as [120]

$$S_{DN}(v_i) = \sum_{j \in C} a_{ji},$$

where $C$ denotes the set of indices of vertices corresponding to the known cancer genes/proteins that are used for the GBA prediction.

The SP framework on the other hand allows even indirect interactions between genes and instead involves the evaluation of the shortest path lengths, i.e. shortest distances $d(i, j)$, from a candidate gene to known disease genes [121–124]. For instance, in order to score a vertex (gene) $v_i$, one could first compute the shortest path lengths between $v_i$ and all known cancer related genes, where it is common to set $d(i, j) = \infty$ if there is no path between two genes. Subsequently, the obtained distances can be transformed into a measure of closeness and the score for a vertex can then be obtained as the sum over these closeness values. For instance, Franke [121] and Wu et al. [124] utilized a Gaussian kernel to transform the pairwise distances between genes into a pairwise measure of closeness. Accordingly, a score for vertex $v_i$ based on shortest-path lengths could read as [124]

$$S_{SP}(v_i) = \sum_{j \in C} e^{-d(v_j, v_i)^2}.$$
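The two local scores can be illustrated with a minimal Python sketch. It assumes an unweighted, undirected network stored as a dense adjacency matrix (a list of lists) and computes shortest path lengths by breadth-first search; the function names and the toy graph are hypothetical and only serve to make the formulas concrete.

```python
import math
from collections import deque

def dn_scores(adj, known):
    """S_DN(v_i): sum of a_ji over all known disease vertices j."""
    n = len(adj)
    return [sum(adj[j][i] for j in known) for i in range(n)]

def bfs_distances(adj, source):
    """Unweighted shortest path lengths from source; infinity if unreachable."""
    n = len(adj)
    dist = [math.inf] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] and dist[v] == math.inf:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sp_scores(adj, known):
    """S_SP(v_i): sum over known vertices j of exp(-d(v_j, v_i)^2)."""
    n = len(adj)
    scores = [0.0] * n
    for j in known:
        dist = bfs_distances(adj, j)
        for i in range(n):
            # exp(-inf) evaluates to 0.0, so unreachable pairs contribute nothing
            scores[i] += math.exp(-dist[i] ** 2)
    return scores

# Toy graph: path 0-1-2-3 plus the isolated vertex 4 (illustrative only)
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
known = [0, 2]
dn = dn_scores(adj, known)
sp = sp_scores(adj, known)
```

Note that a known gene contributes a closeness of 1 to its own SP score (distance 0), so in practice one would typically rank only the vertices outside the known set.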

Global methods are usually based on variants of network propagation [15], with examples spanning for instance diffusion kernels (DK) [125], or ‘random walks with restart’ (RWR) [125], equivalent PageRank with priors implementations [126] or related methods [127].

In the DK framework, scores can for instance be calculated as [125]

$$S_{DK}(v_i) = \sum_{j \in C} K_{ji},$$

where $K$ is the diffusion kernel defined as [125]

$$K = e^{-\beta L},$$

with $L$ and $\beta$ denoting the Laplacian matrix of the network and the chosen diffusion strength, respectively.

Following the description by Köhler et al. [125], the RWR framework can be described as follows. We assume a walker/surfer that jumps between connected vertices in a network, where the probability of jumping from one vertex to another is given by the row-stochastic transition matrix $M$, i.e. row normalized adjacency matrix, of the network. The random walk is initialized by setting up a vector of initial probabilities $p_0$, allowing a walker to start at any known disease gene with equal probability and zero probability to start in a gene not known to be related to the disease. At any time point $t+1$ the probability of being in any vertex of the network is then given by the vector [125]

$$p_{t+1} = (1 - \lambda) M^T p_t + \lambda p_0,$$

where $p_t(v_i)$, $t \geq 0$, is the probability of the walker to reside in vertex $v_i$ at time $t$, and $\lambda \in (0, 1)$ is the probability of the walker to return to the initial genes. Once the resulting probability vector has converged to a reasonably stable state (steady-state) $p_\infty$, the score for a gene $v_i$ is given as [125]

$$S_{RWR}(v_i) = p_\infty(v_i).$$
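The RWR iteration can be sketched in a few lines of plain Python. The restart probability of 0.3, the L1 convergence criterion, and the tolerance are assumptions chosen for illustration, not values taken from [125]; the function names are hypothetical.

```python
def row_normalize(adj):
    """Row-stochastic transition matrix M; rows of isolated vertices stay zero."""
    M = []
    for row in adj:
        total = sum(row)
        M.append([a / total if total else 0.0 for a in row])
    return M

def rwr_scores(adj, known, restart=0.3, tol=1e-10, max_iter=10000):
    """Iterate p_{t+1} = (1 - restart) * M^T p_t + restart * p_0 until stable."""
    n = len(adj)
    M = row_normalize(adj)
    # Uniform start over the known disease genes, zero elsewhere
    p0 = [1.0 / len(known) if i in known else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # (M^T p)_i = sum_j M[j][i] * p[j]
        nxt = [
            (1 - restart) * sum(M[j][i] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        # Stop once the L1 change between iterations is negligible
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy graph: path 0-1-2-3 with vertex 0 as the only known disease gene
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
rwr = rwr_scores(path, known={0})
```

On this toy path the steady-state probability decreases with the distance from the seed vertex, which is the behaviour the ranking relies on.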
In the graph partitioning framework, one would start by clustering the network into subgraphs using some type of clustering algorithm and then score genes based on their membership in clusters that are enriched for known disease genes [117] (see also next section).

For a further review on the different outlined methods, the reader is referred to [114, 115, 117].

1.3.2.2 Clustering-based methods

As briefly mentioned in the previous section, beyond the mere scoring of individual genes, the GBA framework can also be employed to detect entire disease modules within a network. Specifically, since proteins/genes related to a certain disease phenotype show an increased likelihood of interacting with each other, they are also likely to form clusters of high connectivity within a network [112, 128, 129].

Numerous methods have been developed to identify communities and clusters in graphs in general [11, 12], and functional modules in biological networks [130]. However, as pointed out by Barabási et al. [113], in the context of identifying so-called ‘disease modules’, one generally has to further distinguish between at least two other types of clusters in a network, represented by ‘topological modules’, i.e. network clusters with higher intra-cluster than inter-cluster connectivity, and ‘functional modules’, i.e. intra-connected network regions that show an enrichment of vertices/genes with similar biological functions. As further stated by the authors, “a disease module may not be identical to, but likely overlaps with, the topological and/or functional modules” [113]. Consequently, ‘disease module’ detection can be performed in a number of different ways.


Specifically, as similarly argued by Vlaic et al. [131], many of the related techniques can be considered to follow one of two broad frameworks. Methods in the first category start from a set of seed vertices, e.g. known disease genes, differentially expressed genes (DEGs), or vertices with some type of disease-related score such as mutation frequency, and then employ propagation/diffusion processes to grow disease modules around these seeds [132–134]. Methods from the second category instead start with the identification of topological clusters/communities, which are then scored as putative disease modules based on some criteria, e.g. by counting the number of known disease-related genes/proteins contained in the cluster or by checking whether the contained genes show differential expression / transcriptional activity in the disease condition [131, 135, 136].

For a further review of related methods and concepts, the reader is referred to [113, 137].
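The scoring step of the second category can be sketched as follows, assuming that a clustering has already been obtained by some algorithm and is available as a mapping from cluster labels to gene sets. The function names, the tie-breaking rule, and the toy data are hypothetical; the sketch only implements the simple counting criterion mentioned above, not any specific published method.

```python
def score_clusters(clusters, known):
    """For each cluster, return (number of known disease genes, their fraction)."""
    scores = {}
    for label, members in clusters.items():
        hits = len(members & known)
        scores[label] = (hits, hits / len(members))
    return scores

def rank_clusters(clusters, known):
    """Order clusters by hit count, then by hit fraction, then by label."""
    scores = score_clusters(clusters, known)
    return sorted(scores, key=lambda c: (-scores[c][0], -scores[c][1], c))

# Toy partition of seven genes into two clusters (illustrative only)
clusters = {
    "A": {"g1", "g2", "g3"},
    "B": {"g4", "g5", "g6", "g7"},
}
known_genes = {"g1", "g2", "g4"}
cluster_scores = score_clusters(clusters, known_genes)
cluster_ranking = rank_clusters(clusters, known_genes)
```

A raw count favours large clusters, so the fraction is reported alongside it; a statistical enrichment test would be a natural refinement.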

1.3.2.3 Centrality-based methods

While GBA methods employ distances or interactions to known disease genes, centrality-based methods instead only rely on the network topology in order to prioritize genes. Specifically, the use of graph centralities allows the identification of genes with certain topological properties in the network. Assuming that the distinct network positions captured in this way are also associated with specific biological functions and implicated in disease phenotypes, graph centralities might thus provide a way of prioritizing genes linked to a specific disease.

A centrality-based framework would then be composed of two main steps, i.e. (i) the identification of an individual centrality measure or a combination of centrality measures that enrich for known disease genes in the network, and (ii) the prioritization of candidate genes with the respective centrality metrics. Thus, once a suitable set of centrality measures for ranking has been identified, the application of the centrality-based approach is essentially independent of known disease/cancer genes. However, disease genes from the same or a highly similar disease are still needed in order to identify such a set of suitable centrality measures.

As also reviewed in more detail in Paper I, several lines of evidence have supported the feasibility of such a strategy during the last decades. For instance, combining transcriptional profiling with protein interaction data, it was found that a selection based on increased expression in squamous cell lung cancer enriched for genes with a high degree centrality in the respective PPI network [138]. Also investigating PPIs, another study revealed that proteins of known cancer genes tended to exhibit higher degree centralities than proteins encoded by genes that have not been associated with cancer [139]. Others have shown that prostate cancer genes could be enriched for using both degree centrality and eigenvector centrality in a literature-mined network [140]. Furthermore, Goh et al. [129] have studied degree centrality in a human PPI network and concluded that (non-essential) disease genes are unlikely to coincide with hub positions, i.e. highly connected vertices. However, the authors also reported that genes affected by somatic mutations, which would include many cancer genes, were excluded from this finding and were instead more prone to display a high connectivity in the network.

A more in-depth review of the theory behind this category of prioritization methods and a more comprehensive discussion of various centralities, their application, and previous results is provided in Paper I.
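The two-step framework can be made concrete with a small Python sketch based on degree centrality. Comparing mean centralities of known and other genes is a crude stand-in for a proper statistical test and is an assumption made here for brevity; all function names and the toy graph are hypothetical.

```python
def degree_centrality(adj):
    """Raw degree of each vertex in an unweighted adjacency matrix."""
    return [sum(1 for a in row if a) for row in adj]

def mean(values):
    return sum(values) / len(values)

def enriches_for_known(scores, known):
    """Step (i): do known disease genes score higher on average than the rest?"""
    known_scores = [s for i, s in enumerate(scores) if i in known]
    other_scores = [s for i, s in enumerate(scores) if i not in known]
    return mean(known_scores) > mean(other_scores)

def prioritize(scores, known):
    """Step (ii): rank the remaining candidates by decreasing centrality."""
    candidates = [i for i in range(len(scores)) if i not in known]
    return sorted(candidates, key=lambda i: (-scores[i], i))

# Toy graph: star centred on vertex 0 with leaves 1-4, plus the edge 1-2
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
deg = degree_centrality(star)
suitable = enriches_for_known(deg, known={0})
candidate_ranking = prioritize(deg, known={0})
```

Once step (i) has been passed, step (ii) uses the known genes only to exclude them from the candidate list, which mirrors the independence noted above.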


2. The present investigation

2.1 Paper I

Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2016). “Graph centrality based prediction of cancer genes”. In: Engineering Mathematics II: Algebraic, Stochastic and Analysis Structures for Networks, Data Classification and Optimization /[ed] Sergei Silvestrov; Milica Rancic, pp. 275-311.

2.1.1 Summary

As described in the introduction, graph centralities represent one of the tools that can be employed to prioritize genes in a network and thus predict putative cancer genes. However, there exist many nuances to such a strategy, where the details of an application might for instance depend on (i) the underlying biological networks, (ii) how the interaction data displayed by such networks has been extracted from the underlying biological systems, (iii) which types of genes are to be identified, and (iv) which centrality measures are used for this purpose.

In this paper we aimed to explore and review the various aspects and concepts relating to the prioritization of candidate disease/cancer genes via the centrality-based ranking of vertices in biological networks. Specifically, we began by outlining the properties of several biological networks and related pioneering efforts that together have led to the incentive for using this method. Furthermore, we discussed the mathematical basics for graph centrality based ranking and how to interpret a related prioritization in terms of possible associations between such a ranking and the enrichment of cancer genes. Finally, we also reviewed potential complications and open questions associated with this technique.


2.2 Paper II

Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2017). “Loss of conservation of graph centralities in reverse-engineered transcriptional regulatory networks”. Methodology and Computing in Applied Probability, 19(4), 1089-1105.

2.2.1 Background and aims

As outlined in the introduction, there exist numerous strategies for reverse-engineering GRNs from gene expression data. When performing a centrality-based ranking on such inferred graphs in order to prioritize genes, the underlying assumption is that the relevant biological network topologies are preserved in the inferred networks. However, when benchmarking network inference algorithms, the major performance criterion is typically how accurately these methods can predict individual edges (transcriptional relationships), rather than the ability to adequately reconstruct overall network centralities or related topological features. Comparing a set of paired reference and inferred GRNs, this paper aimed to elucidate how well the true graph centralities are preserved when reverse-engineering GRNs from gene expression data.

2.2.2 Material and methods

In order to investigate how well overall network centralities are preserved between the true, i.e. reference, GRN and reverse-engineered networks, we employed a procedure that can be summarized by the following major steps. Using the GeneNetWeaver software developed by Schaffter et al. [103], which implements among other things an ODE/SDE based framework for gene expression simulation, we started by generating gene expression data for a set of 200 reference networks with 200 or 250 vertices each. From these gene expression data, we then reverse-engineered GRNs using four different network inference algorithms, thus producing matched pairs of reference and inferred networks. In order to confirm that the network inference worked as expected, we then investigated the auPR and auROC values for the reverse-engineered networks and compared the findings to the results of a similar inspection performed by Schaffter et al. [103]. Finally, in order to determine the conservation of centralities between reference and inferred networks, we computed two centrality measures, i.e. degree centrality and betweenness centrality, and compared the obtained centralities between paired reference and inferred networks either visually or in terms of weighted rank correlation coefficients (weighted Kendall’s tau).
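The final comparison step can be illustrated with a minimal Python sketch. It computes degree centralities for a paired reference and inferred network (edge direction is ignored here, which is a simplifying assumption) and a weighted rank correlation in which pairs involving top-ranked reference vertices receive larger weights. The additive hyperbolic weighting and the lack of a tie correction are illustrative choices and are not claimed to reproduce the exact weighted Kendall's tau used in the paper; betweenness centrality is omitted for brevity.

```python
def degree_from_edges(edges, n):
    """Degree of each of n vertices given an edge list, ignoring direction."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def weighted_kendall_tau(x, y):
    """Weighted concordance of y with x; top-ranked items in x weigh more."""
    n = len(x)
    order = sorted(range(n), key=lambda i: (-x[i], i))
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Hyperbolic weight: pairs with highly ranked members count more
            w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
            sx = (x[i] > x[j]) - (x[i] < x[j])
            sy = (y[i] > y[j]) - (y[i] < y[j])
            num += w * sx * sy
            den += w
    return num / den

# Toy reference and inferred networks on four vertices (illustrative only)
reference_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
inferred_edges = [(0, 1), (0, 2), (2, 3)]
ref_deg = degree_from_edges(reference_edges, 4)
inf_deg = degree_from_edges(inferred_edges, 4)
tau = weighted_kendall_tau(ref_deg, inf_deg)
```

Because tied pairs add to the denominator but not to the numerator, this variant stays below 1 whenever the centrality profiles contain ties, even if both profiles are identical.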


2.2.3 Results and discussions

As judged by the auROC and auPR results, the network inference demonstrated in this paper revealed an edge prediction accuracy comparable to what has previously been reported. However, a further investigation of rank correlation coefficients of centrality profiles between reference and inferred networks suggested at best a modest conservation of the overall topological structure. Of note, the best agreement of centrality profiles was often observed when including only a small fraction of highly scored edges into the inferred network. This finding relates directly to the information relayed by the precision-recall curves for the respective network inference methods, which suggested that the techniques are highly confident in predicting a fraction of network edges, but the accuracy drops when attempting to reconstruct a larger proportion of the interactome. As a consequence, we propose the following scenario: when including a small number of highly scored edges, the inferred network contains mainly accurate interactions but only a sparsely reconstructed overall topology, thus at best giving rise to a modest agreement of centralities. When instead including more edges, which then display a decreasing prediction confidence, the network reconstruction becomes faulty and the agreement of centrality profiles between reference and inferred networks decreases even further.


2.3 Paper III

Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2016). “Prediction of high centrality nodes from reverse-engineered transcriptional regulator networks”. In: SMTDA 2016 Proceedings: 4th Stochastic Modeling Techniques and Data Analysis International Conference /[ed] Christos H. Skiadas, ISAST: International Society for the Advancement of Science and Technology, pp. 517-531.

2.3.1 Context

This study represents a continuation of the work described in Paper II by extending the analyses to other types of reference networks and investigation strategies.

2.3.2 Background and aims

The investigation of centrality conservation in reverse-engineered GRNs presented in Paper II was limited in at least four respects: (i) only small-scale reference networks were employed, (ii) networks were inferred from synthetic gene expression data only, (iii) information about known transcription factors was ignored, and (iv) only a small selection of network inference methods was applied. Thus, to produce a more extensive view on the subject, Paper III revisited the question of centrality conservation in reverse-engineered GRNs by utilizing (i) larger-scale reference networks, (ii) both synthetic and real microarray gene expression data, (iii) a network pruning strategy that would only allow known transcription factors to interact with other genes, and (iv) a more comprehensive list of network inference algorithms. Furthermore, since a typical centrality-based candidate gene filtering would predominantly focus on high-centrality vertices, we also aimed to investigate more explicitly whether high-centrality vertices overlap between reference and inferred networks.

2.3.3 Material and methods

We used three reference networks and accompanying gene expression data and lists of transcription factors from the DREAM5 challenge [79]. One of these datasets represented a synthetic network with simulated gene expression data, while the other two represented knowledge-based networks for Escherichia coli and Saccharomyces cerevisiae accompanied by real microarray gene expression data. From the gene expression data, we inferred GRNs using seven different network inference algorithms, which were then pruned by removing any edge that did not start at a TF-representing vertex, followed by an inspection of precision-recall curves in order to evaluate inference performance. As in Paper II, we then investigated the conservation of degree and betweenness centralities between reference and inferred networks, using either a direct visualization or weighted rank correlation coefficients. Finally, to determine whether vertices with a high centrality in the reference network would also exhibit a high centrality in the corresponding inferred network, we extracted the 5% or 10% of vertices with highest centralities from reference and inferred networks, respectively, and compared their overlap using the representation factor and Fisher’s exact test. In addition to the standard inference methods, we also created one community network, established by scoring edges with the mean rank of the edge weight across all network methods [79, 108], and one approach where we averaged the centrality of nodes across networks from all inference methods before ranking genes.
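The overlap comparison of high-centrality vertices can be illustrated with a minimal sketch. The function names, the tie handling, and the input layout (a dictionary mapping each vertex to its centrality) are assumptions made for illustration and are not taken from Paper III. The representation factor is computed here as the observed overlap divided by the overlap expected by chance, and the p-value is the upper hypergeometric tail, which corresponds to a one-sided Fisher's exact test for enrichment.

```python
from math import comb


def top_fraction(centrality, fraction):
    """Return the set of vertices with the highest centrality values.

    centrality: dict mapping vertex -> centrality (hypothetical layout).
    fraction:   share of vertices to keep, e.g. 0.05 or 0.10.
    """
    k = max(1, round(len(centrality) * fraction))
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    return set(ranked[:k])


def overlap_statistics(ref_top, inf_top, n_vertices):
    """Overlap size, representation factor and one-sided p-value."""
    k = len(ref_top & inf_top)
    a, b = len(ref_top), len(inf_top)
    # Overlap expected if both vertex sets were drawn independently at random.
    expected = a * b / n_vertices
    rep_factor = k / expected
    # P(overlap >= k) under the hypergeometric null distribution.
    tail = sum(
        comb(a, i) * comb(n_vertices - a, b - i)
        for i in range(k, min(a, b) + 1)
    )
    p_value = tail / comb(n_vertices, b)
    return k, rep_factor, p_value
```

A representation factor above 1 indicates more shared high-centrality vertices than expected by chance, while the p-value indicates whether that excess is statistically significant.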

2.3.4 Results and discussions

The comparison of degree and betweenness centralities between the synthetic reference and inferred networks revealed at best a modest correlation, while a similar comparison for the biological datasets did not show any discernible correlation of centralities between reference and inferred networks. A comparison of high-centrality genes between the reference network and reconstructed GRNs revealed a significant overlap for high-degree genes in all networks inferred from the synthetic data. However, only a subset of methods produced significant overlaps for high-betweenness genes in the synthetic network or for high-degree/high-betweenness genes in the E. coli network, while there were no significant overlaps for the S. cerevisiae network. Interpreting these findings and their meaning for cancer gene prioritization from reverse-engineered networks is challenging. Specifically, the results obtained for networks inferred from synthetic data, despite displaying at best a modest agreement between overall centrality measures, suggested that the selection of high-degree vertices would enrich for genes that display high degree centralities also in the true interactome. However, the reduced edge inference accuracy, loss of overall conservation of centralities, and diminished overlap of high-centrality genes in the biological networks and especially for the S. cerevisiae dataset warrant caution. Possible explanations for the observed discrepancy might be different transcriptional complexities between real microarray data and simulated gene expression [79], or inconsistencies between knowledge-based networks and the actual biological systems from which the gene expression data is recorded. Further research will be required to fully elucidate this issue.


2.4 Paper IV

Holger Weishaupt, Patrik Johansson, Anders Sundström, Zelmina Lubovac-Pilav, Björn Olsson, Sven Nelander, Fredrik J. Swartling. (2019). “Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes”. Bioinformatics, epub ahead of print.

2.4.1 Context

An overarching goal of our research was to study the cancer gene landscape of MB in the context of system-wide transcriptional regulations modeled by reverse-engineered GRNs. In order to enable the investigation of differential expression patterns and transcriptional perturbations in these networks, we aimed to infer networks from gene expression data comprising samples from both MB and normal cerebellum, which represents the likely brain region of origin for many MBs. To achieve robust and reproducible results, the respective inference methods typically require gene expression data with large sample sizes. Paper IV deals with the generation of such a large-scale gene expression dataset.

2.4.2 Background and aims

There exists a multitude of publicly available datasets with gene expression data from MBs or normal cerebellum that could be used for network inference purposes. However, such datasets typically comprise only a small number of samples, and any single dataset usually consists exclusively of either MB samples or cerebellar samples, but rarely both. Thus, in order to generate a larger-scale gene expression resource comprising both MB and cerebellar samples, this paper aimed to merge a selection of individual datasets into an integrated gene expression table, while removing putative batch effects, i.e. systematic biases/trends in gene expression values between these datasets. Such a merged dataset would then be useful for various downstream purposes, such as the inference of GRNs.

2.4.3 Material and methods

We pre-processed and combined a total of 23 gene expression datasets generated by four different Affymetrix microarray platforms, thus producing a merged dataset with 1350 MB samples and 291 cerebellar samples. To ensure that the subgroup labels of annotated MB samples were reliable and in order to predict labels for those samples lacking subgroup annotation, we then performed an extensive re-classification analysis of all MB samples.


Subsequently, in order to remove batch effects in the merged data, we employed the Removal of Unwanted Variation (RUV) algorithm [142, 143], which normalizes gene expression data via the use of negative control genes, i.e. genes that are assumed to be expressed at more or less constant levels across all samples in the data. Specifically, in the context of the merged data, negative control genes would thus be expected to display low gene expression variation within phenotypes, i.e. within MB subgroups (WNT, SHH, G3, G4) and within normal cerebellum, and between phenotypes, i.e. between MB subgroups and between MB and cerebellum. To empirically determine a suitable set of negative controls from the collected datasets, we thus proposed a scoring framework based on the computation of the relative mean absolute deviation (RMD) within phenotypes and analysis of variance (ANOVA) tests between phenotypes.
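The two ingredients of such a scoring framework can be sketched as follows. This is only an illustration under stated assumptions: the RMD is taken here to be the mean absolute deviation around the mean divided by the mean, the ANOVA is a plain one-way F statistic, and the function names and the per-gene input layout (phenotype label mapped to a list of expression values) are hypothetical. How Paper IV combines the two scores into a final ranking is not specified in this summary and is therefore not shown.

```python
from statistics import mean


def relative_mean_abs_deviation(values):
    """Mean absolute deviation around the mean, divided by the mean.

    Assumes expression values with a non-zero mean (e.g. log-scale
    microarray intensities).
    """
    m = mean(values)
    return mean(abs(v - m) for v in values) / abs(m)


def anova_f(groups):
    """One-way ANOVA F statistic for a list of per-phenotype value lists."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def score_gene(expr_by_phenotype):
    """Return (worst within-phenotype RMD, between-phenotype F statistic).

    expr_by_phenotype: dict such as {"WNT": [...], "SHH": [...], "CB": [...]}
    (hypothetical layout). A good negative control has both values low.
    """
    groups = list(expr_by_phenotype.values())
    worst_rmd = max(relative_mean_abs_deviation(g) for g in groups)
    return worst_rmd, anova_f(groups)
```

Taking the maximum RMD across phenotypes is a conservative design choice: a gene is penalized if it varies strongly in any single phenotype.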
In order to visualize the presence of batch effects before and after RUV normalization we employed three types of visual tools, i.e. (i) relative log expression (RLE) plots, (ii) multidimensional scaling (MDS) / principal component analysis (PCA) plots, and (iii) hierarchical clustering (HC). Furthermore, to be able to compare different parameter choices for the RUV method and to select the putative best performing RUV setup, we also utilized a panel of six quantitative metrics for evaluating various properties of the data before and after normalization.
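As an illustration of the first visual tool, the values underlying an RLE plot can be computed as in the sketch below. It assumes the common definition in which each gene's log-scale expression is centered on that gene's median across all samples; the function names and the input layout (gene mapped to a list of values, one per sample) are hypothetical. In well-normalized data the per-sample RLE distributions are expected to be centered near zero, so a sample or batch whose median is clearly shifted hints at a remaining batch effect.

```python
from statistics import median


def relative_log_expression(log_expr):
    """Subtract each gene's median (across samples) from its values.

    log_expr: dict mapping gene -> list of log-scale expression values,
    one value per sample, same sample order for every gene.
    """
    rle = {}
    for gene, values in log_expr.items():
        gene_median = median(values)
        rle[gene] = [v - gene_median for v in values]
    return rle


def sample_medians(rle):
    """Median RLE value per sample, i.e. the center of each RLE boxplot."""
    n_samples = len(next(iter(rle.values())))
    return [
        median(values[j] for values in rle.values())
        for j in range(n_samples)
    ]
```

Plotting one boxplot of RLE values per sample, ordered by dataset or platform, then gives the RLE plot itself.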

2.4.4 Results and discussion

The empirically defined negative controls displayed relatively low gene expression variation within and between phenotypes, but increased variability between individual datasets and especially between microarray platforms, thus highlighting these genes as a promising resource for batch effect removal.

Using the identified negative controls in conjunction with the RUV algorithm and the visual and quantitative metrics, we were then able to select an arguably best performing choice of RUV parameters, thus producing a dataset that displayed a clear removal of many of the platform-related differences observed in the merged data prior to normalization.
In summary, we have (i) documented an approach for empirically defining negative control genes, (ii) outlined a framework for evaluating RUV normalizations, and (iii) employed these methods to produce an integrated, batch-corrected resource of MB and cerebellar transcription data, spanning an unprecedented number of 1350 MB and 291 cerebellar samples.


2.5 Paper V

Holger Weishaupt, Patrik Johansson, Christopher Engström, Sven Nelander, Sergei Silvestrov, Fredrik J. Swartling. (2019). “Prioritization of candidate cancer genes on chromosome 17q through reverse-engineered transcriptional regulatory networks in medulloblastoma groups 3 and 4”. Manuscript.

2.5.1 Context

While the previous sections and papers included in this thesis were concerned with establishing the theory or data needed for network-based cancer gene discovery, Paper V deals with a practical application of related techniques to the field of MB.

2.5.2 Background and aims

As discussed in the introduction, recurrent genomic alterations have been found for a majority of Group 3 and Group 4 MBs [25], but there is still very little known about the genetic events that lead to tumor development in a large fraction of such patients. Among the most frequently observed genomic alterations in these subgroups is the amplification of chromosome 17q (chr 17q), an event that has previously been shown to be associated with poor survival [144]. However, it remains to be resolved if there is a particular driver gene or set of interacting driver genes located on chr 17q that might be responsible for producing the observed pathogenesis. To address this open question, this paper aimed to reverse-engineer GRNs for Group 3 and Group 4 MBs and then utilize these networks to prioritize candidate genes on chr 17q based on their proximity to genes recurrently mutated in these subgroups.

2.5.3 Material and methods

As a starting point, we obtained published gene expression data for 144 Group 3 and 326 Group 4 MB samples [47]. The data was then filtered to select meaningful genes and samples for network inference. Specifically, we retained only genes that (i) were recurrently mutated in these subgroups [25], (ii) resembled chr 17q candidate genes, i.e. genes which were overexpressed in these MBs as compared to cerebellum, or (iii) represented chr 17q related genes, i.e. genes that displayed differential expression between samples with and without chr 17q amplification. Furthermore, only samples without broad chr 17q amplifications were included. From the resulting gene expression tables we then reverse-engineered GRNs using ten different network inference algorithms. The individual network predictions were integrated into one ensemble network for Group 3 MB and one ensemble network for Group 4 MB, respectively, and these networks were further pruned to produce approximately scale-free GRNs. Finally, exploiting these GRNs, we prioritized the candidate chr 17q genes with respect to their proximity to the known MB cancer genes, i.e. recurrently mutated genes, via the RWR (random walker with restart) strategy [125].
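The prioritization step can be illustrated with a minimal sketch of a random walk with restart on a weighted, undirected network. It is written in the generic textbook form p(t+1) = (1 - r) W p(t) + r p(0), where W is the transition matrix obtained by normalizing each vertex's outgoing edge weights and p(0) places equal probability on the seed genes. The function name, the restart probability of 0.3, the convergence settings, and the adjacency layout are assumptions for illustration, not the configuration used in Paper V.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker restarting at seeds.

    adjacency: dict vertex -> dict neighbour -> weight; every neighbour
               must itself be a key (hypothetical layout).
    seeds:     non-empty collection of seed vertices, e.g. known cancer genes.
    """
    nodes = list(adjacency)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    strength = {v: sum(adjacency[v].values()) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: jump back to the seed distribution.
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            if strength[u] == 0:
                # Isolated vertex: the walker has nowhere to go and stays.
                nxt[u] += (1 - restart) * p[u]
                continue
            for v, w in adjacency[u].items():
                # Move along an edge with probability proportional to weight.
                nxt[v] += (1 - restart) * p[u] * w / strength[u]
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

Candidate genes would then be ranked by their steady-state probability, with higher values indicating closer network proximity to the seed genes.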

2.5.4 Results and discussion

The prioritization of genes via the RWR algorithm was able to detect a number of genes, the overexpression or amplification of which had previously been linked to MB or other cancers, thus suggesting that the described framework might be able to also predict novel MB related genes. While the exact detection performance or enrichment statistics will have to be investigated in more detail, the study has already identified KIF18B as a novel, putative cancer gene associated with Group 4 MBs. To our knowledge, this gene has not yet been implicated in MB pathogenesis, but we found that patients with high expression of KIF18B exhibited on average a shorter overall survival, even if only cases without chr 17q arm amplifications were compared.

2.5.5 Future perspectives

This paper represents an initial effort to investigate putative MB cancer genes in the context of subgroup-specific transcriptional regulatory networks and to predict novel MB-related cancer genes from such networks. Further work will be required to expand the study, e.g. by including additional omics data types, inferring larger-scale networks, and utilizing additional, complementary cancer gene prediction algorithms. However, the identification of KIF18B as a novel, putative cancer gene in Group 4 MBs has demonstrated the potential of the outlined method, and follow-up experiments are planned in order to further elucidate the role of KIF18B in MB biology.


References

[1] Otte, E. and Rousseau, R. (2002). “Social network analysis: a powerful strategy, also for the information sciences”. Journal of information Science, 28(6), 441-453.
[2] Zhu, X. et al. (2007). “Getting connected: analysis and principles of biological networks”. Genes Dev, 21: 1010-1024.
[3] Basheer, I. A. and Hajmeer, M. (2000). “Artificial neural networks: fundamentals, computing, design, and application”. Journal of microbiological methods, 43(1), 3-31.
[4] Amaral, L. A. N. et al. (2000). “Classes of small-world networks”. PNAS, 97(21), 11149-11152.
[5] Watts, D. J. and Strogatz, S. H. (1998). “Collective dynamics of ‘small-world’ networks”. Nature, 393(6684), 440.
[6] Barabási, A. L. and Albert, R. (1999). “Emergence of scaling in random networks”. Science, 286(5439), 509-512.
[7] Gilbert, E. N. (1959). “Random graphs”. The Annals of Mathematical Statistics, 30: 1141-1144.
[8] Barabási, A. L. (2016). “Network science”. Cambridge university press.
[9] Yu, H. et al. (2007). “The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics”. PLoS computational biology, 3(4), e59.
[10] Freeman, L. C. (1977). “A set of measures of centrality based on betweenness”. Sociometry, 40, 35-41.
[11] Fortunato, S. (2010). “Community detection in graphs”. Physics reports, 486(3), 75-174.
[12] Schaeffer, S. E. (2007). “Graph clustering”. Computer science review, 1(1), 27-64.
[13] Hastie, T. et al. (2009). “The elements of statistical learning”. Springer, New York, NY, 2009. p. 520.
[14] Masuda, N. et al. (2017). “Random walks and diffusion on networks”. Physics Reports, 716, 1-58.
[15] Cowen, L. et al. (2017). “Network propagation: a universal amplifier of genetic associations”. Nature Reviews Genetics, 18(9), 551.
[16] Ruddon, R. W. (2007). “Cancer biology”. Oxford University Press.
[17] Weinberg, R. (2013). “The biology of cancer”. Garland science.
[18] Hanahan, D. and Weinberg, R. A. (2000). “The hallmarks of cancer”. Cell, 100(1), 57-70.
[19] Hanahan, D. and Weinberg, R. A. (2011). “Hallmarks of cancer: the next generation”. Cell, 144(5), 646-674.
[20] Kandoth, C. et al. (2013). “Mutational landscape and significance across 12 major cancer types”. Nature, 502(7471), 333-339.
[21] Vogelstein, B. et al. (2013). “Cancer genome landscapes”. Science, 339(6127), 1546-1558.
[22] Stratton, M. R. et al. (2009). “The cancer genome”. Nature, 458(7239), 719-724.
[23] Vogelstein, B. and Kinzler, K. W. (2004). “Cancer genes and the pathways they control”. Nature medicine, 10(8), 789-799.
[24] Forbes, S. A. et al. (2015). “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer”. Nucleic acids research, 43(D1), D805-D811.
[25] Northcott, P. A. et al. (2017). “The whole-genome landscape of medulloblastoma subtypes”. Nature, 547(7663), 311-317.
[26] Ceccarelli, M. et al. (2016). “Molecular profiling reveals biologically discrete subsets and pathways of progression in diffuse glioma”. Cell, 164(3), 550-563.
[27] Taylor, M. D. et al. (2012). “Molecular subgroups of medulloblastoma: the current consensus”. Acta neuropathologica, 123: 465-472.
[28] Sorlie, T. et al. (2003). “Repeated observation of breast tumor subtypes in independent gene expression data sets”. PNAS, 100: 8418-8423.
[29] Guinney, J. et al. (2015). “The consensus molecular subtypes of colorectal cancer”. Nat Med, 21: 1350-1356.
[30] Deeken, J. F. and Löscher, W. (2007). “The blood-brain barrier and cancer: transporters, treatment, and Trojan horses”. Clinical Cancer Research, 13(6), 1663-1674.
[31] Pardridge, W. M. (2005). “The blood-brain barrier: bottleneck in brain drug development”. NeuroRx, 2(1), 3-14.
[32] Louis, D. N. et al. (2016). “WHO classification of Tumours of the Central Nervous System”, World Health Organisation.
[33] Smoll, N. R. and Drummond, K. J. (2012). “The incidence of medulloblastomas and primitive neurectodermal tumours in adults and children”. Journal of Clinical Neuroscience, 19(11), 1541-1544.
[34] Northcott, P. A. et al. (2012). “Medulloblastomics: the end of the beginning”. Nature Reviews Cancer, 12(12), 818-834.
[35] Crawford, J. R. et al. (2007). “Medulloblastoma in childhood: new biological advances”. Lancet Neurol, 6(12): 1073-1085.
[36] Spiegler, B. J. et al. (2004). “Change in neurocognitive functioning after treatment with cranial radiation in childhood”. J Clin Oncol, 22(4): 706-713.
[37] Kool, M. et al. (2012). “Molecular subgroups of medulloblastoma: an international meta-analysis of transcriptome, genetic aberrations, and clinical data of WNT, SHH, Group 3, and Group 4 medulloblastomas”. Acta neuropathologica, 123(4), 473-484.
[38] Packer, R. J. et al. (2013). “Survival and secondary tumors in children with medulloblastoma receiving radiotherapy and adjuvant chemotherapy: results of Children’s Oncology Group trial A9961”. Neuro Oncol, 15(1): 97-103.
[39] Louis, D. N. et al. (2007). “The 2007 WHO classification of tumours of the central nervous system”. Acta neuropathologica, 114(2), 97-109.
[40] Kool, M. et al. (2008). “Integrated genomics identifies five medulloblastoma subtypes with distinct genetic profiles, pathway signatures and clinicopathological features”. PloS one,
PloS one, 3(8), e3088. 3(8), e3088. [41] Thompson, M. C. et al. (2006). “Genomics identifies medulloblastoma subgroups that are [41] Thompson, M. C. et al. (2006). “Genomics identifies medulloblastoma subgroups that are enriched for specific genetic alterations". Journal of Clinical Oncology, 24(12), 1924-1931. enriched for specific genetic alterations". Journal of Clinical Oncology, 24(12), 1924-1931. [42] Cho, Y. J. et al. (2011). “Integrative genomic analysis of medulloblastoma identifies a mo- [42] Cho, Y. J. et al. (2011). “Integrative genomic analysis of medulloblastoma identifies a mo- lecular subgroup that drives poor clinical outcome". Journal of Clinical Oncology, 29(11), lecular subgroup that drives poor clinical outcome". Journal of Clinical Oncology, 29(11), 1424-1430. 1424-1430. [43] Northcott, P. A. et al. (2011). “Medulloblastoma comprises four distinct molecular vari- [43] Northcott, P. A. et al. (2011). “Medulloblastoma comprises four distinct molecular vari- ants". Journal of Clinical Oncology, 29(11), 1408-1414. ants". Journal of Clinical Oncology, 29(11), 1408-1414. [44] Hovestadt, V. et al. (2013). “Robust molecular subgrouping and copy-number profiling [44] Hovestadt, V. et al. (2013). “Robust molecular subgrouping and copy-number profiling of medulloblastoma from small amounts of archival tumour material using high-density of medulloblastoma from small amounts of archival tumour material using high-density DNA methylation arrays". Acta neuropathologica, 125(6), 913. DNA methylation arrays". Acta neuropathologica, 125(6), 913. [45] Schwalbe, E. C. et al. (2013). “DNA methylation profiling of medulloblastoma allows ro- [45] Schwalbe, E. C. et al. (2013). “DNA methylation profiling of medulloblastoma allows ro- bust subclassification and improved outcome prediction using formalin-fixed biopsies". bust subclassification and improved outcome prediction using formalin-fixed biopsies". Acta neuropathologica, 125(3), 359-371. 
Acta neuropathologica, 125(3), 359-371. [46] Northcott, P. A. et al. (2012). “The clinical implications of medulloblastoma subgroups". [46] Northcott, P. A. et al. (2012). “The clinical implications of medulloblastoma subgroups". Nature Reviews Neurology, 8(6), 340-351. Nature Reviews Neurology, 8(6), 340-351. [47] Cavalli, F. M. et al. (2017). “Intertumoral Heterogeneity within Medulloblastoma Sub- [47] Cavalli, F. M. et al. (2017). “Intertumoral Heterogeneity within Medulloblastoma Sub- groups". Cancer Cell, 31(6), 737-754. groups". Cancer Cell, 31(6), 737-754. [48] Schwalbe, E. C. et al. (2017). “Novel molecular subgroups for clinical classification and [48] Schwalbe, E. C. et al. (2017). “Novel molecular subgroups for clinical classification and outcome prediction in childhood medulloblastoma: a cohort study". The Lancet Oncology, outcome prediction in childhood medulloblastoma: a cohort study". The Lancet Oncology, 18(7), 958-971. 18(7), 958-971.

[49] Northcott, P. A. et al. (2012). "Molecular subgroups of medulloblastoma". Expert review of neurotherapeutics, 12(7), 871-884.
[50] Robinson, G. et al. (2012). "Novel mutations target distinct subgroups of medulloblastoma". Nature, 488(7409), 43-48.
[51] Gibson, P. et al. (2010). "Subtypes of medulloblastoma have distinct developmental origins". Nature, 468(7327), 1095-1099.
[52] Hamilton, S. R. et al. (1995). "The molecular basis of Turcot’s syndrome". New England Journal of Medicine, 332(13), 839-847.
[53] Jones, D. T. et al. (2012). "Dissecting the genomic complexity underlying medulloblastoma". Nature, 488(7409), 100-105.
[54] Northcott, P. A. et al. (2012). "Subgroup-specific structural variation across 1,000 medulloblastoma genomes". Nature, 488(7409), 49-56.
[55] Pugh, T. J. et al. (2012). "Medulloblastoma exome sequencing uncovers subtype-specific somatic mutations". Nature, 488(7409), 106-110.
[56] Pöschl, J. et al. (2014). "Genomic and transcriptomic analyses match medulloblastoma mouse models to their human counterparts". Acta neuropathologica, 128(1), 123-136.
[57] Wu, X. et al. (2011). "Mouse models of medulloblastoma". Chinese journal of cancer, 30(7), 442.
[58] Kawauchi, D. et al. (2012). "A mouse model of the most aggressive subgroup of human medulloblastoma". Cancer cell, 21(2), 168-180.
[59] Pei, Y. et al. (2012). "An animal model of MYC-driven medulloblastoma". Cancer cell, 21(2), 155-167.
[60] Swartling, F. J. et al. (2012). "Distinct neural stem cell populations give rise to disparate brain tumors in response to N-MYC". Cancer cell, 21(5), 601-613.
[61] Hill, R. M. et al. (2015). "Combined MYC and P53 Defects Emerge at Medulloblastoma Relapse and Define Rapidly Progressive, Therapeutically Targetable Disease". Cancer Cell, 27, 72-84.
[62] Archer, T. C. et al. (2018). "Proteomics, post-translational modifications, and integrative analyses reveal molecular heterogeneity within medulloblastoma subgroups". Cancer cell, 34(3), 396-410.
[63] Vogelstein, B. and Kinzler, K. W. (1993). "The multistep nature of cancer". Trends in genetics, 9(4), 138-141.
[64] Weinberg, R. A. (1989). "Oncogenes, antioncogenes, and the molecular bases of multistep carcinogenesis". Cancer Research, 49(14), 3713-3721.
[65] Loeb, L. A. et al. (2003). "Multiple mutations and cancer". PNAS, 100(3), 776-781.
[66] Hirschhorn, J. N. and Daly, M. J. (2005). "Genome-wide association studies for common diseases and complex traits". Nature Reviews Genetics, 6(2), 95-108.
[67] Easton, D. F. and Eeles, R. A. (2008). "Genome-wide association studies in cancer". Human Molecular Genetics, 17(R2), R109-R115.
[68] Visscher, P. M. et al. (2012). "Five years of GWAS discovery". The American Journal of Human Genetics, 90(1), 7-24.
[69] Mattison, J. et al. (2009). "Cancer gene discovery in mouse and man". Biochimica et Biophysica Acta (BBA)-Reviews on Cancer, 1796(2), 140-161.
[70] Lizardi, P. M. et al. (2011). "Genome-wide approaches for cancer gene discovery". Trends in biotechnology, 29(11), 558-568.
[71] Ranzani, M. et al. (2013). "Cancer gene discovery: exploiting insertional mutagenesis". Molecular Cancer Research, 11(10), 1141-1158.
[72] Uren A.G. et al. (2005). "Retroviral insertional mutagenesis: past, present and future". Oncogene, 24: 7656-7672.

[73] Collier, L. S. and Largaespada, D. A. (2005). "Hopping around the tumor genome: transposons for cancer gene discovery". Cancer research, 65(21), 9607-9610.
[74] Rad, R. et al. (2010). "PiggyBac transposon mutagenesis: a tool for cancer gene discovery in mice". Science, 330(6007), 1104-1107.
[75] Marx, V. (2014). "Cancer genomes: discerning drivers from passengers". Nature methods, 11(4), 375.
[76] Pon, J. R. and Marra, M. A. (2015). "Driver and passenger mutations in cancer". Annual Review of Pathology: Mechanisms of Disease, 10, 25-50.
[77] Zampieri, G. et al. (2018). "Scuba: scalable kernel-based gene prioritization". BMC bioinformatics, 19(1), 23.
[78] Pavlopoulos, G. A. et al. (2011). "Using graph theory to analyze biological networks". BioData mining, 4(1), 10.
[79] Marbach D. et al. (2012). "Wisdom of crowds for robust gene network inference". Nat Methods, 9: 796-804.
[80] Emmert-Streib, F. et al. (2014). "Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks". Frontiers in cell and developmental biology, 2.
[81] Qin J. et al. (2014). "Inferring gene regulatory networks by integrating ChIP-seq/chip and transcriptome data via LASSO-type regularization methods". Methods, 67: 294-303.
[82] Wang S. et al. (2013). "Target analysis by integration of transcriptome and ChIP-seq data with BETA". Nat Protoc, 8: 2502-2515.
[83] Reece-Hoyes, J. S. et al. (2011). "Yeast one-hybrid assays for gene-centered human gene regulatory network mapping". Nature methods, 8(12), 1050-1052.
[84] Walhout, A. J. (2011). "Gene-centered regulatory network mapping". Methods in cell biology, 106.
[85] Someren, E. V. et al. (2002). "Genetic network modeling". Pharmacogenomics, 3(4), 507-525.
[86] Kumari, S. et al. (2012). "Evaluation of gene association methods for coexpression network construction and biological knowledge discovery". PloS one, 7(11), e50411.
[87] Borate, B. R. et al. (2009). "Comparison of threshold selection methods for microarray gene co-expression matrices". BMC research notes, 2(1), 240.
[88] Zhang, B. and Horvath, S. (2005). "A general framework for weighted gene co-expression network analysis". Statistical applications in genetics and molecular biology, 4(1).
[89] Langfelder, P. and Horvath, S. (2008). "WGCNA: an R package for weighted correlation network analysis". BMC bioinformatics, 9(1): 559.
[90] Cover, T. M., and Thomas, J. A. (2012). "Elements of information theory". John Wiley & Sons.
[91] Butte, A. J. and Kohane, I. S. (2000). "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements". Pac Symp Biocomput, 418-429.
[92] Margolin A.A. et al. (2006). "ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context". BMC Bioinformatics, 7.
[93] Faith J.J. et al. (2007). "Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles". PLoS biology, 5(1), e8.
[94] Meyer, P. E. et al. (2007). "Information-theoretic inference of large transcriptional regulatory networks". EURASIP journal on bioinformatics and systems biology, 2007(1), 79879.
[95] Haury A.C. et al. (2012). "TIGRESS: Trustful Inference of Gene REgulation using Stability Selection". BMC Syst Biol, 6.
[96] Huynh-Thu V.A. et al. (2010). "Inferring Regulatory Networks from Expression Data Using Tree-Based Methods". PLoS One, 5.

[97] Sławek, J. and Arodź, T. (2013). "ENNET: inferring large gene regulatory networks from expression data using gradient boosting". BMC systems biology, 7(1), 106.
[98] Guo, S. et al. (2016). "Gene regulatory network inference using PLS-based methods". BMC bioinformatics, 17(1), 545.
[99] Ruyssinck, J. et al. (2014). "Nimefi: gene regulatory network inference using multiple ensemble feature importance algorithms". PLoS One, 9(3), e92709.
[100] Xiao, Y. (2009). "A tutorial on analysis and simulation of boolean gene regulatory network models". Current genomics, 10(7), 511-525.
[101] Hecker, M. et al. (2009). "Gene regulatory network inference: data integration in dynamic models-a review". Biosystems, 96(1), 86-103.
[102] Bansal, M. et al. (2007). "How to infer gene networks from expression profiles". Molecular systems biology, 3(1), 78.
[103] Schaffter, T. et al. (2011). "GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods". Bioinformatics, 27(16): 2263-2270.
[104] Chen, T. et al. (1999). "Modeling gene expression with differential equations". Pac Symp Biocomput, 29-40.
[105] Kawka, J. et al. (2014). "Revealing the role of SGK1 in the dynamics of medulloblastoma using a mathematical model". Journal of theoretical biology, 354, 105-112.
[106] Lipniacki, T. et al. (2004). "Mathematical model of NF-κB regulatory module". Journal of theoretical biology, 228(2), 195-215.
[107] Mendes, P. et al. (2003). "Artificial gene networks for objective comparison of analysis algorithms". Bioinformatics, 19(suppl 2), ii122-ii129.
[108] Marbach D. et al. (2010). "Revealing strengths and weaknesses of methods for gene network inference". PNAS, 107: 6286-6291.
[109] Liu, Z.P. et al. (2014). "Systematic identification of transcriptional and post-transcriptional regulations in human respiratory epithelial cells during influenza A virus infection". BMC bioinformatics, 15(1): 336.
[110] Oliver, S. (2000). "Proteomics: guilt-by-association goes global". Nature, 403(6770), 601-603.
[111] Wolfe, C. J. et al. (2005). "Systematic survey reveals general applicability of ‘guilt-by-association’ within gene coexpression networks". BMC bioinformatics, 6(1), 227.
[112] Oti, M. and Brunner, H. G. (2007). "The modular nature of genetic diseases". Clinical genetics, 71(1), 1-11.
[113] Barabasi A.L. et al. (2011). "Network medicine: a network-based approach to human disease". Nat Rev Genet, 12: 56-68.
[114] Wang, X. et al. (2011). "Network-based methods for human disease gene prediction". Brief Funct Genomics, 10: 280-293.
[115] Wu, X.B. and Li, S. (2010). "Cancer Gene Prediction Using a Network Approach". Ch Crc Math Comp Bio: 191-212.
[116] Zou, Q. et al. (2014). "Approaches for recognizing disease genes based on network". BioMed research international, 2014.
[117] Navlakha, S. and Kingsford, C. (2010). "The power of protein interaction networks for associating genes with diseases". Bioinformatics, 26(8), 1057-1063.
[118] Oti, M. et al. (2006). "Predicting disease genes using protein-protein interactions". Journal of medical genetics, 43(8), 691-698.
[119] Aragues, R. et al. (2008). "Predicting cancer involvement of genes from heterogeneous data". BMC bioinformatics, 9(1), 172.
[120] Linghu, B. et al. (2009). "Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network". Genome biology, 10(9), R91.
[121] Franke, L. et al. (2006). "Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes". The American Journal of Human Genetics, 78(6), 1011-1025.
[122] Krauthammer, M. et al. (2004). "Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer’s disease". PNAS, 101(42), 15148-15153.
[123] Radivojac, P. et al. (2008). "An integrated approach to inferring gene-disease associations in humans". Proteins: Structure, Function, and Bioinformatics, 72(3), 1030-1037.
[124] Wu, X. et al. (2008). "Network-based global inference of human disease genes". Molecular systems biology, 4(1), 189.
[125] Köhler, S. et al. (2008). "Walking the interactome for prioritization of candidate disease genes". The American Journal of Human Genetics, 82(4), 949-958.
[126] Chen, J. et al. (2009). "Disease candidate gene identification and prioritization using protein interaction networks". BMC bioinformatics, 10(1), 73.
[127] Vanunu, O. and Sharan, R. (2008). "A Propagation-based Algorithm for Inferring Gene-Disease Associations". In German Conference on Bioinformatics, 54-63.
[128] Feldman, I. et al. (2008). "Network properties of genes harboring inherited disease mutations". PNAS, 105(11), 4323-4328.
[129] Goh K.I. et al. (2007). "The human disease network". PNAS, 104(21), 8685-8690.
[130] Mitra, K. et al. (2013). "Integrative approaches for finding modular structure in biological networks". Nature reviews. Genetics, 14(10), 719.
[131] Vlaic, S. et al. (2018). "ModuleDiscoverer: Identification of regulatory modules in protein-protein interaction networks". Scientific reports, 8(1), 433.
[132] Vanunu, O. et al. (2010). "Associating genes and protein complexes with disease via network propagation". PLoS computational biology, 6(1), e1000641.
[133] Ghiassian, S. D. et al. (2015). "A DIseAse MOdule Detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome". PLoS computational biology, 11(4), e1004120.
[134] Leiserson, M. D. et al. (2015). "Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes". Nature genetics, 47(2), 106-114.
[135] Sun, P.G. et al. (2011). "Prediction of human disease-related gene clusters by clustering analysis". Int J Biol Sci, 7(1): 61-73.
[136] Wen, Z. et al. (2013). "An integrated approach to identify causal network modules of complex diseases with application to colorectal cancer". Journal of the American Medical Informatics Association, 20(4), 659-667.
[137] Dimitrakopoulos, C. M. and Beerenwinkel, N. (2017). "Computational approaches for the identification of cancer genes and pathways". Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 9(1).
[138] Wachi S. et al. (2005). "Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues". Bioinformatics, 21: 4205-4208.
[139] Jonsson, P.F. and Bates, P.A. (2006). "Global topological features of cancer proteins in the human interactome". Bioinformatics, 22: 2291-2297.
[140] Özgür A. et al. (2008). "Identifying gene-disease associations using centrality on a literature mined gene-interaction network". Bioinformatics, 24: I277-I285.
[141] Cerami, E. G. et al. (2011). "Pathway Commons, a web resource for biological pathway data". Nucleic acids research, 39(suppl 1), D685-D690.
[142] Gagnon-Bartsch, J. A. and Speed, T. P. (2012). "Using control genes to correct for unwanted variation in microarray data". Biostatistics, 13(3), 539-552.

[143] Jacob, L. et al. (2016). "Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed". Biostatistics, 17(1): 16-28.
[144] Pfister, S. et al. (2009). "Outcome prediction in pediatric medulloblastoma based on DNA copy-number aberrations of chromosomes 6q and 17q and the MYC and MYCN loci". Journal of clinical oncology, 27(10), 1627-1636.
