A Complex Trait Genomics Approach to Investigating Amyotrophic Lateral Sclerosis

Restuadi Bachelor of Science, Master of Bioinformatics (Advanced)

https://orcid.org/0000-0001-8434-4465

A thesis submitted for the degree of Doctor of Philosophy at The University of Queensland in 2020 Institute For Molecular Bioscience

Abstract

Amyotrophic lateral sclerosis (ALS) is the most common form of motor neuron disease (MND). It is a fatal neurodegenerative disease with a lifetime risk of about 1 in 300 worldwide, and death often occurs within 3 to 5 years of symptom onset. Currently, there is no effective treatment and no cure for ALS, in part because of a limited understanding of its aetiology. Most ALS research to date has been driven by discoveries in the familial form of the disease, where causality seems to reflect a heritable single-gene mutation. Only a small proportion of ALS research focuses on the cause of sporadic ALS, which is likely to be a complex disease with many factors contributing to its aetiology, including genetic and environmental risk factors. Traditional ALS genetic research approaches, which rely only on familial data or the single-trait genome-wide association study (GWAS) paradigm, might not be ideal for studying complex disorders like ALS. Therefore, the research presented in this thesis aims to study the aetiology of ALS using statistical techniques developed for the genetic analysis of complex traits, applied to publicly available GWAS summary statistics and multi-omics data. Newly available gut microbiome data were also used to understand the contribution of non-genetic factors to the development of disease. Using more than 200 publicly available sets of GWAS summary statistics from various traits to estimate genetic correlations with ALS, I found negative genetic correlations between ALS and cognitive performance (CP) and between ALS and educational attainment (EA). This analysis also confirmed the previously reported positive correlation with schizophrenia (SCZ). Adding support for these genetic correlations, I was able to improve ALS polygenic risk score prediction using multi-trait methods (MTAG and SMTpred) that leverage these correlated traits.
Furthermore, cell-type enrichment analyses showed that many central nervous system tissues relevant to ALS are also highly significant in its correlated traits (CP, EA, SCZ). Genes expressed in dendritic cells were significantly enriched in the ALS association results, which might suggest immune system involvement in ALS aetiology. Post-GWAS analysis of the largest unpublished ALS GWAS of European ancestry (139,452 individuals) using Bayesian methodology suggested that, compared with other common diseases, the genetic architecture of ALS is likely to be less polygenic, to have a greater relative contribution from rare variants, and to show a stronger indication of negative selection. Using Summary-based Mendelian Randomisation (SMR) analysis, I provided support that ALS-associated SNPs could play a causal role mediated through gene expression in brain and blood (SCFD1, RESP18, MOBP, SLC98A, GGNBP2, DHRS11, ZNHIT3, MYO19, G2E2, SARM1). Moreover, most of these putatively causal genes are annotated to protein processing in the endoplasmic reticulum and to cytoskeletal function, both of which are relevant to previously reported ALS aetiology. By comparing the SMR results with those of the correlated traits, I also found that CP has causal gene overlap with ALS. Moreover, the SMR-significant genes for CP have an inflated SMR P-value distribution, which could suggest functional overlap between the two traits. The gut microbiome study of MND (in which the majority of cases are ALS) suggests that there is no significant difference in composition or richness between ALS patients and healthy controls. The results showed that four ALS patients had dysbiosis, but with our current sample size, the possibility that this dysbiosis is a chance result cannot be excluded. Analysis of ALS patient survival against the microbiome richness index (Shannon index) showed that patients with higher microbiome richness had faster disease progression and mortality. This finding challenges the common notion that higher microbiome richness leads to better health. Despite all of these findings, it is still not clear whether gut microbiome composition contributes to the cause of ALS or is just a consequence of ALS.

Declaration by author

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution of others to jointly-authored works that I have included in my thesis. I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, study design, data analysis, significant technical procedures, professional editorial advice, financial support and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my higher degree by research candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award. I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School. I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis and have sought permission from co-authors for any jointly authored works included in the thesis.

Publications included in this thesis

1. Restuadi Restuadi, Fleur C. Garton, Beben Benyamin, Tian Lin, Kelly L Williams, Anna Vinkhuyzen, Wouter van Rheenen, Zhihong Zhu, Nigel G. Laing, Karen A. Mather, Perminder S. Sachdev, Shyuan T Ngo, Frederik J. Steyn, Leanne Wallace, Anjali K Henders, Peter M Visscher, Merrilee Needham, Susan Mathers, Garth Nicholson, Dominic B. Rowe, Robert D. Henderson, Pamela A. McCombe, Roger Pamphlett, Ian P Blair, Naomi R Wray, Allan F McRae: Polygenic Risk Score Analysis for Amyotrophic Lateral Sclerosis Leveraging Cognitive Performance, Educational Attainment and Schizophrenia, European Journal of Human Genetics (Accepted for Publication), 2020

This publication has been incorporated as Chapter 2.

Contributor Statement of contribution Restuadi Restuadi (Candidate) Analysis and interpretation (80%) Drafting and production (70%) Naomi R Wray Conception and design Allan F McRae Analysis and interpretation (10% total) Drafting and production (15%) Leanne Wallace Genotype data generation Anjali K Henders Fleur C. Garton Analysis and interpretation (10% total) Beben Benyamin Tian Lin Zhihong Zhu Peter M Visscher Anna Vinkhuyzen Wouter van Rheenen Kelly L Williams These authors provided biological samples Nigel G. Laing Karen A. Mather Perminder S. Sachdev Shyuan T Ngo Frederik J. Steyn Merrilee Needham Susan Mathers Garth Nicholson Dominic B. Rowe Robert D. Henderson Pamela A. McCombe Roger Pamphlett Ian P Blair


2. Pamela A McCombe, Robert D Henderson, Aven Lee, John D Lee, Trent M Woodruff, Restuadi Restuadi, Allan F McRae, Naomi R Wray, Shyuan Ngo, Frederik J Steyn: Gut microbiota in ALS: possible role in pathogenesis? Expert review of neurotherapeutics, 1-21: 2019

This publication is a review article about the possible role of the gut microbiome in the pathogenesis of ALS. My supervisors (Naomi R Wray and Allan F McRae) and I contributed the whole of chapter 2 of the paper (Gut microbiome methodology reviews), which makes up ~20% of the published review and describes methodology for the analysis of microbiome data. I incorporated this methodology review into Chapter 4 of this thesis (the gut microbiome chapter, specifically sections 4.1.1 and 4.1.2).

Contributor Statement of contribution Restuadi Restuadi (Candidate) Literature review (80%) Drafting and production (60%) Naomi R Wray Literature review (20% total) Allan F McRae Drafting and production (30%) Frederik J Steyn Drafting and production (10% total)


3. Shyuan T Ngo*, Restuadi Restuadi*, Allan F McRae, Ruben P Van Eijk, Fleur Garton, Robert D Henderson, Naomi R Wray, Pamela A McCombe, Frederik J Steyn: Progression and survival of patients with motor neuron disease relative to their faecal microbiota, Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, DOI: 10.1080/21678421.2020.1772825 : 2020

In this publication, I share first authorship (*) with Dr. Shyuan Ngo. I conducted the analyses and interpretation of the gut microbiome data, and this publication has been incorporated as Chapter 4 (specifically sections 4.2 to 4.5).

Contributor Statement of contribution Restuadi Restuadi* (Candidate) Analysis and interpretation (60%) Drafting and production (40%) Shyuan T Ngo* Analysis and interpretation (20%) Drafting and production (40%) Conception and design (40%) Sample recruitment and processing (30%) Providing clinical data/record (30%) Frederik J Steyn Analysis and interpretation (10%) Drafting and production (10%) Conception and design (50%) Sample recruitment and processing (50%) Providing clinical data/record (30%) Allan F McRae Analysis and interpretation (10% total) Ruben P Van Eijk Drafting and production (10% total) Fleur Garton Conception and design (10% total) Naomi Wray Robert D Henderson Sample recruitment and processing (20% Pamela A McCombe total) Providing clinical data/record (40% total)

Submitted manuscripts included in this thesis

No manuscripts submitted for publication

Other publications during candidature

MF Nabais, T Lin, B Benyamin, KL Williams, FC Garton, AAE Vinkhuyzen, F Zhang, CL Vallerga, R Restuadi, A Freydenzon, RAJ Zwamborn, PJ Hop, MR Robinson, J Gratten, PM Visscher, E Hannon, J Mill, MA Brown, NG Laing, KA Mather, PS Sachdev, ST Ngo, FJ Steyn, L Wallace, AK Henders, M Needham, JH Veldink, S Mathers, G Nicholson, DB Rowe, RD Henderson, PA McCombe, R Pamphlett, J Yang, IP Blair, AF McRae, NR Wray: Significant out-of-sample classification from methylation profile scoring for amyotrophic lateral sclerosis, NPJ Genomic Medicine 5 (10): 2020, https://doi.org/10.1038/s41525-020-0118-3

S Mortlock, RI Kendarsari, JN Fung, G Gibson, F Yang, R Restuadi, JE Girling, SJ Holdsworth-Carson, WT Teh, SW Lukowski, M Healey, T Qi, PAW Rogers, J Yang, B McKinnon, GW Montgomery: Tissue specific regulation of transcription in endometrium and association with disease, Human Reproduction, pp. 1–17 : 2019, doi:10.1093

S Mortlock, R Restuadi, R Levien, JE Girling, SJ Holdsworth-Carson, M Healey, Z Zhu, T Qi, Y Wu, SW Lukowski, PAW Rogers, J Yang, AF McRae, JN Fung, GW Montgomery : Genetic regulation of methylation in human endometrium and blood and gene targets for reproductive diseases, Clinical Epigenetics 11 (1), 49: 2019

L Priliani, EL Prado, R Restuadi, DE Waturangi, AH Shankar, SG Malik: Maternal Multiple Micronutrient Supplementation Stabilizes Mitochondrial DNA Copy Number in Pregnant Women in Lombok, J Nutrition 149(8), 1309–1316: 2019

R Dhenni, MR Karyanti, ND Putri, B Yohan, FA Yudhaputri, CN Ma'roef, A Fadhilah, A Perkasa, R Restuadi, H Trimarsanto, I Mangunatmadja, JP Ledermann, R Rosenberg, AM Powers, KSA Myint, T Sasmono : Isolation and complete genome analysis of neurotropic dengue virus serotype 3 from the cerebrospinal fluid of an encephalitis patient, PLoS Neglected Tropical Diseases 12 (1), e00061983: 2018

A Agustiningsih, H Trimarsanto, R Restuadi, IM Artika, M Hellard, DH Muljono : Evolutionary study and phylodynamic pattern of human influenza A/H3N2 virus in Indonesia from 2008 to 2010, PloS One 13 (8), e0201427: 2018

Contributions by others to the thesis

Several other people have contributed significantly to this thesis. First and foremost, my supervisors Professor Naomi Wray and Dr. Allan McRae, and the co-authors listed on the included publications and manuscripts.

Statement of parts of the thesis submitted to qualify for the award of another degree

No work submitted towards another degree has been included in this thesis.

Research involving human or animal subjects

For Chapter 2 of this thesis, I present new data from an Australian ALS GWAS cohort comprising 836 cases and 665 controls. The sample includes the University of Sydney’s Australian Motor Neuron Disease DNA Bank (MND Bank) cohort, recruited between April 2000 and June 2011 (462 cases, 449 controls), with the study protocol approved by the Sydney South West Area Health Service Human Research Ethics Committee (HREC). The remainder of the cases (N=374) comprised ALS patients recruited from clinics across Australia between 2015 and 2017 under HREC approvals from the University of Sydney, Western Sydney Local Health District, Royal Brisbane and Women’s Hospital, and Macquarie University.

Acknowledgments

First and foremost, I wish to express my sincere appreciation to my principal supervisor, Professor Naomi Wray, who has guided me with her wisdom and perseverance: she convincingly guided and encouraged me to be a proper professional scientist and to keep doing the right thing even when the road got tough. Without her persistent help and scientific insights, this thesis would not have been realised. I have been very lucky to have a supervisor who truly cares about her students' well-being.

Secondly, I want to thank my co-supervisors, Dr Allan McRae and Dr Beben Benyamin, for their continuous support of my well-being during my PhD study. In particular, I would like to express my gratitude to Dr Allan McRae, who has been a father figure throughout my four years of PhD study. His cheerful and free-spirited attitude will always be my inspiration to be a happy scientist with a balanced life. I also highly appreciate the inspiration and support from Dr Beben Benyamin, who sparked my passion for research and who has been supportive in almost all of the technical matters of my PhD, from scientific writing and proofreading to the understanding of basic statistical methods. I also thank Dr Shyuan Ngo and Dr Frederik Steyn for their clinical insights and relentless tenacity in helping ALS patients, which have been a significant inspiration for my research. I also thank my PhD committee members, Dr Cheong Xin Chan and Dr Nicole Warrington, for their suggestions and thoughtful comments.

Thirdly, I would like to thank all members of the Program of Complex Trait Genomics (PCTG) for all their support, friendship, and inspiration during my PhD. In particular, I would like to thank Dr Fleur Garton, Dr. Maciej Trzaskowski, and Professor Jian Yang for being great mentors. I also appreciate the great friendship and fun discussions with my fellow lab-mates (some of them former students): Dr. Yang Wu, Dr. Wenhan Chen, Angli Xue, Ying Wang, Huanwei Wang, Dr. Ting Qi, Marta Nabais, Dr. Zhihong Zhu, Tian Lin, and Dr. Joana Revez. I also acknowledge the PCTG admin team, Earlene Ashton, Suzi Cheshire, and formerly Emily Thomson and Rebecca Richter, for their tremendous help in administrative matters. In addition, I acknowledge the PCTG Human Studies Unit (HSU), led by Anjali Henders and Leanne Wallace, for providing high-quality data for my PhD projects and my published works.

Last but not least, I would like to thank the UQ International Scholarship for financial support during my PhD and PCTG for the top-up scholarship. I also thank Dr Amanda Carrozzi for being a great graduate school representative who always helped me in tough situations.

Financial support

This research was supported by University of Queensland International Scholarships, funding from the National Health and Medical Research Council (NHMRC) (1078901, 1083187, 1113400, 1121962, 1405325, 1084417, and 1079583), an NHMRC/Australian Research Council Strategic Award (401162), the Motor Neurone Disease Research Institute Australia (MNDRIA) Ice Bucket Challenge Grant and the MNDRIA Bill Gole Postdoctoral Fellowship (FCG).

Research Classifications

Keywords

ALS, genetics, GWAS, gut-microbiome, genetic-prediction, genetic-correlation, causal-genes.

Australian and New Zealand Standard Research Classifications (ANZSRC)

ANZSRC code: 060412, Quantitative Genetics, 70%
ANZSRC code: 929201, Clinical Health, 20%
ANZSRC code: 060605, Microbiology, 10%

Fields of Research (FoR) Classification

FoR code: 0604, Genetics, 70%
FoR code: 1109, Neurosciences, 20%
FoR code: 0605, Microbiology, 10%

Dedication

“For all ALS diagnosed patients around the world and people who work hard to end this terrible disease”

Table of Contents

CHAPTER 1: INTRODUCTION
1.1 AMYOTROPHIC LATERAL SCLEROSIS (ALS)
1.1.1 Description of ALS
1.1.2 ALS clinical heterogeneity
1.1.3 Current understanding of ALS aetiology
1.1.4 Known risk factors of ALS
1.2 GENETICS OF ALS
1.2.1 Functional mechanisms of ALS known associated genes
1.2.2 Cell-biology of ALS known genes
1.2.3 ALS genetics current limitations
1.3 OVERVIEW OF THESIS RESEARCH DIRECTION
1.3.1 ALS genetic correlations with its risk factors
1.3.2 Better understanding of ALS aetiology by integrating multi-omics data
1.3.3 Explore the possibilities of the involvement of Microbiome data
1.4 REFERENCES

CHAPTER 2: ALS WITH RISK FACTORS
2.1 ABSTRACT
2.2 INTRODUCTION
2.3 MATERIAL AND METHOD
2.3.1 Australian ALS GWAS data
2.3.2 GWAS meta-analysis: Australian cohort and ALS European
2.3.3 Selection of correlated traits
2.3.4 Polygenic risk scores
2.4 RESULTS
2.4.1 GWAS meta-Analysis: Australian cohort and European
2.4.2 Selection of correlated traits
2.4.3 Polygenic Risk Scores
2.5 DISCUSSION
2.6 ACKNOWLEDGEMENTS
2.7 REFERENCES
2.8 SUPPLEMENTARY MATERIALS

CHAPTER 3: NEW INSIGHTS FROM POST-GWAS ANALYSIS OF ALS
3.1 INTRODUCTION
3.2 MATERIAL AND METHOD
3.2.1 ALS European GWAS 2019 dataset (unpublished)
3.2.2 Polygenic risk scores
3.2.3 Genetic architecture
3.2.4 Cell-type analysis
3.2.5 SMR analysis of ALS
3.2.6 SMR results comparison of ALS, CP, EA, and SCZ
3.3 RESULTS
3.3.1 Polygenic risk scores
3.3.2 Genetic architecture
3.3.3 Cell-Type analysis
3.3.4 SMR analysis
3.4 DISCUSSION
3.4.1 Prediction and genetic architecture parameter estimates
3.4.2 Cell-type analysis of ALS and its correlated traits
3.4.3 SMR results of ALS and their overlap with CP, EA, SCZ
3.4.4 Study limitations
3.5 CONCLUSIONS
3.6 REFERENCES
3.7 SUPPLEMENTARY MATERIALS

CHAPTER 4: MND GUT MICROBIOME STUDY
4.1 INTRODUCTION
4.1.1 Methods for detection and analysis of the gut microbiota
4.1.2 Current MND gut microbiome studies
4.1.3 Study rationale
4.2 MATERIAL AND METHODS
4.2.1 Sample collection
4.2.2 Anthropometric, metabolic, and clinical measures
4.2.3 DNA extraction
4.2.4 PCR amplification and sequencing
4.2.5 Reads quality control
4.2.6 Microbiome analysis using QIIME 2
4.2.7 Metagenomic prediction using PICRUSt
4.2.8 Microbiome case-control
4.2.9 MND cases only statistical analysis
4.3 RESULTS
4.3.1 summary of participants with samples that passed QC
4.3.2. Sequencing read quality Control
4.3.3 Alpha and Beta diversity
4.3.4 analysis
4.3.5 Metagenomic prediction
4.3.6 MND case-control regression analysis
4.3.7 MND cases specific regression analysis
4.3.7 Association of gut microbiota composition with baseline clinical features
4.3.8 Association of gut microbiota composition with progression of disease
4.3.9 Association of gut microbiota composition with survival
4.4 DISCUSSION
4.5 CONCLUSION
4.6 REFERENCES
4.7 SUPPLEMENTARY MATERIAL

CHAPTER 5: THESIS DISCUSSION
5.1 THESIS SUMMARY
5.1.1 Thesis goal and general strategy
5.1.2 Thesis findings
5.2 THESIS LIMITATIONS
5.2.1 Low SNP-based estimates of ALS GWAS
5.2.2 The limited availability of ALS GWAS and expression level data with sufficient power
5.2.3 The lacked power of gut microbiome study and difficulty in inferring causality
5.3 IMPLICATIONS FROM RESULTS
5.4 FUTURE OF ALS RESEARCH
5.4.1 Population scale whole genome sequencing to investigate rare variants
5.4.2 Single-cell expression sequencing data to pin-point ALS pathology
5.4.3 Application of Induced Pluripotent Stem Cell (IPSC)
5.4.4 Longitudinal gut microbiome data from various location of gastrointestinal tract
5.5 CONCLUSIONS
5.6 REFERENCES

List of Figures and Tables

--- Figures ---

Chapter 2

Figure 1. ALS European GWAS Manhattan plot (left) comparison with ALS GWAS meta-analysis of European and Australian cohort (right)

Figure 2. The new ALS loci from ALS meta-analysis of European and Australian cohort

Figure 3. Prediction accuracy of single-trait predictors of ALS in the Australian cohort

Figure 4. Prediction accuracy of multi-trait predictors compared to the ALS only predictor

Supplementary Figure 1. Effect on ALS prediction accuracy with different SNP selection thresholds and SNP set choice

Chapter 3

Figure 1. Effect on ALS prediction accuracy with different SNP selection thresholds and SNP set choice

Figure 2. Prediction accuracy statistics of PRS of ALS in the stratum-7 cohort using PRS derived from various methods

Figure 3. Cell-type enrichment of total ALS GWAS total summary statistics (all strata)

Figure 4. Cell-type enrichment of total ALS GWAS total summary statistics (all strata) compared to cell-type enrichment of its genetically correlated traits

Figure 5. Results of the SMR+HEIDI analysis that combines the data from GWAS and eQTL studies. In this locus on chromosome 17, there are 4 SMR-significant genes that also passed the HEIDI test in blood eQTLgen

Figure 6. Results of the SMR+HEIDI analysis that combines the data from GWAS and eQTL studies. In this locus on chromosome 2, a new gene (RESP18) that passed the SMR and HEIDI tests in brain meta-eQTL data is reported in a genomic region that is very dense with genes

Figure 7. Results of the SMR+HEIDI analysis that combines the data from GWAS and eQTL studies. In this locus on chromosome 14, the G2E2 gene did not pass the SMR and HEIDI tests in brain eQTL data (upper) but is significant in PsychENCODE data (below)

Figure 8. QQ plots of ALS SMR P-values of (A) brain meta-eQTL and (B) blood eQTL, compared to QQ plots of ALS P-values that are enriched by cognitive performance (CP) significant probes in (C) brain and (D) blood. Educational attainment (EA) enriched significant probes in (E) brain and (F) blood. Schizophrenia (SCZ) enriched significant probes in (G) brain and (H) blood

Chapter 4

Figure 1. Flow-chart/work-flow illustration of Microbiome analysis using QIIME 2 pipeline

Figure 2. Quality Scores distribution graphs for sequencing reads that passed initial Trimmomatic quality control

Figure 3. Rarefaction graphs. (a) Rarefaction effects on number of samples. (b) Rarefaction effects on alpha diversity (Shannon Index). (c) Rarefaction effects on observed OTU

Figure 4. Alpha Diversity comparison plot between controls and MND patients

Figure 5. Beta Diversity PCoA plot for MND patients (red) and controls (green); the size of each dot represents alpha diversity measured by Shannon Index

Figure 6. OTU Taxonomy histogram at Phylum level between MND patients and controls. The overall Phylum distribution of each group

Figure 7. The OTU taxonomy of case-control OTU composition at phylum level (left box) compared with the OTU composition of the four beta-diversity outliers (right box)

Figure 8. PICRUSt2 prediction result with KEGG functional enrichment. No significant difference found between case and control groups

Figure 9. (A-C) Shannon index, (D-F) Chao1 index, and (G-I) logFirmicutes/Bacteroidetes (F/B) ratio relative to (A, D and G) site of onset, (B, E and H) King’s stage, or (C, F and I) metabolic status of patients with MND

Figure 10. Survival probability of MND patients relative to the (A) Shannon index, (B) Chao1 index, and (C) logFirmicutes/Bacteroidetes (F/B) ratio. Crude Kaplan-Meier curves represent time since symptom onset

--- Tables ---

Chapter 2

Table 1. Genetic correlations between ALS and traits used in prediction analysis

Supplementary Table 1. Significant ALS genetic correlations (P-value < 0.05) from the LDHub platform ver 1.9

Supplementary Table 2. Single-trait prediction results into the Australian cohort of 836 ALS cases and 665 controls

Supplementary Table 3. Multi-traits predictor prediction results for ALS single traits and combined ALS+correlated traits in Australian cohort of 836 ALS cases and 665 controls

Supplementary Table 4. ALS prediction results for clumped SNPs in various P-value Thresholds (PT)

Chapter 3

Table 1. Significantly enriched cell-type of ALS presented in Figure 3 sorted by -log(P-values)

Table 2. The details of significantly enriched Central Nervous System (CNS) cell-type of ALS in the correlated traits (CP, EA, and SCZ) sorted by -log(P-values)

Table 3. SMR analysis result across the eQTL summary statistics using ALS GWAS total summary statistics

Table 4. The number of enriched-probes for each ALS correlated traits in brain meta-eQTL and blood eQTLgen

Supplementary Table 1. Genotyping chip used and the sample number for each strata

Supplementary Table 2. Significant genes from SMR analysis (without HEIDI test)

Chapter 4

Table 1. Published studies of MND/ALS gut dysbiosis in humans

Table 2. Clinical data record and metabolic assessment result that used in analysis

Table 3. Passed QC samples demography table

Table 4. Correlations between Shannon and Chao1 indices and patient information collected at the time of metabolic assessment

Table 5. Correlations between Firmicutes and Bacteroidetes (logF/B) ratio and patient information collected at the time of metabolic assessment

Supplementary Table 1. MND microbiome beta-diversity outliers clinical data

List of Abbreviations used in the thesis

AD: Alzheimer’s disease
ALS: amyotrophic lateral sclerosis
ALSFRS-R: ALS Functional Rating Scale Revised
AHR: aryl-hydrocarbon receptor
ASD: autism spectrum disorder
AUC: area under curve

BLUP: Best Linear Unbiased Prediction
BMAA: beta-methylamino-L-alanine
BMI: body mass index
BMR: basal metabolic rate

CHR: chromosome
CI: confidence interval
CNS: central nervous system
CP: cognitive performance

DC: dendritic cell
DNA: deoxyribonucleic acid

EA: educational attainment
ENS: enteric nervous system
ER: endoplasmic reticulum
eQTL: expression quantitative trait loci

F/B: Firmicutes-Bacteroidetes ratio
FDR: false discovery rate
FTD: frontotemporal dementia
FVC: forced vital capacity

GBA: gut-brain axis
GTEx: Genotype-Tissue Expression
GWAS: genome-wide association study

HM3: HapMap 3
HG19: human genome reference version 19
HPA: hypothalamic pituitary adrenal
HR: hazard ratio
HRC: Haplotype Reference Consortium
HREC: Human Research Ethics Committee
HRS: Health and Retirement Study
HWE: Hardy-Weinberg Equilibrium

IBD: identity by descent
IBMPFD: Inclusion body myopathy – Paget disease – Frontotemporal dementia
IPSC: Induced Pluripotent Stem Cell
IQ: intelligence quotient

KEGG: Kyoto Encyclopedia of Genes and Genomes

LD: linkage disequilibrium
LDSR: LD-score regression
LRT: likelihood ratio test

MALDI-TOF: matrix assisted laser desorption ionization-time of flight mass spectrometry
MAMP: microbe associated molecular patterns
MHC: major histocompatibility complex
MND: motor neuron disease
MR: Mendelian randomisation
mRNA: messenger RNA
MZ: monozygotic

NKR2: Nagelkerke-R2

OTU: operational taxonomic unit

RRM: RNA recognition motifs
RNA: ribonucleic acid

SBLUP: Summary-based Best Linear Unbiased Prediction
SCFA: short chain fatty acid
SCZ: schizophrenia
scRNA-seq: single-cell RNA sequencing
SE: standard error
S-LDSR: stratified linkage disequilibrium score regression
SNP: single-nucleotide polymorphism
SMR: Summary-based Mendelian randomisation

PC: principal component
PCA: principal component analysis
PCoA: principal coordinate analysis
PCR: polymerase chain reaction
PD: Parkinson’s disease
PRS: polygenic risk score

QC: Quality Control

WGS: Whole-genome sequencing


Chapter 1: Introduction

1.1 Amyotrophic Lateral Sclerosis (ALS)

1.1.1 Description of ALS

Amyotrophic Lateral Sclerosis (ALS) was first reported in 1824 by the Scottish physician and anatomist Charles Bell [1]. In his report, he described one of the most common symptoms of ALS, the progressive weakness of the limbs. Although his descriptions focused mainly on muscle weakness, he already made an educated guess that damaged motor neurons were responsible for this weakening of the muscles [1]. About half a century after the disease was first recorded, the French physician Jean-Martin Charcot confirmed Charles Bell’s guess using his expertise in clinical autopsy [1,2]. Charcot reported his findings from autopsies of ALS patients in a series of papers that showed evidence of motor neuron death [1]. This series of papers led to the formal definition of ALS and, in 1874, to the naming of the disease from Greek words. The word “amyotrophic” can be translated as the breakdown of muscle tissue through lack of nourishment, “lateral” means to the side (not central), and “sclerosis” means stiffening of structure [1]. Therefore, Charcot defined ALS as a progressive wasting and weakening of muscle tissue, with the implication of a lack of nourishment or activity, caused by the death and stiffening of the lateral nerves. This formal definition also categorised ALS as a motor neuron disease (MND). MNDs include progressive bulbar palsy, pseudobulbar palsy, progressive muscular atrophy, primary lateral sclerosis, and monomelic amyotrophy [3], but ALS is the most common form of MND [4].

As defined by Charcot, ALS is a progressive paralytic disorder characterised by motor neuron degeneration in the brain and spinal cord [5–7]. The disease usually begins in one group of voluntary muscles, either in the limbs or in the bulbar region (the set of muscles that control swallowing, speaking, chewing, and other mouth functions), but spreads relentlessly to almost all voluntary muscles in the body, including the diaphragm [6]. Loss of diaphragm function typically happens 3-5 years after disease onset and often causes lethal respiratory failure in ALS patients [5,6]. In most ALS patients, the first symptoms emerge in the limbs, in the form of muscle weakness or twitching (upper/lower limb onset) [6,7]. However, in about one-third of patients, the symptoms begin in the bulbar region and manifest as difficulty in chewing, swallowing, and speaking (bulbar onset) [6,8]. In the advanced stages of ALS, almost all of the motor neurons that control voluntary movement have died; the eye and sphincter muscles (circular muscles that maintain constriction of a natural body passage) are among the last muscles to be affected [5,6].

Diagnosis of ALS is relatively straightforward when there are clear signs of upper or lower motor neuron weakness6. Motor neuron cells are clustered in two main locations. The upper motor neurons are located in the motor cortex of the brain and the lower motor neurons are located in the brain stem and spinal cord9. The lower motor neurons connect to and stimulate the muscle9. Both types of motor neuron can be affected in ALS. Upper motor neuron failure usually presents as muscle stiffness and spasm, and upper motor neuron onset is more often associated with bulbar onset of ALS8. Lower motor neuron failure usually manifests as uncontrollable muscle twitching as the muscle degenerates and synaptic control over it is lost, before total loss of muscle control6. The progression of motor neuron death can be diagnosed using electromyography, imaging, a series of neurological examinations, and laboratory tests6. It typically takes 4-12 months from first symptom to ALS diagnosis6. According to epidemiological studies in Europe and the United States, ALS affects 3 to 5 people per 100,00010–12 (prevalence ~4 x 10^-5 on average). This statistic is fairly similar worldwide, although there are some areas where ALS is more common10,12. ALS is categorised as a late-onset disease, since incidence and prevalence increase markedly between 40 and 70 years of age (55 years on average). With this late onset, the cumulative lifetime risk of ALS is about 1 in 300 worldwide10,13. ALS can be categorised into two main forms. The first is familial ALS, in which ALS segregates in a family, is usually inherited as a dominant trait, and comprises up to 10% of ALS cases6,7,14. The second is sporadic ALS, which does not seem to run in families (or at least has no known family history), has a more complex pattern of genetic inheritance, and comprises the remaining 90% of ALS cases6,7,14.
In sporadic ALS, the male-to-female ratio is around 2:1, whereas familial ALS shows a ratio closer to 1:1, i.e. roughly equal numbers of men and women6. Most ALS cases with unusually early onset (in the teenage years and early adulthood) are familial6,15. Familial ALS also tends to have an earlier age of onset (~5 years earlier on average) than sporadic ALS15. Recent reviews suggest that the familial/sporadic categorisation is not accurate, since many apparently sporadic cases were probably unaware of other or more distant family members who had ALS, rather than following a truly sporadic pattern16,17.
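As a rough check on the figures quoted above, the sketch below converts the reported range of 3 to 5 cases per 100,000 people into a proportion and splits the expected cases into familial and sporadic fractions. The population size of one million and the function name are illustrative assumptions, and the 10% familial share is the upper bound given in the text rather than an estimate from any particular cohort.

```python
def per_100k_to_proportion(cases_per_100k: float) -> float:
    """Convert a rate expressed per 100,000 people into a proportion."""
    return cases_per_100k / 100_000


# Range quoted in the text: 3 to 5 cases per 100,000 people.
low, high = 3.0, 5.0
mid_prevalence = per_100k_to_proportion((low + high) / 2)

# Hypothetical population size, chosen only to make the numbers concrete.
population = 1_000_000
expected_cases = mid_prevalence * population

# "Up to 10%" familial is treated here as exactly 10% (an upper bound).
familial_fraction = 0.10
expected_familial = familial_fraction * expected_cases
expected_sporadic = expected_cases - expected_familial

print(f"prevalence ~ {mid_prevalence:.1e}")
print(f"expected cases in {population:,} people: {expected_cases:.0f}")
print(f"familial: {expected_familial:.0f}, sporadic: {expected_sporadic:.0f}")
```

Note that this is point prevalence only; the much larger lifetime risk of about 1 in 300 accumulates over a lifetime and cannot be derived from the prevalence figure alone.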

1.1.2 ALS clinical heterogeneity

The clinical presentation of ALS is very heterogeneous. Part of this heterogeneity can be explained by differences between people in the motor neuron population that is affected by ALS17. ALS progression in the upper motor neurons is generally categorised as the “primary lateral sclerosis” form of ALS6. This is the slowest progressing form of ALS and is mainly characterised by spastic muscle stiffness and little muscle atrophy. Limb and bulbar onset cases are equally common in this category6,7,10. On the other hand, ALS progression in the lower motor neurons is usually associated with the “progressive muscular atrophy” form of ALS, in which the main symptom, loss of muscle (atrophy) with minimal muscle spasm, progresses relatively fast6. Beyond “primary lateral sclerosis” and “progressive muscular atrophy”, broader ALS progression through the upper motor neuron population can affect the surrounding areas of the frontal lobe and sensory strip, resulting in dementia and loss of sensory function in the limbs. In fact, ALS with frontotemporal dementia (FTD) is observed in 15 to 20% of ALS patients7. These patients show additional FTD symptoms such as behavioural changes and cognitive decline. Moreover, about 30% of patients diagnosed with FTD also progress to ALS. Some cases of ALS involve frontopontine neurons (neurons that connect the frontal lobes and the pons). Here, patients have severe bulbar onset progression with pseudobulbar palsy symptoms, including emotional lability (excessive laughing or sadness in response to minor emotional stimuli)6,7.

1.1.3 Current understanding of ALS aetiology

To date, the exact aetiology of ALS remains elusive and, as a result, there is no effective treatment. However, with currently available data, advances in biological knowledge, and genomic technology, it is possible to investigate the aetiology of ALS. The diagnosis of several famous people with ALS, such as the baseball player Lou Gehrig and the physicist Stephen Hawking, has raised public awareness of the disease. Awareness increased further with the community-driven Ice Bucket Challenge in 2014, which raised over 220 million US dollars worldwide for ALS research. This increased awareness and financial support continue to build momentum for ALS research.

Current research findings show that ALS is a complex disease involving genetic and non-genetic risk factors6,14,18. Given the unusually late onset (mean age of onset 55 years), it has been hypothesised that ALS results from an accumulation of up to six exposures, with the final exposure triggering the onset of disease18. Reported environmental risk factors include elite athleticism, exposure to beta-methylamino-L-alanine (BMAA) and pesticides, and lifestyle factors (including smoking status, diet, and body mass index)19. However, smoking, exposure to heavy metals, and military deployment were the only replicated risk factors19–25. Larger study samples are needed to validate the other putative environmental risk factors. The genetic factors are described in detail in Section 1.2.

1.1.4 Known risk factors of ALS

1.1.4.1 Metabolism dysfunction

A growing body of evidence is converging in support of metabolic dysregulation as a characteristic of disease progression, at least in a proportion of patients4,26. Indeed, reports dating back to 1977 describe abnormal metabolism leading to low BMI in ALS patients27. Subsequent studies suggest that ALS patients may have impaired glucose metabolism, lipid metabolism, and energy homeostasis4,28. This impairment has a significant impact on disease progression and survival rates. Clinical studies by Lim et al. (2012) found that ALS patients tend to have metabolic dysfunctions that significantly reduce their lean weight and subcutaneous fat29, which may be consistent with observations of hypermetabolism in ALS patients30,31. Metabolism is also a complex trait that involves many factors, with genetics as one of the dominant factors. The latest research suggests that metabolism could also be affected by the gut microbiome32–35.

1.1.4.1.1 Metabolism measurement and genetics

Metabolism is generally defined as the sum of chemical reactions that produce energy and synthesise organic material to sustain life36. These chemical reactions can be measured in at least two ways. The first approach measures the overall calories spent on all metabolic activity; the second concentrates on measuring a specific substance to assess its metabolic activity.

Metabolism is a very complex process involving many substances and interconnected processes. Therefore, concentrating only on a few specific substances might lead to an underestimation of the overall effects of metabolism. To avoid this problem, many studies measure metabolic rate using the basal metabolic rate (BMR). BMR is defined as the energy per unit time that a person uses to keep the body functioning at rest37. These functions, which include breathing, blood circulation, body temperature control, cell growth and repair, brain and nerve electrical function, and essential (mostly involuntary) muscle contraction, can account for 60-75% of daily energy expenditure38. The measurement of BMR requires a strict set of criteria: the person should be measured in a physically and psychologically undisturbed state, in a thermally neutral environment, and in the post-absorptive state (not actively digesting food or on medication)37. Because of these criteria, BMR is usually measured in a specialised facility or using a Bodpod(TM)39,40. With current advances in biomolecular detection technology, it is also possible to measure a specific substance of interest. This approach might not reflect the overall metabolic rate (as BMR does), but provides insight into specific metabolic functions that are relevant to disease or drug testing41–43. The most widely used approach to define the metabolism of a substance is to measure it directly in blood or urine samples. A higher measured amount of a substance indicates a lower rate of metabolism for that specific substance. There are many ways to measure the substance of interest, mostly relying on metabolomic methodologies that utilise spectrometry in combination with chromatography and nuclear magnetic resonance44.
Twin studies suggest that not all variance between individuals in substance metabolism is heritable, but glucose, amino acid, and lipid metabolism measured in twin studies were found to be heritable, with heritability estimates ranging from 20-80%45. Another approach to define substance metabolism is to measure the change of a substance in blood samples. This requires at least two blood sampling time points, to obtain a baseline measure and assess its change over time. This approach is less prone to bias, but it is logistically harder and, for some substances, limited by measurement accuracy, so its application is restricted to major substances that are available in high concentration (>0.1 mg/mL)46. This approach was applied in a twin study which estimated that the genetic contribution to variation between people in carbohydrate metabolism was around 40%47,48. Therefore, metabolism has an important degree of genetic control. Metabolomic studies have shown that the human metabolome is sensitive to various genetic and environmental stimuli. Therefore, metabolic measurement is prone to confounding effects. Environmental factors including lifestyle, diet, and stress are usually recorded as potential confounders49. Another confounder is variation in the gut microbiota. Some studies have shown that gut microbiota can modify the host metabolome through co-metabolism of compounds such as dietary phenols and phenylalanine, and through co-metabolism of xenobiotics50–52.
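Twin-based heritability figures such as those quoted above are commonly obtained by comparing trait similarity in monozygotic (MZ) and dizygotic (DZ) twin pairs. As a minimal sketch, and assuming the classical Falconer approximation rather than whatever models the cited studies actually fitted, heritability can be approximated as twice the difference between the MZ and DZ trait correlations. The correlation values below are invented for illustration and are not taken from any cited study.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Falconer's approximation h^2 = 2 * (r_MZ - r_DZ), clipped to [0, 1].

    r_mz and r_dz are the trait correlations within monozygotic and
    dizygotic twin pairs, respectively.
    """
    for name, r in (("r_mz", r_mz), ("r_dz", r_dz)):
        if not -1.0 <= r <= 1.0:
            raise ValueError(f"{name} must be a correlation in [-1, 1]")
    h2 = 2.0 * (r_mz - r_dz)
    # Sampling noise can push the raw estimate outside the valid range.
    return max(0.0, min(1.0, h2))


# Hypothetical twin correlations for a metabolite level (illustrative only).
example_h2 = falconer_heritability(r_mz=0.60, r_dz=0.40)
print(f"approximate heritability: {example_h2:.2f}")
```

The approximation assumes purely additive genetic effects and equally shared environments for both twin types; modern twin analyses usually fit structural equation models instead.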

1.1.4.1.2 Gut microbiome mediated metabolism

The human gastrointestinal tract contains more than 100 trillion microorganisms that define the human gut microbiome52. Together, these microorganisms encode more than 3 million genes and produce thousands of metabolites, whereas the human genome consists of only around 23,000 genes49,52. Human life depends on these communities of microorganisms, since many important life-sustaining metabolic activities are not supported by the human genome alone53. Given this large role in human metabolism, the gut microbiome can affect human health in various ways, including effects on the brain and on the development of neurodegenerative disease.

1.1.4.1.2.1 Gut-brain axis

Bidirectional communication between the brain (central nervous system) and the gut has long been recognised. The clinical observation of this connection is supported by several studies in animal models54–56. Moreover, recent gut microbiome studies suggest various possible mechanisms for this gut-brain communication, giving rise to the term Gut-Brain Axis (GBA). The GBA itself is a very broad concept covering the connections between gut and brain. A first possible explanation of this concept comes from the anatomical similarities between the central nervous system (CNS) and the enteric nervous system (ENS). A second clue comes from the many studies of how the human gut microbiome affects the CNS through various pathways. The human gut is a very complex organ that hosts a broad spectrum of activity controlling digestion, absorption, and the filtering of toxic components57. To coordinate all of these complex activities, the ENS provides autonomic control, which is almost independent of the CNS57,58. In the human nervous system, the ENS is the largest component of the autonomic nervous system and closely resembles the CNS itself59.

This resemblance means that many of their signalling pathways and anatomical features are common58. Consequently, a disease mechanism in the CNS is quite likely to manifest similarly in the ENS. This might explain why some brain-related diseases such as Parkinson’s disease (PD), Alzheimer’s disease (AD), Autism Spectrum Disorder (ASD), schizophrenia (SCZ), and amyotrophic lateral sclerosis (ALS) also involve digestive or metabolic abnormalities58. Beyond its anatomical similarity with the CNS, the complexity of the human gut also comes from the presence of over 100 trillion microorganisms that live in the human intestinal tract, forming a complex microbiome community60. These organisms make critical contributions to human metabolism and the immune system, and affect CNS function61. For example, the gut microbiome may assist brain immune cell maturation and also communicate with the CNS via various pathways, adding to the complexity of the gut-brain axis61,62.

1.1.4.1.2.2 Gut microbiome role on development of brain immune cells

There are indications that the microbiome can affect the CNS resident immune system. Mouse model studies have observed possible involvement of the gut microbiome in the development of brain immune cells such as microglia and astrocytes61,63. Microglia are the most abundant resident immune cells in the brain63. They have a wide range of immune and other functions, such as phagocytosis, antigen presentation, cytokine production, inflammatory response, clearing protein debris, and removing infectious agents. During the neurodevelopmental period, microglia also assist neuronal circuit wiring61,62. For example, germ-free mice had significantly more immature microglia than normal mice with a complex gut microbiome64, suggesting that the gut microbiome contributes to microglial maturation. Further studies suggest this maturation process is not driven by a single microbe class, but by the greater diversity of the gut microbiome61. One proposed mechanism to explain this microbiome-assisted maturation involves the microbial fermentation products, short chain fatty acids (SCFAs)64. Indeed, SCFAs are involved in many metabolic pathways that influence gastrointestinal physiology and blood-brain barrier integrity and that might affect microglia in the CNS. Morphologically, immature microglia have more segments and branches, which leads to increased process length and slower activation64. These characteristics impair the proper immune function of microglia, including their ability to clear protein debris, which supports the accumulation of misfolded protein seen in ALS pathology. It is possible that immature microglia contribute to the risk of ALS development by being slow, or failing, to remove accumulating protein debris in the brain. Astrocytes are the other brain immune cells that are affected by the gut microbiome. As the most abundant glial cells in the CNS, they have an important role in the regulation of the blood-brain barrier, ion gradient balance, cerebral blood flow, and nutrient transport65. Astrocytes also have immune-related functions: they express pattern recognition receptors for microbe associated molecular patterns (MAMPs) and modulate neuroinflammation through cytokine production and antigen presentation via MHC II66. Astrocytes communicate with the gut microbiome through one of their MAMP receptors, the aryl-hydrocarbon receptor (AHR)67. Gut microbes metabolise tryptophan to produce natural AHR ligands, including indole-3-aldehyde and indole-3-propionic acid68,69. These ligands have been found to activate astrocyte AHR and reduce inflammation in experimental autoimmune encephalomyelitis67. In ALS pathology, the involvement of astrocytes in the disease mechanism has been proposed in the context of failure to maintain ion balance due to SOD1 mutation14. However, in sporadic ALS cases, it is also possible that astrocytes fail to be stabilised by AHR ligands, leading to excessive inflammation that causes autoimmune encephalomyelitis-like neuron death. Demyelination of neurons, which is a hallmark of autoimmune encephalomyelitis, is also detected in ALS70.

1.1.4.1.2.3 Gut microbiome communication with CNS

Three routes have been proposed to explain microbiome interaction with the CNS61. First, microorganisms in the gut are able to produce metabolites that are absorbed and transmitted to the CNS via blood vessels71,72. Some of those metabolites are neurotransmitters that affect the CNS73,74. Second, gut microbiota can communicate with the CNS via the vagus nerve in the gastrointestinal wall61. The vagus nerve is a parasympathetic motor and sensory nerve bundle that sends signals from peripheral organs to the CNS, and these signals can be affected by microbiome activity75. Third, biochemical changes in the CNS can affect intestinal physiology, and hence microbiome composition, through the hypothalamic-pituitary-adrenal (HPA) axis61. For example, physical and psychological stress can stimulate the HPA axis to change intestinal permeability, motility, and mucus production, which can lead to changes in microbiome composition76,77. All these mechanisms allow the gut and brain to communicate and influence each other.

Gut-brain communication involving metabolites that are absorbed into blood vessels may provide a possible explanation for ALS pathology. The gut microbiota can assist metabolism by degrading complex molecules and metabolising chemicals (including drugs)50. While this capability can assist metabolism, it can also result in the production of toxic molecules50,78. For example, methanoarchaea methylate bismuth to make trimethylbismuth, which has strong neurotoxic properties79. If this metabolite can enter blood vessels and reach the CNS, it could increase the risk of neurodegenerative disorders, including ALS. Many other lines of evidence suggest that the gut microbiome can produce neurotoxins. For example, Clostridium can produce neurotoxins such as tetanus and botulinum toxin80. Cyanobacteria produce saxitoxin and anatoxin, which are reported to be toxic and involved in neurodegenerative disease81.

1.1.4.2 Physical activity

Physical activity, especially frequent vigorous exercise, has also been proposed as an ALS risk factor. This link has been proposed since the 1930s, when the legendary baseball player Lou Gehrig, famous for his physical feats and nicknamed “The Iron Horse”, was diagnosed with ALS82. More support for this association comes from the reported increased risk of ALS in professional soccer and football players, and in Gulf War veterans21,83,84, although the latter has other potential environmental causes. The first large cohort study was of young Swedish men (aged 16–25, n = 1,819,817) who enlisted in the military between 1968 and 2005 (Swedish Conscript Register); it showed increased risk of ALS in individuals with low Body Mass Index (BMI) and higher physical capabilities. The results from this cohort were still not conclusive22,85. A more comprehensive study involving five different populations in the Netherlands, Ireland, and Italy found that vigorous and moderate activity are linearly correlated with the risk of ALS. It also found that the strength of the correlation between physical activity and ALS risk differed among the population cohorts. Previous studies from Japan and the UK found a similar correlation using smaller sample sizes86,87. In line with this, some professions that demand a high physical workload tend to have a higher proportion of ALS patients than others. Professional athletes and people in military-related jobs are recorded as having a higher risk of developing ALS later in life88–90. A systematic study of physically demanding sports with a high probability of head trauma showed a higher risk of subsequent ALS83,89. A similar observation was reported in military service related occupations22,89,90. Studies from 2003-2016 of war veterans in the US21,91, Sweden22, and Denmark92 found that the risk of developing ALS was significantly higher than in control populations. Moreover, front-line military personnel tend to have a higher risk than those in supporting roles93. This raises the concern that chemicals used on the battlefield and head trauma might be responsible for the increased risk of ALS92,93.

1.1.4.3 Chemical and heavy-metals exposures

Exposure to heavy metals, particularly lead, has been associated with increased risk of ALS94. This observation was also replicated in a recent systematic study in Spain that spanned 10 years (2007-2016), which found a significantly increased risk of developing ALS in areas with high exposure to heavy metals25. In particular, exposure to lead and zinc was observed to increase the risk of ALS by ~20%25. Previous smaller studies also linked exposure to magnesium, selenium, aluminium, and cadmium with increased risk of ALS24,89. Besides heavy metals, organic solvent exposure95 and pesticide exposure89,96 have also been linked to ALS risk. A study in 2009-2012 of populations in the Gulf desert and in New Hampshire in the US showed that the geographical distribution of beta-N-methylamino-L-alanine (BMAA), a neurotoxic amino acid produced by Cyanobacteria, was correlated with ALS incidence97. In 2013, a similar observation was reported in France98. Consistent with possible dangerous chemical exposure in military service, associations of ALS with both heavy metals and other chemicals had also been reported previously. Smoking has been linked with ALS, but the link has not been consistently replicated20,99,100.

1.1.4.4 Education attainment related

Other risk factors that are often mentioned in ALS studies are related to cognition. Cognitive impairment affects approximately 30% of ALS patients. Of these cognitively impaired ALS patients, about 10% progress to develop frontotemporal dementia (FTD)85,101. However, most of the studies reporting this association discussed cognitive impairment as a consequence of ALS and FTD, rather than cognitive performance as a risk factor for ALS102–104. Some recent studies reported that lower education attainment increased ALS-FTD risk in Italian and United States ALS cohorts105,106. However, the larger Swedish Conscript Register cohort reported the opposite result: higher IQ in young conscripts was linked with higher ALS risk in later life85. These studies suggest that education attainment related traits are associated with ALS risk, but the direction is not yet clear.

1.2 Genetics of ALS

Genetics is one of the dominant contributors to ALS aetiology. The first evidence of a genetic contribution to ALS is the fact that about 10% of ALS cases are found within families with multiple affected individuals1–3. In 1993, the first genetic mutations found to be the cause of ALS in familial cases109 were mapped to the gene SOD1. With the advance of genome-wide technology, more than 50 other genes have been found to be potentially causal for ALS, although validating the causality of these genes remains challenging14,107. The remaining 90% of ALS patients are not aware of other family members affected by ALS, which makes ALS seem sporadic14. This notion of ‘sporadic ALS’ is often mistakenly used to refer to ALS without any genetic basis. Even though only 10% of ALS cases occur in families with many affected members, genetic variants identified in familial ALS are sometimes also detected in sporadic cases14. Across numerous studies, mutations in the SOD1 gene that are frequently reported in familial cases appear in 3% of sporadic cases, and another 10% might be caused by the massive intronic hexanucleotide repeat expansion in the C9orf72 gene7,14,107. Similarly, other ALS variants identified through segregation analysis of families with many affected members (i.e., in genes such as TARDBP, FUS, HNRNPA1, SQSTM1, VCP, OPTN, and PFN1) are also reported in those presenting as sporadic ALS cases, although they are rare14. GWAS has found more genes associated with sporadic ALS, including the GPX3/TNIP1, TBK1, UNC13A, and C21orf2 loci110,111. The associated genetic variants might only increase the risk of ALS; they are not necessarily sufficient for developing ALS14. Indeed, several genetic studies of sporadic ALS are converging to support oligogenic inheritance, since there is evidence of interaction and cumulative effects among mutations in sporadic ALS genes112,113.

1.2.1 Functional mechanisms of ALS known associated genes

While the precise functional pathways to ALS remain elusive, genes associated with ALS susceptibility indicate three possible disease mechanisms: disturbed protein homeostasis, perturbation of RNA stability, and disturbed cytoskeletal dynamics.

1.2.1.1 Disturbed protein homeostasis

A normally functioning cell maintains proteostasis (the balance of protein concentration inside the cell) by removing unused or no-longer-needed proteins. In ALS, an affected person may have a genetic variant that increases the risk of protein misfolding14,107. These misfolded proteins are harder to remove, and in some instances the protein removal function itself is compromised, which leads to an accumulation of protein deposits that disturbs the proteostasis of the cell107. The first gene in this category, and the first identified as a cause of familial ALS, is superoxide dismutase 1 (SOD1). It encodes the SOD1 protein, whose main function is to protect the cell against reactive oxygen radicals by catalysing the dismutation (partitioning) of the superoxide radical into less harmful forms such as ordinary oxygen109. Some genetic variants that are likely causal for ALS appear to make this protein more prone to misfolding. The misfolded SOD1 cannot be refolded by the cell-chaperone system that assists protein folding, and since the misfolded SOD1 has changed its shape, the ubiquitin proteasome system and autophagy cannot tag it for removal. This leads to accumulation of misfolded SOD1 in cells, which is toxic and causes cell death14,107. Disturbed protein homeostasis in ALS also involves genes that tag unused proteins, such as Ubiquilin-2 (UBQLN2) and Sequestosome-1 (SQSTM1), which are often found mutated in ALS cases. Normal UBQLN2 protein tags proteins that are no longer needed by the cell or need to be removed. This tagging is a signal for proteasome and autophagosome activation to degrade the tagged protein (ubiquitin proteasome system and autophagy)114. Mutations in the UBQLN2 gene make ubiquilin-2 less likely to tag proteins for elimination, leading to a reduction in protein removal efficiency and accumulation of proteins inside the cell107.
The other gene that often works together with UBQLN2 is SQSTM1, whose product has a similar structural construct to ubiquilin-2 and signals to Nuclear Factor kappa-B (NF-kB) to degrade the ubiquitinated protein (unused protein tagged by ubiquilin)114. Mutations in SQSTM1 are found in both ALS and FTD cases, but the exact mechanism by which this pathway increases FTD and ALS risk is not well understood. Some hypotheses suggest that rare SQSTM1 mutations found in some ALS cases cause loss or reduction of Sequestosome-1 function, which leads to impaired protein removal and protein accumulation14,107. Mutations in genes that control the unused protein removal pathway, such as Valosin Containing Protein (VCP), Optineurin (OPTN), and Charged Multivesicular Body Protein 2B (CHMP2B), often result in loss of function, make the protein removal process less efficient, and increase the likelihood that ubiquitinated protein accumulates in the cell. The first gene involved in this mechanism, VCP, is a multifunctional AAA+-ATPase that controls a number of processes, including autophagy and proteasomal degradation107. In the proteostasis pathway, VCP directs ubiquitin-tagged (ubiquitinated) protein to the proteasome and catalyses the maturation of autophagosomes114. Traditional positional cloning approaches demonstrated that VCP was involved in inclusion body myopathy (IBM) with Paget’s disease of the bone (PDB) and FTD (IBMPFD)115. A subsequent exome sequencing study in familial ALS cases found that VCP mutations are also involved in ALS107,116. The OPTN gene is involved in membrane and vesicle trafficking, NF-kB regulation, and autophagic clearance of protein aggregates107,114. OPTN homozygous deletions, nonsense mutations, and heterozygous missense mutations have been reported in various ALS cases107,117. Lastly, mutation of the regulatory protein CHMP2B, whose main functions are sorting integral membrane proteins in multivesicular bodies and autophagic degradation, causes the protein to lose some of its function and results in a failure of protein breakdown sufficient to induce cell dysfunction. Mutations in this gene are associated with FTD, but may also give rise to ALS107.

1.2.1.2 Perturbation of RNA stability

Some genetic variants associated with ALS are involved in various aspects of mRNA metabolism in the cell. Two genes associated with familial ALS, TAR DNA-binding Protein (TARDBP/TDP-43) and Fused in Sarcoma (FUS), encode proteins that are structurally similar and heavily involved in RNA metabolism. The TARDBP gene encodes the TDP-43 protein, which is involved in almost all RNA processing mechanisms, such as transcription, splicing, transport, and translation14,107. In animal model studies, this protein is known to have two RNA recognition motifs (RRM) involved in the processing of ~600 mRNAs and in 956 splicing events118. Under normal conditions TDP-43 is found predominantly in the nucleus114, but in a proportion of ALS patients TDP-43 is redistributed from the nucleus to the cytoplasm14,107. This misplaced TDP-43 protein reduces RNA processing in the nucleus and creates toxic aggregates in the cytoplasm. The aggregates might enhance their toxicity by binding RNA or protein and causing them to lose their function119. The other similar protein, FUS, also has a crucial role in RNA metabolism14,107. This protein interacts with more than 5,500 genes through its binding motif. FUS depletion or loss of function affects the splicing of more than 950 mRNAs120. Mutations in this gene are found in 4% of familial and 1% of sporadic ALS cases.

39 Although ALS patients with FUS mutations have a low chance of cognitive impairment, the overall survival is significantly shorter compared with other ALS forms107. Mutations in genes that encode proteins known to interact with TDP-43 and contribute to its aberrant activity, like Matrin-3 (MATR3) and Heterogeneous nuclear ribonucleoprotein complex (hnRNP), are also proposed to be potential causes of ALS. MATR-3 is a highly conserved protein and often associated with the nuclear matrix. Its function is to bind and stabilize several mRNA processing proteins, including TDP-43107. The exact involvement of this protein still needs further investigation. But, its involvement with TDP-43, might results in TDP-43 abnormal activity in ALS patients. Mutations in the prion-like domain of hnRNP protein variants have been reported in ALS and IBMPFD (Inclusion body myopathy – Paget disease – Frontotemporal dementia) patients107. These mutations enhance hnRNP protein to form stress granules by increasing their tendency to form self-seeding fibrils. Its abnormal activity while interacting with TDP-43 (prion-like) can alter RNA metabolism, which is reported cause to neuron cell death in neurodegenerative disease14. Although the exact mechanism involving this protein is still not clear, it gives the rise of a hypothesis of prion-link mechanism in ALS development. Moreover, SOD1, TDP-43, and FUS also have similar prion- like domain as hnRNP14,107. A mutation in C9orf72, which is a hexanucleotide repeat (GGGGCC), is the most common ALS associated genetic variant. In its normal state, the repeat number is below 30, but in ALS patients the repeat number could be hundreds to thousands14,107,108. This variant has a high penetrance (40% familial and 10% sporadic cases has this mutated variant), and it also strongly correlated with ALS bulbar onset and cognitive decline28,107. 
The toxicity of the expanded C9orf72 repeat involves multiple RRM-containing proteins such as hnRNP, FUS, SFPQ, ILF, NONO, and PurA. The expansion acts as a sink for these proteins, which dysregulates the splicing of various mRNAs.
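As a rough illustration of the repeat-length threshold described above, the following Python sketch counts the longest uninterrupted run of GGGGCC units in a DNA string and compares it with the cut-off of 30 repeats quoted in the text. The function names, the toy sequences, and the use of plain string matching are assumptions made for illustration only; in practice repeat sizing relies on laboratory assays rather than on a simple string search.

```python
import re

REPEAT_UNIT = "GGGGCC"      # hexanucleotide unit named in the text
NORMAL_MAX_REPEATS = 30     # text: normal alleles carry fewer than 30 repeats


def longest_repeat_run(sequence: str, unit: str = REPEAT_UNIT) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


def is_expanded(sequence: str) -> bool:
    """Flag a sequence whose longest run reaches the assumed cut-off of 30 repeats."""
    return longest_repeat_run(sequence) >= NORMAL_MAX_REPEATS


# Toy sequences (hypothetical, not real alleles)
normal_allele = "ATCA" + REPEAT_UNIT * 8 + "TTAA"
expanded_allele = "ATCA" + REPEAT_UNIT * 400 + "TTAA"

print(longest_repeat_run(normal_allele), is_expanded(normal_allele))
print(longest_repeat_run(expanded_allele), is_expanded(expanded_allele))
```

The sketch treats any interruption of the repeat as the end of a run, which is a deliberately strict simplification.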

1.2.1.3 Cytoskeletal defects

The previous sections explained plausible mechanisms by which some genetic variants can cause cell death in ALS patients, but they cannot explain why only motor neurons are damaged in ALS. One of the most plausible explanations is that motor neurons are the most asymmetrical of all cells in the human body. Motor neuron axons can extend more than one metre, which makes them highly dependent on axonal transport for delivering organelles and vesicles between soma and synapses in order to survive and function properly. Other ALS-associated genes might be involved in cytoskeletal defects. Some mRNA metabolism involves neurofilaments. Mutant TDP-43 and FUS proteins bind to neurofilament light (NEFL) and decrease the expression of light neurofilament, which slows cell repair after damage. Mutations in neurofilament heavy (NFH) cause overexpression of the intermediate filament peripherin, which often creates additional inclusion bodies that are hard for the autophagosome to remove and can form aggregates that are toxic to the cell107. Mutations in genes that regulate cytoskeleton dynamics, such as Profilin (PFN1), Ephrin receptor (EPHA4), Tubulin Alpha 4A (TUBA4A), and Dynactin (DCTN1), might explain how the asymmetrical structure of neurons could contribute to ALS aetiology. Firstly, PFN1 is the main regulatory protein that maintains actin dynamics in the cell. A proportion of ALS patients have mutations in this gene that reduce actin binding and thereby disturb the cytoskeleton14,107. Moreover, cytoskeletal mechanisms are important for the formation and disassembly of RNA-containing stress granules, and their lost or reduced function could contribute to RNA aggregation in the cell107. Mutations in TUBA4A disturb microtubule dynamics and the stability of the cytoskeleton107. The EPHA4 receptor is part of the ephrin axonal repellent system, which induces cytoskeleton remodelling through RhoA-GTPase107. In animal model studies, EPHA4 expression has been found to correlate inversely with ALS onset and survival121. Some rare forms of ALS show abnormal ephrin signalling involving vesicle-associated membrane protein B (VAPB) and C (VAPC), which potentially disturbs vesicle transport122. Lastly, DCTN1 stabilises the binding of cargo proteins to the motor protein dynein, which is important for substrate transport in cells, especially long and asymmetric cells such as neurons. DCTN1 mutations reported in ALS may contribute to the axonal transport deficit that is often seen in ALS107.

1.2.2 Cell-biology of ALS known genes

As discussed above, a possible mechanism involved in ALS aetiology is the alteration of proteostasis and protein quality control. The main gene involved in this mechanism is the first gene that was identified in familial cases, SOD1. This gene encodes Cu-Zn superoxide dismutase, an important antioxidant enzyme that catalyses the conversion of highly toxic superoxide, a waste product of mitochondrial processes, into hydrogen peroxide and oxygen14,108. Missense mutations in this gene produce a misfolded version of the SOD1 protein and make its detoxifying (dismutase) activity less efficient14. However, most motor neuron death in ALS is not caused by the reduced dismutase activity, but more likely by the accumulation of this misfolded protein in the cell14,107. From current knowledge of other neurodegenerative diseases (Parkinson’s and Alzheimer’s diseases), there is a possibility that large aggregates of misfolded SOD1 are not sufficient to cause ALS, and the elimination of these accumulated misfolded proteins does not improve the condition of mice expressing mutant SOD114,123. In order for ALS to develop, other biological mechanisms are likely to be involved. These include the neuron-supporting cell system, endoplasmic reticulum (ER) stress, and mechanisms related to the shape of neurons.

1.2.2.1 Neuron supporting cell and immune system (glial cell and microglia)

One of the most plausible hypotheses about neuron-supporting cells in ALS aetiology concerns the involvement of glial cells. Glial cells form a support system for neuronal activity. They play important roles in supplying nutrients, insulating one neuron from another, destroying pathogens, and removing dead neurons124. Glial cells consist of microglia, oligodendrocytes, astrocytes, and ependymal cells. Each type has its own specialised functions and might contribute to ALS aetiology14. In all ALS cases, microglia, the glial cells responsible for innate immunity, are activated. There are several mechanisms related to microglial interactions. The first possible mechanism involves the production of mutant SOD1 in microglial cells. This production of misfolded SOD1 worsens neuronal toxicity and eventually kills the neurons14,125. The second possible mechanism involves the stimulation of excessive toxic superoxide production by microglia. The accumulated misfolded SOD1 can associate with the small GTPase RAC1, which controls the activation of NADPH oxidase, a complex that produces toxic superoxide126. In this case SOD1, which normally removes intracellular superoxide, instead drives microglia to produce a high level of extracellular superoxide14. The third mechanism involves the C9orf72 mutation. C9orf72 encodes a potential guanine exchange factor for one or more unidentified G proteins. The massive expansion of the C9orf72 hexanucleotide repeat leaves these G proteins inactive because they lack their guanine exchange factor, which potentially causes abnormal microglia and age-related neuroinflammation14,127.

Mutant SOD1 might also be associated with dysfunction of oligodendrocytes, another type of glial cell, leading to ALS. Under normal conditions, oligodendrocytes produce myelin (a fatty substance around the axon that acts as an electrically insulating layer) for upper motor neurons and the initial axonal segment of lower motor neurons9. Oligodendrocytes also support motor neuron function by supplying the energy metabolite lactate through the action of monocarboxylate transporter-1 (MCT1)14. In a mouse model, mutant SOD1 was found to impair the expression of MCT1128, which potentially inhibits energy supply and starves the motor neuron to death. Another type of glial cell, the astrocyte, provides motor neurons with essential nutrients, ion buffering, and recycling of the neurotransmitter glutamate9. The exact mechanism linking this cell type with SOD1 is still unclear. There are several proposed mechanisms that might explain the contribution of SOD1 and astrocytes to ALS. The first proposed mechanism is the involvement of astrocytes and mutant SOD1 in causing excitotoxicity (excessive motor neuron firing that eventually leads to the accumulation of by-products that are toxic to the cell)129. Astrocytes can limit motor neuron firing through the swift recovery of glutamate, which is mediated by excitatory amino acid transporter-2 (EAAT2). Mutant SOD1 seems to cause the loss of EAAT2 in several studies on rodent models130,131. This loss of EAAT2 prevents astrocytes from clearing glutamate quickly, so motor neurons continue to fire action potentials, which causes an excessive calcium influx as a by-product. This excessive calcium influx causes mitochondrial and endoplasmic reticulum (ER) stress14.

1.2.2.2 Endoplasmic reticulum stress

The endoplasmic reticulum (ER) is an organelle that is responsible for protein folding, post-translational modification, trafficking of secretory proteins, lipid synthesis, and regulation of intracellular calcium levels132. Because it carries out so many important cellular functions, ER stress can arise in several ways. The first is the excessive calcium influx described above. As mutant SOD1 causes loss of EAAT2 function, which in turn causes excessive motor neuron firing and excessive calcium influx, the ER is continuously burdened with removing this excess calcium, which leads to ER stress14. The second way is by jamming the ER’s protein degradation mechanism. Misfolded SOD1 can bind to the cytoplasmic surface of the ER integral membrane protein Derlin-1, which is important for catalysing the extraction and degradation of misfolded proteins14,133. This binding causes ER stress, since the ER cannot remove the misfolded protein and its accumulations. In this state, the ER will try to restore physiological balance by activating the unfolded protein response (UPR) signalling pathway134. If the UPR fails to restore the cell’s physiological balance, cell death signalling pathways are triggered and the cell undergoes apoptosis (a series of biochemical events leading to cell death)135,136. Similar mechanisms can be observed for other ALS-associated proteins that are involved in the protein clearance system of the ER, since several potential ALS-causing mutations occur in genes that are directly involved in protein degradation. Two main examples are the ALS-associated proteins Ubiquilin-2 and Sequestosome-1, which serve as adapters that deliver polyubiquitinated proteins to the proteasome or autophagosome for degradation137. Mutations in these two proteins lead to a less efficient protein degradation system, which causes protein accumulation in the ER14,137. Similarly, other mutated proteins involved in these protein degradation pathways, such as Optineurin, Valosin-containing protein, and proteins encoded by genes like CHMP2B and VAPB, cause protein accumulation through inefficient vesicle-mediated protein transport138. Inevitably, all of these protein accumulation mechanisms trigger cell death signalling pathways that lead to apoptosis135.

1.2.2.3 Motor neuron shape and active transport system

The pathways discussed so far are relevant to almost all cells, so what makes motor neurons special? Why do motor neurons break down before other nerve cells? One of the most plausible explanations comes from the structure of motor neurons, which are the most asymmetric cells in nature and therefore rely heavily on the cell’s active transport system14,107. This transport system is prone to disturbance by the accumulation of mutant SOD1 inside the cell. Although the accumulation of misfolded protein is generally fatal for any neuron, it is especially fatal for motor neurons. The accumulated misfolded protein also prevents properly folded proteins from binding to the active transport system, resulting in less effective transport to the far tip of the asymmetric neuron. In ALS, the accumulation of mutant SOD1 has been shown to slow down transport and starve the long axon, with neurodegeneration taking a couple of months to occur139. The spatial distribution of mRNA depends on microtubule-dependent transport granules and other factors. Active transport is itself regulated by several RNA-binding proteins associated with ALS, including TDP-43, FUS, and heterogeneous nuclear ribonucleoprotein (hnRNP)14, which makes the transport system very prone to disruption. Moreover, genetic studies (discussed above) provide support that some genetic variants could disturb the dynamics of the cytoskeleton, which makes the asymmetric motor neuron more vulnerable.

1.2.3 ALS genetics current limitations

Genetic studies to date have made a major contribution to the current understanding of ALS aetiology; however, there is still no effective treatment for ALS. Despite the available knowledge, the exact aetiology of ALS remains elusive, which is reflected in at least two core problems. The first problem is the “missing heritability” of ALS. Although GWAS seems to be a good approach to disentangle the complex-trait nature of ALS (especially sporadic ALS), there is a huge gap between the heritability estimates obtained from twin studies and from GWAS. A twin study of 171 pairs (49 monozygotic, 122 dizygotic) estimated the heritability of ALS at 0.61 (95%CI = 0.38-0.78)140, and the latest family-based study in the Danish population (5,808 probands and 580,800 matched controls) estimated it at 0.43 (95%CI = 0.34-0.53)141, whereas the published SNP-based heritability (from GWAS) was only around 0.08 (95%CI = 0.07-0.09)111. The SNP-based heritability estimated from a newer published GWAS110 with a larger sample size was significantly smaller (0.018, 95%CI = 0.014-0.024). These observations suggest that a large portion of ALS heritability is missing and cannot be explained by the common SNPs available on most GWAS chips. Part of this missing heritability is likely attributable to rare variants142. The second problem emerges from the heterogeneous nature of ALS. This heterogeneity makes the phenotyping of ALS less precise, which could create larger variation (and standard errors) in ALS GWAS results than expected under a homogeneous disease phenotype. As a consequence, larger sample sizes are needed to achieve enough statistical power to accommodate this larger variation. The increased statistical power that comes from increased sample size is very important for detecting the small genetic effects that have not yet reached statistical significance at current sample sizes.
Acknowledging the involvement of rare variants and the larger sample sizes needed to understand ALS, a large consortium in Europe established “Project MinE”, which aims to perform whole genome sequencing of a large ALS case-control cohort of European ancestry143 (aiming for 15K cases and 7.5K controls). However, this effort might take a long time to complete. Therefore, other feasible strategies are needed to increase our understanding of ALS aetiology and translate it into clinical application.
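The two problems above can be made concrete with a small numerical sketch. The Python snippet below first computes the fraction of the twin-based heritability estimate that is recovered by the SNP-based estimates quoted in the text, and then shows how the power of a single-SNP association test grows with sample size. The power calculation is a simplification: it assumes a two-sided 1-degree-of-freedom test whose non-centrality parameter scales linearly with sample size, a genome-wide significance threshold of 5e-8, and a hypothetical per-individual contribution `ncp_per_sample`. None of these values are taken from the ALS GWAS themselves.

```python
from statistics import NormalDist

# Heritability estimates quoted in the text
H2_TWIN = 0.61          # twin study
H2_SNP_EARLIER = 0.08   # earlier GWAS
H2_SNP_LARGER = 0.018   # newer, larger GWAS


def fraction_captured(h2_snp: float, h2_reference: float) -> float:
    """Share of a reference heritability estimate explained by common SNPs."""
    return h2_snp / h2_reference


def gwas_power(n_samples: int, ncp_per_sample: float, alpha: float = 5e-8) -> float:
    """Power of a two-sided 1-df test with NCP = n_samples * ncp_per_sample."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (n_samples * ncp_per_sample) ** 0.5   # expected z-score
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


print(f"earlier GWAS: {fraction_captured(H2_SNP_EARLIER, H2_TWIN):.1%} of twin h2")
print(f"larger GWAS:  {fraction_captured(H2_SNP_LARGER, H2_TWIN):.1%} of twin h2")

# Hypothetical small-effect SNP: power rises steeply with sample size
for n in (10_000, 50_000, 200_000):
    print(n, f"{gwas_power(n, ncp_per_sample=2e-4):.3f}")
```

Under these assumptions a variant that is essentially undetectable at the smallest sample size becomes detectable only at the largest one, which is the motivation for the large sequencing cohorts described above.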

1.3 Overview of Thesis Research Direction

Even though the latest ALS GWAS detected only a small number of associated SNPs and the estimate of SNP-based heritability110,111 is low, it has still provided valuable information about the aetiology of ALS6,14,144,145. A lot of additional information can potentially be gained by integrating ALS GWAS results with other data resources146–150. Currently, the interpretation of ALS GWAS is limited to basic gene association and functional annotation110,111. Considering the wealth of human genome annotation and the multi-omics references that have recently become available146, it is possible to expand the ALS GWAS findings into more detailed and meaningful interpretations.

1.3.1 ALS genetic correlations with its risk factors

One of the most interesting insights from GWAS of many complex diseases is that many genes seem to have pleiotropic effects on other traits151,152. With LD-score regression (LDSR) methods, I am able to investigate these possible relationships153. Since ALS risk is known to be partly heritable, it is possible to test whether that genetic risk is shared with other traits. Moreover, LDSR also gives information on the direction of the correlations. This adds to the understanding of ALS risk factors, since ALS might be linked with many risk factors, but the direction of the correlations often conflicts between observational studies. Pleiotropic associations can be leveraged in profile risk score methods154,155. The low SNP-based heritability from current ALS GWAS might also be caused by sample sizes too small to detect more subtle risk loci. Overcoming this problem requires the collection of a large sample, which for a rare disease like ALS might not be practical in the short term. Therefore, using genetic studies of traits that correlate with ALS, which mostly have larger sample sizes, are easier to collect, and are publicly available, is a more practical and feasible strategy to improve ALS genetic predictors. Improved prediction from a combined predictor of ALS and correlated traits also provides better evidence of genetic correlation. This hypothesis is addressed in Chapter 2.
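The idea behind estimating a genetic correlation with LD-score regression can be sketched in a few lines of Python. The sketch relies on the standard LD-score regression expectations, E[chi2_j] = (N h2 / M) l_j + intercept for one trait and E[z1_j z2_j] = (sqrt(N1 N2) rho_g / M) l_j + intercept for a pair of traits, where l_j is the LD score of SNP j and M is the number of SNPs. Everything else is an assumption made for illustration: the function names, the unweighted least-squares fit (the LDSR software uses weighted regression with block-jackknife standard errors), and the noise-free toy inputs, which are simply set to their expected values.

```python
from math import sqrt


def ols(x, y):
    """Slope and intercept of a simple least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def snp_heritability(chi2, ld_scores, n_samples, n_snps):
    """h2 from the slope of chi-square statistics regressed on LD scores."""
    slope, _ = ols(ld_scores, chi2)
    return slope * n_snps / n_samples


def genetic_covariance(z_products, ld_scores, n1, n2, n_snps):
    """rho_g from the slope of z1_j * z2_j regressed on LD scores."""
    slope, _ = ols(ld_scores, z_products)
    return slope * n_snps / sqrt(n1 * n2)


def genetic_correlation(rho_g, h2_1, h2_2):
    """r_g = rho_g / sqrt(h2_1 * h2_2); its sign gives the direction."""
    return rho_g / sqrt(h2_1 * h2_2)


# Noise-free toy inputs (all values hypothetical)
N_SNPS = 1_000_000
N1, N2 = 80_000, 250_000
TRUE_H2_1, TRUE_H2_2, TRUE_RHO_G = 0.02, 0.10, -0.01
ld_scores = [20.0, 60.0, 110.0, 150.0, 240.0]

chi2_1 = [N1 * TRUE_H2_1 * ld / N_SNPS + 1 for ld in ld_scores]
chi2_2 = [N2 * TRUE_H2_2 * ld / N_SNPS + 1 for ld in ld_scores]
z_products = [sqrt(N1 * N2) * TRUE_RHO_G * ld / N_SNPS for ld in ld_scores]

h2_1 = snp_heritability(chi2_1, ld_scores, N1, N_SNPS)
h2_2 = snp_heritability(chi2_2, ld_scores, N2, N_SNPS)
rho_g = genetic_covariance(z_products, ld_scores, N1, N2, N_SNPS)
r_g = genetic_correlation(rho_g, h2_1, h2_2)
print(f"h2_1={h2_1:.3f}, h2_2={h2_2:.3f}, r_g={r_g:.3f}")
```

Because the estimate is built from signed z-score products, a negative slope translates directly into a negative genetic correlation, which is how the method reports the direction of a shared genetic effect.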

1.3.2 Better understanding of ALS aetiology by integrating multi-omics data

The main information provided by GWAS is the association effect of each DNA variant in relation to the phenotype (disease)156. With the availability of the current human genome reference map, it is possible to infer the genes that may be involved in the development of traits based on the genes located in the significant loci. However, the biological mechanisms in humans have proven to be more complex157,158. Genes located in significant GWAS loci are not necessarily the genes that are causal for the disease150,159–161. Moreover, most of the significant GWAS loci are not within the protein-coding regions of the genome150,161. Combined with current knowledge in cell biology, these insights led to the integration of regulatory-region annotations (histone modifications, promoters, enhancers, and methylation sites) with GWAS results146,150. Moreover, with the availability of results from various studies of expression Quantitative Trait Loci (eQTL) (i.e. SNP-gene expression associations), it is possible to narrow down GWAS results to a more precise set of genes that might be causal for the disease147,150. From basic knowledge of cell biology, backed by current expression data from various human and animal tissues, we know that only part of the genome is expressed in any given tissue type. Therefore, significant GWAS loci could have more impact on the expression of genes in some specific tissues. This insight led to the expansion of the LDSR methodology to partition heritability according to the genomic regions expressed in specific tissues (cell-type analysis)149, and to test for an enrichment of SNP-based heritability relative to the expectation for the region or annotation based on SNP count. Although the tissue affected in ALS is already clear (motor neurons)6,7, we cannot neglect the possibility that ALS aetiology is more complex and involves several other biological systems14.
For example, aberrant activity of the immune system is suspected to contribute to motor neuron death61. ALS risk factors also include physical activity85 and hypermetabolism4, which suggests some involvement of metabolism-related biological systems (as discussed above). Cell-type LDSR analysis is reported in Chapter 3. By combining eQTL data and GWAS data, it is possible to infer the causality of genes using a Mendelian Randomisation (MR) framework147. MR imitates a randomised controlled trial by using genetic variants as an “instrument” to support causal inferences about the effects of modifiable risk factors (exposures), instead of randomly assigning individuals as in a randomised controlled trial160,162. Currently, this framework is implemented in the SMR (Summary-based Mendelian Randomisation) software, which only requires summary statistics from GWAS and from eQTL studies of relevant tissues147. Here, the SNP is the instrument, and the SNP-eQTL and SNP-disease associations are used to infer causality between gene expression and the disease. The results of this analysis will provide a more precise suggestion of the genes that are causal for ALS in specific tissues, which leads to a better understanding of ALS aetiology. The method is applied in Chapter 3.
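The two summary-statistics calculations described in this section can be sketched as follows. The first function expresses heritability enrichment as the proportion of SNP-based heritability in an annotation divided by the proportion of SNPs in that annotation. The second follows the ratio form used in summary-data MR with a single instrument: the effect of expression on disease is estimated as b_xy = b_zy / b_zx, with an approximate 1-degree-of-freedom test statistic T = z_zy^2 z_zx^2 / (z_zy^2 + z_zx^2). Function names and all input numbers are hypothetical, and the sketch leaves out further steps of a real analysis, such as tests that distinguish a shared causal variant from linkage.

```python
from math import erfc, sqrt


def heritability_enrichment(h2_annotation, h2_total, snps_annotation, snps_total):
    """Proportion of h2 in an annotation relative to its proportion of SNPs."""
    return (h2_annotation / h2_total) / (snps_annotation / snps_total)


def smr_estimate(b_zx, se_zx, b_zy, se_zy):
    """Single-instrument MR from summary statistics.

    b_zx, se_zx: SNP effect on gene expression (eQTL) and its standard error.
    b_zy, se_zy: SNP effect on disease (GWAS) and its standard error.
    Returns the expression-on-disease effect, the test statistic, and a
    p-value from a chi-square distribution with 1 degree of freedom.
    """
    b_xy = b_zy / b_zx
    z_zx, z_zy = b_zx / se_zx, b_zy / se_zy
    t_smr = (z_zx ** 2 * z_zy ** 2) / (z_zx ** 2 + z_zy ** 2)
    p_value = erfc(sqrt(t_smr / 2))   # upper tail of chi-square(1)
    return b_xy, t_smr, p_value


# Hypothetical annotation: 30% of the heritability carried by 10% of the SNPs
enrichment = heritability_enrichment(0.006, 0.02, 100_000, 1_000_000)
print(f"enrichment = {enrichment:.2f}")

# Hypothetical eQTL and GWAS effects for one SNP-gene pair
b_xy, t_smr, p_value = smr_estimate(b_zx=0.5, se_zx=0.05, b_zy=0.04, se_zy=0.01)
print(f"b_xy = {b_xy:.3f}, T = {t_smr:.2f}, p = {p_value:.2e}")
```

An enrichment above 1 indicates that the annotation carries more heritability than its SNP count alone would predict, while the MR statistic is limited by the weaker of the two associations, so a strong eQTL cannot compensate for a weak GWAS signal.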

1.3.3 Explore the possibilities of the involvement of Microbiome data

Over the last several years, many studies have linked gut microbiome dysbiosis with metabolic and neurodegenerative disorders in humans60,163,164. The accumulating gut microbiome studies in Parkinson’s disease and Alzheimer’s disease provide some convergent evidence for a role of gut microbiome dysbiosis in the development of disease165–167. Some of the smaller cohort studies of the gut microbiome in ALS have reported conflicting results168–170 (likely due to small sample sizes). With the availability of the latest and largest ALS gut microbiome cohort (100 case-control faecal samples), it is possible to investigate the involvement of gut microbiota in the pathology of ALS. This work is reported in Chapter 4.

1.4 References

1. Rowland, L. P. How Amyotrophic Lateral Sclerosis Got Its Name: The Clinical-Pathologic Genius of Jean-Martin Charcot. Arch Neurol 58, 512–515 (2001).
2. Cleveland, D. W. & Rothstein, J. D. From Charcot to Lou Gehrig: deciphering selective motor neuron death in ALS. Nat Rev Neurosci 2, 806–819 (2001).
3. ICD-11 - Mortality and Morbidity Statistics. https://icd.who.int/browse11/l-m/en#/http://id.who.int/icd/entity/661720689.
4. Ngo, S. T. & Steyn, F. J. The interplay between metabolic homeostasis and neurodegeneration: insights into the neurometabolic nature of amyotrophic lateral sclerosis. Cell Regeneration 4, 5 (2015).
5. Rowland, L. P. & Shneider, N. A. Amyotrophic Lateral Sclerosis. New England Journal of Medicine 344, 1688–1700 (2001).
6. Brown, R. H. & Al-Chalabi, A. Amyotrophic Lateral Sclerosis. http://dx.doi.org/10.1056/NEJMra1603471 http://www.nejm.org/doi/10.1056/NEJMra1603471 (2017) doi:10.1056/NEJMra1603471.
7. van Es, M. A. et al. Amyotrophic lateral sclerosis. Lancet (2017) doi:10.1016/S0140-6736(17)31287-4.
8. Shellikeri, S. et al. The Neuropathological Signature of Bulbar-onset ALS: A Systematic Review. Neurosci Biobehav Rev 75, 378–392 (2017).
9. Purves, D. et al. Neuroscience. (Sinauer Associates is an imprint of Oxford University Press, 2011).
10. Longinetti, E. & Fang, F. Epidemiology of amyotrophic lateral sclerosis: an update of recent literature. Curr Opin Neurol 32, 771–776 (2019).
11. Mehta, P. Prevalence of Amyotrophic Lateral Sclerosis — United States, 2015. MMWR Morb Mortal Wkly Rep 67, (2018).
12. Chiò, A. et al. Global epidemiology of amyotrophic lateral sclerosis: a systematic review of the published literature. Neuroepidemiology 41, 118–130 (2013).
13. Alonso, A., Logroscino, G., Jick, S. S. & Hernán, M. A. Incidence and lifetime risk of motor neuron disease in the United Kingdom: a population-based study. Eur J Neurol 16, 745–751 (2009).

14. Taylor, J. P., Brown Jr, R. H. & Cleveland, D. W. Decoding ALS: from genes to mechanism. Nature 539, 197–206 (2016).
15. Mehta, P. R. et al. Younger age of onset in familial amyotrophic lateral sclerosis is a result of pathogenic gene variants, rather than ascertainment bias. J Neurol Neurosurg Psychiatry 90, 268–271 (2019).
16. Al-Chalabi, A. Perspective: Don’t keep it in the family. Nature 550, S112–S112 (2017).
17. Al-Chalabi, A. et al. Amyotrophic lateral sclerosis: moving towards a new classification system. Lancet Neurol 15, 1182–1194 (2016).
18. Al-Chalabi, A. & Hardiman, O. The epidemiology of ALS: a conspiracy of genes, environment and time. Nat Rev Neurol 9, 617–628 (2013).
19. Wang, M.-D., Little, J., Gomes, J., Cashman, N. R. & Krewski, D. Identification of risk factors associated with onset and progression of amyotrophic lateral sclerosis using systematic review and meta-analysis. Neurotoxicology 61, 101–130 (2017).
20. Calvo, A. et al. Influence of cigarette smoking on ALS outcome: a population-based study. J. Neurol. Neurosurg. Psychiatr. 87, 1229–1233 (2016).
21. Weisskopf, M. G. et al. Prospective study of military service and mortality from ALS. Neurology 64, 32–37 (2005).
22. Åberg, M. et al. Risk factors in Swedish young men for amyotrophic lateral sclerosis in adulthood. J. Neurol. 265, 460–470 (2018).
23. Belbasis, L., Bellou, V. & Evangelou, E. Environmental Risk Factors and Amyotrophic Lateral Sclerosis: An Umbrella Review and Critical Assessment of Current Evidence from Systematic Reviews and Meta-Analyses of Observational Studies. NED 46, 96–105 (2016).
24. Sutedja, N. A. et al. Exposure to chemicals and metals and risk of amyotrophic lateral sclerosis: a systematic review. Amyotroph Lateral Scler 10, 302–309 (2009).
25. Sánchez-Díaz, G. et al. Geographic Analysis of Motor Neuron Disease Mortality and Heavy Metals Released to Rivers in Spain. Int J Environ Res Public Health 15, (2018).
26. Dupuis, L., Pradat, P.-F., Ludolph, A. C. & Loeffler, J.-P. Energy metabolism in amyotrophic lateral sclerosis. Lancet Neurol 10, 75–82 (2011).
27. Van den Bergh, R., Swerts, L., Hendrikx, A., Boni, L. & Meulepas, E. Adipose tissue cellularity in patients with amyotrophic lateral sclerosis. Clin Neurol Neurosurg 80, 226–239 (1977).

28. Ahmed, R. M. et al. Amyotrophic lateral sclerosis and frontotemporal dementia: distinct and overlapping changes in eating behaviour and metabolism. Lancet Neurol 15, 332–342 (2016).
29. Lim, M. A. et al. Reduced Activity of AMP-Activated Protein Kinase Protects against Genetic Models of Motor Neuron Disease. J Neurosci 32, 1123–1141 (2012).
30. Desport, J. C. et al. Factors correlated with hypermetabolism in patients with amyotrophic lateral sclerosis. Am. J. Clin. Nutr. 74, 328–334 (2001).
31. Funalot, B., Desport, J.-C., Sturtz, F., Camu, W. & Couratier, P. High metabolic level in patients with familial amyotrophic lateral sclerosis. Amyotroph Lateral Scler 10, 113–117 (2009).
32. Forum, I. of M. (US) F. Influence of the Microbiome on the Metabolism of Diet and Dietary Components. (National Academies Press (US), 2013).
33. Chassaing, B. & Gewirtz, A. T. Chapter 35 - Gut Microbiome and Metabolism. in Physiology of the Gastrointestinal Tract (Sixth Edition) (ed. Said, H. M.) 775–793 (Academic Press, 2018). doi:10.1016/B978-0-12-809954-4.00035-9.
34. Visconti, A. et al. Interplay between the human gut microbiome and host metabolism. Nat Commun 10, 1–10 (2019).
35. den Besten, G. et al. The role of short-chain fatty acids in the interplay between diet, gut microbiota, and host energy metabolism. J Lipid Res 54, 2325–2340 (2013).
36. metabolism | Definition, Process, & Biology. Encyclopedia Britannica https://www.britannica.com/science/metabolism.
37. McNab, B. K. On the utility of uniformity in the definition of basal rate of metabolism. Physiol. Zool. 70, 718–720 (1997).
38. Manini, T. M. Energy Expenditure and Aging. Ageing Res Rev 9, 1 (2010).
39. Albersen, M. et al. Whole body composition analysis by the BodPod air-displacement plethysmography method in children with phenylketonuria shows a higher body fat percentage. J Inherit Metab Dis 33, 283–288 (2010).
40. Collins, M. A. et al. Evaluation of the BOD POD for assessing body fat in collegiate football players. Med Sci Sports Exerc 31, 1350–1356 (1999).
41. Rinschen, M. M., Ivanisevic, J., Giera, M. & Siuzdak, G. Identification of bioactive metabolites using activity metabolomics. Nat Rev Mol Cell Biol 20, 353–367 (2019).
42. Clish, C. B. Metabolomics: an emerging but powerful tool for precision medicine. Cold Spring Harb Mol Case Stud 1, (2015).

43. Alonso, A., Marsal, S. & Julià, A. Analytical Methods in Untargeted Metabolomics: State of the Art in 2015. Front. Bioeng. Biotechnol. 3, (2015).
44. Gowda, G. A. N. & Djukovic, D. Overview of Mass Spectrometry-Based Metabolomics: Opportunities and Challenges. Methods Mol Biol 1198, 3–12 (2014).
45. Alul, F. Y. et al. The heritability of metabolic profiles in newborn twins. 110, 253–258 (2013).
46. Caesar, L. K., Kellogg, J. J., Kvalheim, O. M. & Cech, N. B. Opportunities and Limitations for Untargeted Mass Spectrometry Metabolomics to Identify Biologically Active Constituents in Complex Natural Product Mixtures. J. Nat. Prod. 82, 469–484 (2019).
47. Wang, G. et al. Tracking Blood Glucose and Predicting Prediabetes in Chinese Children and Adolescents: A Prospective Twin Study. PLOS ONE 6, e28573 (2011).
48. Schousboe, K. et al. Twin study of genetic and environmental influences on glucose tolerance and indices of insulin sensitivity and secretion. Diabetologia 46, 1276–1283 (2003).
49. Johnson, C. H. & Gonzalez, F. J. Challenges and Opportunities of Metabolomics. J Cell Physiol 227, 2975–2981 (2012).
50. Wilson, I. D. & Nicholson, J. K. Gut microbiome interactions with drug metabolism, efficacy, and toxicity. Translational Research 179, 204–222 (2017).
51. Wilson, I. D. & Nicholson, J. K. The role of gut microbiota in drug response. Curr. Pharm. Des. 15, 1519–1523 (2009).
52. Wallace, B. D. et al. Alleviating Cancer Drug Toxicity by Inhibiting a Bacterial Enzyme. Science 330, 831–835 (2010).
53. Goodrich, J. K., Davenport, E. R., Clark, A. G. & Ley, R. E. The Relationship Between the Human Genome and Microbiome Comes into View. Annual Review of Genetics 51, 413–433 (2017).
54. Thion, M. S. et al. Microbiome Influences Prenatal and Adult Microglia in a Sex-Specific Manner. Cell 172, 500-516.e16 (2018).
55. Luczynski, P. et al. Growing up in a Bubble: Using Germ-Free to Assess the Influence of the Gut Microbiota on Brain and Behavior. Int. J. Neuropsychopharmacol. 19, (2016).
56. Nithianantharajah, J., Balasuriya, G. K., Franks, A. E. & Hill-Yardin, E. L. Using Animal Models to Study the Role of the Gut–Brain Axis in Autism. Curr Dev Disord Rep 4, 28–36 (2017).

57. Avetisyan, M., Schill, E. M. & Heuckeroth, R. O. Building a second brain in the bowel. J Clin Invest 125, 899–907 (2015).
58. Rao, M. & Gershon, M. D. The bowel and beyond: the enteric nervous system in neurological disorders. Nat Rev Gastroenterol Hepatol 13, 517–528 (2016).
59. Jänig, W. Autonomic Nervous System. in Human Physiology (eds. Schmidt, R. F. & Thews, G.) 333–370 (Springer Berlin Heidelberg, 1989). doi:10.1007/978-3-642-73831-9_16.
60. Tremlett, H., Bauer, K. C., Appel-Cresswell, S., Finlay, B. B. & Waubant, E. The gut microbiome in human neurological disease: A review. Ann. Neurol. 81, 369–382 (2017).
61. Fung, T. C., Olson, C. A. & Hsiao, E. Y. Interactions between the microbiota, immune and nervous systems in health and disease. Nature Neuroscience 20, 145–155 (2017).
62. Powell, N., Walker, M. M. & Talley, N. J. The mucosal immune system: master regulator of bidirectional gut–brain communications. Nature Reviews Gastroenterology and Hepatology 14, 143 (2017).
63. Cryan, J. F. & Dinan, T. G. Gut microbiota: Microbiota and neuroimmune signalling—Metchnikoff to microglia. Nature Reviews Gastroenterology & Hepatology 12, 494–496 (2015).
64. Erny, D. et al. Host microbiota constantly control maturation and function of microglia in the CNS. Nat. Neurosci. 18, 965–977 (2015).
65. Khakh, B. S. & Sofroniew, M. V. Diversity of astrocyte functions and in neural circuits. Nat Neurosci 18, 942–952 (2015).
66. Dong, Y. & Benveniste, E. N. Immune function of astrocytes. Glia 36, 180–190 (2001).
67. Rothhammer, V. et al. Type I interferons and microbial metabolites of tryptophan modulate astrocyte activity and central nervous system inflammation via the aryl hydrocarbon receptor. Nat. Med. 22, 586–597 (2016).
68. Zelante, T. et al. Tryptophan Catabolites from Microbiota Engage Aryl Hydrocarbon Receptor and Balance Mucosal Reactivity via Interleukin-22. Immunity 39, 372–385 (2013).
69. Roager, H. M. & Licht, T. R. Microbial tryptophan catabolites in health and disease. Nature Communications 9, 3294 (2018).
70. Oliveira Santos, M., Caldeira, I., Gromicho, M., Pronto-Laborinho, A. & de Carvalho, M. Brain white matter demyelinating lesions and amyotrophic lateral sclerosis in a patient with C9orf72 hexanucleotide repeat expansion. Multiple Sclerosis and Related Disorders 17, 1–4 (2017).
71. Lyte, M. Microbial Endocrinology in the Microbiome-Gut-Brain Axis: How Bacterial Production and Utilization of Neurochemicals Influence Behavior. PLOS Pathogens 9, e1003726 (2013).
72. Sandrini, S., Aldriwesh, M., Alruways, M. & Freestone, P. Microbial endocrinology: host–bacteria communication within the gut microbiome. Journal of Endocrinology 225, R21–R34 (2015).
73. Gershon, M. D. & Tack, J. The serotonin signaling system: from basic understanding to drug development for functional GI disorders. Gastroenterology 132, 397–414 (2007).
74. Neuman, H., Debelius, J. W., Knight, R. & Koren, O. Microbial endocrinology: the interplay between the microbiota and the endocrine system. FEMS Microbiol Rev 39, 509–521 (2015).
75. Bonaz, B., Bazin, T. & Pellissier, S. The Vagus Nerve at the Interface of the Microbiota-Gut-Brain Axis. Front Neurosci 12, (2018).
76. Park, A. J. et al. Altered colonic function and microbiota profile in a mouse model of chronic depression. Neurogastroenterol. Motil. 25, 733-e575 (2013).
77. Fourie, N. H. et al. Structural and functional alterations in the colonic microbiome of the rat in a model of stress induced irritable bowel syndrome. Gut Microbes 8, 33–45 (2017).
78. Wilkinson, E. M., Ilhan, Z. E. & Herbst-Kralovetz, M. M. Microbiota–drug interactions: Impact on metabolism and efficacy of therapeutics. Maturitas 112, 53–63 (2018).
79. Huber, B. et al. Production of Toxic Volatile Trimethylbismuth by the Intestinal Microbiota of Mice. Journal of Toxicology https://www.hindawi.com/journals/jt/2011/491039/ (2011) doi:10.1155/2011/491039.
80. Carter, A. T. & Peck, M. W. Genomes, neurotoxins and biology of Clostridium botulinum Group I and Group II. Res Microbiol 166, 303–317 (2015).
81. Dittmann, E., Fewer, D. P. & Neilan, B. A. Cyanobacterial toxins: biosynthetic routes and evolutionary roots. FEMS Microbiology Reviews 37, 23–43 (2013).
82. Huisman, M. H. B. et al. Lifetime physical activity and the risk of amyotrophic lateral sclerosis. J. Neurol. Neurosurg. Psychiatry 84, 976–981 (2013).
83. Chio, A. et al. ALS in Italian professional soccer players: the risk is still present and could be soccer-specific. Amyotroph Lateral Scler 10, 205–209 (2009).

55 84. Lehman, E. J., Hein, M. J., Baron, S. L. & Gersic, C. M. Neurodegenerative causes of death among retired National Football League players. Neurology 79, 1970–1974 (2012). 85. Longinetti, E. et al. Physical and cognitive fitness in young adulthood and risk of amyotrophic lateral sclerosis at an early age. European Journal of Neurology 24, 137– 142 (2017). 86. Okamoto, K. et al. Lifestyle Factors and Risk of Amyotrophic Lateral Sclerosis: A Case-Control Study in Japan. Annals of Epidemiology 19, 359–364 (2009). 87. Harwood, C. A. et al. Long-term physical activity: an exogenous risk factor for sporadic amyotrophic lateral sclerosis? Amyotroph Lateral Scler Frontotemporal Degener 17, 377–384 (2016). 88. Yu, B. & Pamphlett, R. Environmental insults: critical triggers for amyotrophic lateral sclerosis. Translational Neurodegeneration 6, 15 (2017). 89. Bozzoni, V. et al. Amyotrophic lateral sclerosis and environmental factors. Funct Neurol 31, 7–19 (2016). 90. Oskarsson, B., Horton, D. K. & Mitsumoto, H. Potential Environmental Factors in Amyotrophic Lateral Sclerosis. Neurol Clin 33, 877–888 (2015). 91. Beard, J. D. & Kamel, F. Military service, deployments, and exposures in relation to amyotrophic lateral sclerosis etiology and survival. Epidemiol Rev 37, 55–70 (2015). 92. Seals, R. M., Kioumourtzoglou, M.-A., Gredal, O., Hansen, J. & Weisskopf, M. G. ALS and the Military: A Population-Based Study in the Danish Registries. Epidemiology 27, 188–193 (2016). 93. Weisskopf, M. G., Cudkowicz, M. E. & Johnson, N. Military Service and Amyotrophic Lateral Sclerosis in a Population-based Cohort. Epidemiology 26, 831–838 (2015). 94. Trojsi, F., Monsurrò, M. R. & Tedeschi, G. Exposure to environmental toxicants and pathogenesis of amyotrophic lateral sclerosis: state of the art and research perspectives. Int J Mol Sci 14, 15286–15311 (2013). 95. Burns, C., Beard, K. & Cartmill, J. 
Mortality in chemical workers potentially exposed to 2,4-dichlorophenoxyacetic acid (2,4-D) 1945-94: an update. Occup Environ Med 58, 24–30 (2001). 96. Capozzella, A. et al. Work related etiology of amyotrophic lateral sclerosis (ALS): a meta-analysis. Ann Ig 26, 456–472 (2014).

56 97. Caller, T. A. et al. A cluster of amyotrophic lateral sclerosis in New Hampshire: a possible role for toxic cyanobacteria blooms. Amyotroph Lateral Scler 10 Suppl 2, 101– 108 (2009). 98. Masseret, E. et al. Dietary BMAA Exposure in an Amyotrophic Lateral Sclerosis Cluster from Southern France. PLOS ONE 8, e83406 (2013). 99. Armon, C. Smoking may be considered an established risk factor for sporadic ALS. Neurology 73, 1693–1698 (2009). 100. Wang, H. et al. Smoking and risk of amyotrophic lateral sclerosis: a pooled analysis of five prospective cohorts. Arch Neurol 68, 207–213 (2011). 101. Rippon, G. A. et al. An observational study of cognitive impairment in amyotrophic lateral sclerosis. Arch. Neurol. 63, 345–352 (2006). 102. Irwin, D., Lippa, C. F. & Swearer, J. M. Cognition and amyotrophic lateral sclerosis (ALS). Am J Alzheimers Dis Other Demen 22, 300–312 (2007). 103. Raaphorst, J., de Visser, M., Linssen, W. H. J. P., de Haan, R. J. & Schmand, B. The cognitive profile of amyotrophic lateral sclerosis: A meta-analysis. Amyotroph Lateral Scler 11, 27–37 (2010). 104. Ringholz, G. M. et al. Prevalence and patterns of cognitive impairment in sporadic ALS. Neurology 65, 586–590 (2005). 105. Yu, Y. et al. Environmental Risk Factors and Amyotrophic Lateral Sclerosis (ALS): A Case-Control Study of ALS in Michigan. PLOS ONE 9, e101186 (2014). 106. Montuschi, A. et al. Cognitive correlates in amyotrophic lateral sclerosis: a population- based study in Italy. J Neurol Neurosurg Psychiatry 86, 168–173 (2015). 107. Eykens, C. & Robberecht, W. The genetic basis of amyotrophic lateral sclerosis: recent breakthroughs. Advances in Genomics and Genetics https://www.dovepress.com/the- genetic-basis-of-amyotrophic-lateral-sclerosis-recent-breakthrough-peer-reviewed- article-AGG (2015) doi:10.2147/AGG.S57397. 108. Brown, R. H. & Al-Chalabi, A. Amyotrophic Lateral Sclerosis. N. Engl. J. Med. 377, 162–172 (2017). 109. Hayyan, M., Hashim, M. A. & AlNashef, I. M. 
Superoxide Ion: Generation and Chemical Implications. Chem. Rev. 116, 3029–3085 (2016). 110. Nicolas, A. et al. Genome-wide Analyses Identify KIF5A as a Novel ALS Gene. Neuron 97, 1268-1283.e6 (2018).

57 111. van Rheenen, W. et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nat Genet 48, 1043–1048 (2016). 112. van Blitterswijk, M. et al. Evidence for an oligogenic basis of amyotrophic lateral sclerosis. Hum Mol Genet 21, 3776–3784 (2012). 113. Cady, J. et al. Amyotrophic lateral sclerosis onset is influenced by the burden of rare variants in known amyotrophic lateral sclerosis genes. Ann Neurol 77, 100–113 (2015). 114. Lehninger, A., Nelson, D. & Cox, M. Lehninger Principles of Biochemistry. (W. H. Freeman, 2008). 115. Watts, G. D. J. et al. Inclusion body myopathy associated with Paget disease of bone and frontotemporal dementia is caused by mutant valosin-containing protein. Nat Genet 36, 377–381 (2004). 116. Johnson, J. O. et al. Exome Sequencing Reveals VCP Mutations as a Cause of Familial ALS. Neuron 68, 857–864 (2010). 117. Maruyama, H. et al. Mutations of optineurin in amyotrophic lateral sclerosis. Nature 465, 223–226 (2010). 118. Polymenidou, M. et al. Long pre-mRNA depletion and RNA missplicing contribute to neuronal vulnerability from loss of TDP-43. Nat. Neurosci. 14, 459–468 (2011). 119. Afroz, T. et al. Functional and dynamic polymerization of the ALS-linked protein TDP- 43 antagonizes its pathologic aggregation. Nat Commun 8, 45 (2017). 120. Lagier-Tourenne, C. et al. Divergent roles of ALS-linked proteins FUS/TLS and TDP- 43 intersect in processing long pre-mRNAs. Nat. Neurosci. 15, 1488–1497 (2012). 121. EPHA4 is a disease modifier of amyotrophic lateral sclerosis in animal models and in humans : Nature Medicine : Nature Research. http://www.nature.com/nm/journal/v18/n9/full/nm.2901.html. 122. A Mutation in the Vesicle-Trafficking Protein VAPB Causes Late-Onset Spinal Muscular Atrophy and Amyotrophic Lateral Sclerosis. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182111/. 123. Yamanaka, K. et al. 
Mutant SOD1 in cell types other than motor neurons and oligodendrocytes accelerates onset of disease in ALS mice. Proc. Natl. Acad. Sci. U.S.A. 105, 7594–7599 (2008). 124. Jäkel, S. & Dimou, L. Glial Cells and Their Function in the Adult Brain: A Journey through the History of Their Ablation. Front Cell Neurosci 11, 24 (2017).

58 125. Frakes, A. E. et al. Microglia induce motor neuron death via the classical NF-κB pathway in amyotrophic lateral sclerosis. Neuron 81, 1009–1023 (2014). 126. Harraz, M. M. et al. SOD1 mutations disrupt redox-sensitive Rac regulation of NADPH oxidase in a familial ALS model. J. Clin. Invest. 118, 659–670 (2008). 127. Jiang, J. et al. Gain of Toxicity from ALS/FTD-Linked Repeat Expansions in C9ORF72 Is Alleviated by Antisense Oligonucleotides Targeting GGGGCC-Containing RNAs. Neuron 90, 535–550 (2016). 128. Lee, Y. et al. Oligodendroglia metabolically support axons and contribute to neurodegeneration. Nature 487, 443–448 (2012). 129. Van Damme, P. et al. Astrocytes regulate GluR2 expression in motor neurons and their vulnerability to excitotoxicity. Proc. Natl. Acad. Sci. U.S.A. 104, 14825–14830 (2007). 130. Wang, L., Gutmann, D. H. & Roos, R. P. Astrocyte loss of mutant SOD1 delays ALS disease onset and progression in G85R transgenic mice. Hum. Mol. Genet. 20, 286–293 (2011). 131. Howland, D. S. et al. Focal loss of the glutamate transporter EAAT2 in a transgenic rat model of SOD1 mutant-mediated amyotrophic lateral sclerosis (ALS). Proc. Natl. Acad. Sci. U.S.A. 99, 1604–1609 (2002). 132. Walker, A. K. & Atkin, J. D. Stress signaling from the endoplasmic reticulum: A central player in the pathogenesis of amyotrophic lateral sclerosis. IUBMB Life 63, 754–763 (2011). 133. Nishitoh, H. et al. ALS-linked mutant SOD1 induces ER stress- and ASK1-dependent motor neuron death by targeting Derlin-1. Genes Dev. 22, 1451–1464 (2008). 134. Montibeller, L. & de Belleroche, J. Amyotrophic lateral sclerosis (ALS) and Alzheimer’s disease (AD) are characterised by differential activation of ER stress pathways: focus on UPR target genes. Cell Stress and Chaperones 23, 897–912 (2018). 135. Schröder, M. & Kaufman, R. J. The mammalian unfolded protein response. Annu. Rev. Biochem. 74, 739–789 (2005). 136. Jaronen, M., Goldsteins, G. & Koistinaho, J. 
ER stress and unfolded protein response in amyotrophic lateral sclerosis—a controversial role of protein disulphide isomerase. Front Cell Neurosci 8, (2014). 137. Kwok, C. T., Morris, A. & de Belleroche, J. S. Sequestosome-1 ( SQSTM1 ) sequence variants in ALS cases in the UK: prevalence and coexistence of SQSTM1 mutations in ALS kindred with PDB. European Journal of Human Genetics 22, 492–496 (2014).

59 138. Cox, L. E. et al. Mutations in CHMP2B in Lower Motor Neuron Predominant Amyotrophic Lateral Sclerosis (ALS). PLoS One 5, (2010). 139. Williamson, T. L. & Cleveland, D. W. Slowing of axonal transport is a very early event in the toxicity of ALS-linked SOD1 mutants to motor neurons. Nat. Neurosci. 2, 50–56 (1999). 140. Al-Chalabi, A. et al. An estimate of amyotrophic lateral sclerosis heritability using twin data. Journal of Neurology, Neurosurgery & Psychiatry jnnp.2010.207464 (2010) doi:10.1136/jnnp.2010.207464. 141. Trabjerg, B. B. et al. ALS in Danish Registries: Heritability and links to psychiatric and cardiovascular disorders. Neurol Genet 6, e398 (2020). 142. Young, A. I. Solving the missing heritability problem. PLOS Genetics 15, e1008222 (2019). 143. Project MinE: study design and pilot analyses of a large-scale whole-genome sequencing study in amyotrophic lateral sclerosis. European Journal of Human Genetics 1 (2018) doi:10.1038/s41431-018-0177-4. 144. Mejzini, R. et al. ALS Genetics, Mechanisms, and Therapeutics: Where Are We Now? Front Neurosci 13, (2019). 145. Bandres-Ciga, S. et al. Shared polygenic risk and causal inferences in amyotrophic lateral sclerosis. Ann. Neurol. 85, 470–481 (2019). 146. Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179–189 (2020). 147. Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nature Genetics 48, 481–487 (2016). 148. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome- wide association summary statistics. Nature Genetics 47, 1228–1235 (2015). 149. Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics 50, 621–629 (2018). 150. Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. The American Journal of Human Genetics 101, 5–22 (2017). 151. Bulik-Sullivan, B. et al. 
An atlas of genetic correlations across human diseases and traits. Nature Genetics 47, 1236–1241 (2015). 152. Rheenen, W. van, Peyrot, W. J., Schork, A. J., Lee, S. H. & Wray, N. R. Genetic correlations of polygenic disease traits: from theory to practice. Nat Rev Genet 20, 567– 581 (2019).

60 153. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015). 154. Turley, P. et al. MTAG: Multi-Trait Analysis of GWAS. bioRxiv 118810 (2017) doi:10.1101/118810. 155. Maier, R. M. et al. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nature Communications 9, 989 (2018). 156. Caballero, A., Tenesa, A. & Keightley, P. D. The Nature of Genetic Variation for Complex Traits Revealed by GWAS and Regional Heritability Mapping Analyses. Genetics 201, 1601–1613 (2015). 157. Cowie, P., Ross, R. & MacKenzie, A. Understanding the Dynamics of Gene Regulatory Systems; Characterisation and Clinical Relevance of cis-Regulatory Polymorphisms. Biology (Basel) 2, 64–84 (2013). 158. Gene Expression and Regulation | Learn Science at Scitable. https://www.nature.com/scitable/topic/gene-expression-and-regulation-15/. 159. Battle, A. & Montgomery, S. B. Determining causality and consequence of expression quantitative trait loci. Human Genetics 133, 727–735 (2014). 160. Davies, N. M., Holmes, M. V. & Smith, G. D. Reading Mendelian randomisation studies: a guide, glossary, and checklist for clinicians. BMJ 362, k601 (2018). 161. De, R., Bush, W. S. & Moore, J. H. Bioinformatics challenges in genome-wide association studies (GWAS). Methods Mol. Biol. 1168, 63–81 (2014). 162. Teumer, A. Common Methods for Performing . Front. Cardiovasc. Med. 5, (2018). 163. Hills, R. D. et al. Gut Microbiome: Profound Implications for Diet and Disease. Nutrients 11, (2019). 164. Fredricks, D. N. The Human Microbiota: How Microbial Communities Affect Health and Disease. (Wiley-Blackwell, 2013). 165. Klingelhoefer, L. & Reichmann, H. Pathogenesis of Parkinson disease—the gut–brain axis and environmental factors. Nature Reviews Neurology 11, 625 (2015). 166. Zhao, Y., Jaber, V. & Lukiw, W. J. 
Secretory Products of the Human GI Tract Microbiome and Their Potential Impact on Alzheimer’s Disease (AD): Detection of Lipopolysaccharide (LPS) in AD Hippocampus. Front Cell Infect Microbiol 7, 318 (2017).

61 167. Yang, X., Qian, Y., Xu, S., Song, Y. & Xiao, Q. Longitudinal Analysis of Fecal Microbiome and Pathologic Processes in a Rotenone Induced Mice Model of Parkinson’s Disease. Front. Aging Neurosci. 9, (2018). 168. Rowin, J., Xia, Y., Jung, B. & Sun, J. Gut inflammation and dysbiosis in human motor neuron disease. Physiol Rep 5, (2017). 169. Brenner, D. et al. The fecal microbiome of ALS patients. Neurobiology of Aging 61, 132–137 (2018). 170. Wright, M. L. et al. Potential Role of the Gut Microbiome in ALS: A Systematic Review. Biol Res Nurs 20, 513–521 (2018).


Chapter 2: ALS genetic correlation with risk factors

Published as:

Amyotrophic Lateral Sclerosis Genetic Correlation with Cognitive Performance, Educational Attainment and Schizophrenia: Evidence from Polygenic Risk Score Analysis

European Journal of Human Genetics (Accepted for Publication) 2020

Contribution details at “List of publication that included in this thesis” number 1 (page 6).

2.1 Abstract

Amyotrophic Lateral Sclerosis (ALS) is recognised to be a complex neurodegenerative disease involving both genetic and non-genetic risk factors. The underlying causes and risk factors for the majority of cases remain unknown; however, ever-larger genetic studies and new methodologies promise an enhanced understanding. Recent analyses using published summary statistics from the largest ALS genome-wide association study (GWAS) (20,806 ALS cases and 59,804 healthy controls) identified that schizophrenia (SCZ), cognitive performance (CP) and educational attainment (EA) related traits were genetically correlated with ALS. To provide additional evidence for these correlations, we built single and multi-trait genetic predictors using GWAS summary statistics for ALS and these traits (SCZ, CP, EA) and evaluated them in an independent Australian cohort (846 ALS cases and 665 healthy controls). We compared methods for generating the risk predictors and found that the combination of traits improved the prediction (Nagelkerke-R2) of the case-control logistic regression. The combination of ALS, SCZ, CP, and EA using the SBayesR predictor method gave the highest prediction (Nagelkerke-R2) of 0.027 (P-value = 4.6 × 10^-8), with the odds ratio for estimated disease risk between the highest and lowest deciles of individuals being 3.15 (95% CI 1.96 - 5.05). These results support the genetic correlation between ALS, SCZ, CP, and EA, providing a better understanding of the complexity of ALS.

2.2 Introduction

Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease, with death typically occurring within 3 to 5 years of symptom onset1,2. Currently, there is no effective treatment for ALS, likely in part due to a limited understanding of its underlying causes1–3. Approximately 10% of ALS cases are considered to have a Mendelian form of the disease1,3, carrying a highly penetrant variant; however, the remaining vast majority of cases likely have a more complex aetiology reflecting both genetic and environmental susceptibility risk factors4.

Cognitive impairment affects approximately 30% of ALS patients, with ~10% diagnosed with frontotemporal dementia (FTD)5,6. However, most studies consider cognitive impairment as a consequence of the ALS-FTD spectrum, rather than discussing attenuated cognitive performance as a risk factor for ALS7–9. Recent studies that link ALS and cognitive performance remain inconclusive. For example, lower educational attainment was reported to increase ALS-FTD risk in Italian and United States (US) ALS cohorts10,11. However, a larger cohort from the Swedish Conscript Register reported the opposite result: higher IQ in young conscripts was associated with higher ALS risk in later life5. Traditional prospective studies designed to investigate whether cognitive decline is a risk factor that precedes the diagnosis of ALS are hard to establish, since ALS is a relatively low frequency disorder with a lifetime risk of ~0.3%1,2. Demonstrating a genetic relationship between ALS and cognitive traits would provide more conclusive evidence for the direction of association between them.

Cognitive related traits have high heritability estimates (~60%)12,13 and are frequently measured in large population or community samples, particularly using proxies such as educational attainment. In contrast, while ALS has a substantial genetic component (heritability of 40-45%)14, it is a late onset disease of relatively low frequency, making the collection of large cohorts difficult. Advances in genotyping technology allow estimation of the genetic contribution to traits associated with SNPs measured genome-wide. These estimates of so-called SNP-based heritability use genome-wide association study (GWAS) data to capture the contribution from common genetic variants and so are smaller than heritability estimates from family studies. The meta-analysis of the published ALS GWAS reported a SNP heritability estimate of ~8%15. Moreover, we can use GWAS summary statistics to estimate the genetic correlation (rg) between traits using independently collected samples; for other diseases and traits, correlation estimates made from genome-wide SNPs have been found to be similar to estimates from traditional epidemiology16. Using these approaches, ALS has been reported to be negatively correlated with fluid intelligence (rg = –0.34) and academic or professional qualifications (rg = –0.25)17, and positively correlated with schizophrenia (rg = 0.14)18. Combining information from genetically correlated traits can improve genetic predictors of disease risk19,20, particularly for diseases such as ALS where cohort sizes are relatively small21.

Here, we investigate the genetic relationship between ALS and over 700 traits using linkage disequilibrium score regression (LDSR) as implemented in the LDhub ver1.90 platform22, confirming previously identified correlations with ALS. We provide independent evidence for a genetic relationship between ALS and these traits using out-of-sample polygenic risk prediction into an independent data set of 846 ALS cases and 665 controls, demonstrating improvements in genetic prediction when combining multiple traits.

2.3 Material and Method

2.3.1 Australian ALS GWAS data

We present new data from an Australian ALS GWAS cohort comprising 836 cases and 665 controls, independent of all published ALS GWAS data. The sample includes the University of Sydney’s Australian Motor Neuron Disease DNA Bank (MND Bank) cohort recruited between April 2000 and June 2011 (462 cases, 449 controls), with the study protocol approved by the Sydney South West Area Health Service Human Research Ethics Committee (HREC). The remainder of the cases (N=374) comprised ALS patients recruited from clinics across Australia between 2015 and 2017 under HREC approvals from the University of Sydney, Western Sydney Local Health District, Royal Brisbane and Women’s Hospital, and Macquarie University. The ALS cases were diagnosed with definite or probable ALS according to the revised El Escorial criteria23. Those with a recorded family history of ALS, or who had tested positive for genetic variants with strong support for ALS causality, were excluded. Some controls (N=127) were recruited as either partners or friends of patients, and were healthy individuals free of neuromuscular diseases. Additional controls were included from the Older Australian Twin Study24, comprising 89 monozygotic (MZ) twin pairs from QIMR Berghofer Medical Research Institute, University of New South Wales and the University of Melbourne, with approval from their respective HRECs. Twin pair data helped in quality control checks, but only one twin from each pair was used in analyses. DNA was extracted using standard protocols and genotyped using the Infinium CoreExome-24 version 1.1 array, producing ~300,000 informative genome-wide SNP markers. Standard GWAS quality control (QC) steps were performed using PLINK25 version 1.9, including sex checks (incompatible sex between the genotyping result on the X-chromosome and the individual’s clinical record) and the removal of SNPs that were genotyped in < 95% of individuals, had a low minor allele frequency (MAF < 0.01) or deviated from Hardy-Weinberg Equilibrium (HWE; p < 1 × 10^-6).
A total of ~250,000 SNPs passed quality control and were imputed to the Haplotype Reference Consortium reference panel (Version r1.1 2016)26 as implemented in the Sanger Institute Imputation Server. SNPs with poor imputation accuracy (info score < 0.8) and low-frequency SNPs (MAF < 0.01) were removed, leaving 6,681,912 SNPs for later analysis. The QC on individuals included filtering related individuals (Identity By Descent, IBD > 0.05, PLINK 1.9 “--genome” command) and individuals known to harbour Mendelian-like variants associated with ALS. To remove ancestry outliers, we projected our case-control cohort onto the first two principal components (PCs) of the 1000 Genomes cohort27 using GCTA’s PC loading method28. We removed ancestry outliers that deviated by more than four standard deviations from the European population mean (calculated using 1000 Genomes Northern European (CEU), British (GBR), Finnish (FIN), Iberian Spanish (IBS), and Toscani Italian (TSI) samples), leaving 1501 individuals (836 cases and 665 controls) for further analyses.
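The ancestry outlier rule above can be illustrated with a minimal Python sketch. It assumes the projected PC coordinates are already available as plain tuples and that the four standard deviation rule is applied to each PC separately; the text does not specify this detail, so that choice, along with all function and variable names, is an assumption for illustration rather than the pipeline used in this study.

```python
from statistics import mean, stdev


def ancestry_outliers(reference_pcs, sample_pcs, n_sd=4.0):
    """Return indices of samples lying more than n_sd reference standard
    deviations from the reference mean on at least one PC.

    reference_pcs: list of (PC1, PC2) tuples for the reference population
                   (e.g. the European 1000 Genomes samples).
    sample_pcs:    list of (PC1, PC2) tuples for the projected study samples.
    Assumption: the rule is applied per PC, not to a joint distance.
    """
    n_pcs = len(reference_pcs[0])
    centres = [mean(p[k] for p in reference_pcs) for k in range(n_pcs)]
    spreads = [stdev(p[k] for p in reference_pcs) for k in range(n_pcs)]

    outliers = []
    for i, pcs in enumerate(sample_pcs):
        if any(abs(pcs[k] - centres[k]) > n_sd * spreads[k] for k in range(n_pcs)):
            outliers.append(i)
    return outliers


# Toy example with made-up coordinates (not real data).
reference = [(-1, -1), (1, 1), (-1, 1), (1, -1), (0, 0)]
samples = [(0.5, 0.5), (5.0, 0.0), (0.0, -4.5), (3.9, 3.9)]
flagged = ancestry_outliers(reference, samples)
```

In this toy example the reference mean is 0 and the standard deviation is 1 on both PCs, so only the second and third samples exceed the threshold.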

2.3.2 GWAS meta-analysis: Australian cohort and European ALS GWAS

The GWAS analysis of the Australian cohort was performed using logistic regression (--logit) implemented in PLINK25 version 1.9. Population structure was controlled for by adjusting for the first ten principal components (PLINK 1.9 --covar) calculated using the imputed genotype data of the Australian cohort. The GWAS summary statistics of the Australian cohort were meta-analysed with those of the European ALS GWAS29 using the software METAL30 with default settings. The genes in the regions surrounding significant loci were visualised using LocusZoom31.
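To make the meta-analysis step concrete, the sketch below shows a sample-size-weighted Z-score combination for a single SNP, which is the scheme METAL documents as its default. It is a simplified illustration, not a reimplementation of METAL: allele alignment, strand checks and genomic control are omitted, the function names are hypothetical, and the use of an effective sample size for case-control studies is an optional convention assumed here.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def effective_n(n_cases, n_controls):
    """Effective sample size of a case-control study (optional weighting choice)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def signed_z(p_value, effect_direction):
    """Convert a two-sided P-value and an effect direction into a signed Z-score."""
    z = -_STD_NORMAL.inv_cdf(p_value / 2.0)
    return z if effect_direction >= 0 else -z


def meta_analyse_snp(studies):
    """Combine one SNP across studies.

    studies: list of (p_value, effect_direction, sample_size) tuples, with the
    effect direction given relative to the same reference allele in every study.
    Returns the combined Z-score and its two-sided P-value.
    """
    weighted_sum = 0.0
    weight_sq_sum = 0.0
    for p_value, direction, n in studies:
        weight = sqrt(n)
        weighted_sum += weight * signed_z(p_value, direction)
        weight_sq_sum += weight * weight
    z = weighted_sum / sqrt(weight_sq_sum)
    return z, 2.0 * _STD_NORMAL.cdf(-abs(z))


# Toy example: two equally sized studies with opposite effect directions cancel out.
z_cancel, p_cancel = meta_analyse_snp([(0.05, +1, 1000), (0.05, -1, 1000)])
z_single, p_single = meta_analyse_snp([(0.05, +1, 1000)])
```

Because each study is weighted by the square root of its sample size, a small cohort such as the Australian sample shifts the combined Z-score only modestly relative to the much larger European GWAS.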

2.3.3 Selection of correlated traits

The genetic correlation between ALS and a range of traits was estimated using Linkage Disequilibrium Score Regression (LDSR)32, comparing the European ALS GWAS summary statistics (20,806 ALS cases and 59,804 healthy controls of European ancestry)29 with over 700 traits as implemented in the LDHub platform (v1.9.0)22. Although LDHub results for ALS have been reported previously18, the GWAS data held within LDHub are regularly updated, so we repeated the analysis. Here, we report all genetic correlations significant at p < 0.05 for the test of the null hypothesis rg = 0, for traits with a SNP-based heritability estimate > 10%. We chose a minimum SNP-based heritability of 10% because our interest was to improve out-of-sample prediction, and hence correlated traits need to have a sufficient genetic contribution, noting that the SNP-based heritability of ALS estimated from the results of the latest GWAS29 is only 1.76% (SE = 0.38%), calculated using a lifetime risk of 0.00326.
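Because the ALS heritability above is quoted together with a lifetime risk, it is presumably expressed on the liability scale. The sketch below shows the standard conversion of an observed-scale SNP-based heritability from a case-control sample to the liability scale. It is an illustration under that assumption only: the function name is hypothetical and no input values from this study are implied.

```python
from statistics import NormalDist


def liability_scale_h2(h2_observed, lifetime_risk, case_fraction):
    """Convert observed-scale heritability from a case-control sample to the
    liability scale.

    h2_observed:   heritability estimated on the observed (0/1) scale
    lifetime_risk: population lifetime risk of the disease, K
    case_fraction: proportion of cases in the GWAS sample, P
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - lifetime_risk)  # liability threshold
    density = std_normal.pdf(threshold)                  # normal density at the threshold
    k, p = lifetime_risk, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (density ** 2 * p * (1.0 - p))


# Sanity check with a hypothetical balanced design: for K = P = 0.5 the
# conversion factor reduces to pi / 2.
h2_balanced = liability_scale_h2(0.10, 0.5, 0.5)
```

For a rare disease that is heavily over-sampled in a case-control study, the conversion factor is below one, so the liability-scale estimate is smaller than the observed-scale estimate.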

2.3.4 Polygenic risk scores

We calculated a polygenic risk score (PRS) for all individuals in our Australian ALS case-control sample. The SNPs taken into the PRS calculations were limited to those found in HapMap 3 (HM3)33, as these were common across the summary statistics of all traits analysed. PRSs were calculated using different methods to decide which SNPs were included and their effect sizes, but in each case the PRS is the sum of risk alleles weighted by SNP effect sizes, calculated using the PLINK 1.9 “--score” command. The efficacy of the predictor was measured by the Nagelkerke-R2 of the logistic regression of case-control status on PRS (the R glm function34 for logistic regression and the fmsb package35 for the Nagelkerke-R2 calculation) and by comparison of the odds of being a case in the 10th decile vs the 1st decile of ordered PRS.

In the basic PRS approach, the SNPs were clumped (PLINK --clump), which selects a quasi-independent SNP set by taking the most associated SNP in a genomic region and excluding any SNP with r2 > 0.01 with already selected SNPs. We considered a range of P-value thresholds for selection of SNPs into the PRS (see Supplementary Figure 1), but report results from the PRS using all HapMap3 SNPs, which we call the standard PRS (STD_PRS). Including all SNPs in our prediction model, rather than selecting the P-value threshold based on results from the data, prevents the variance captured by the PRS from being biased by winner’s curse36, allowing a fairer comparison across the methods.

Since the clumping r2 threshold is arbitrary, we also used BLUP (Best Linear Unbiased Prediction) estimates of all SNPs to calculate a PRS_BLUP, an approach that appropriately accounts for linkage disequilibrium (LD) between the SNPs, but assumes SNP effects are normally distributed (which is a valid assumption for highly polygenic traits). Approximate BLUP estimates were derived from GWAS summary statistics using the SBLUP37 method implemented in the GCTA software, with the Health and Retirement Study (HRS) cohort38 used as the reference sample to calculate the LD structure.

We also calculated SNP effects using LDPred-Funct (LDPF)39, a method that includes functional annotation to weight SNP effects. We used the Baseline-LD functional annotation provided by Gazal et al.40 and the HRS cohort as the LD structure reference to calculate LDPF-inf SNP weightings.

Lastly, we calculated SNP effects using Summary-based BayesR (SBayesR)41, a method that models effect sizes using a mixture of normal distributions with different variances. This allows greater flexibility in the underlying model, potentially providing a better reflection of the underlying genetic architecture of ALS. We used a sparse LD matrix built from 10,000 unrelated UK Biobank42 individuals as the LD reference41.
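The two quantities common to all of the PRS methods above, the score itself and the Nagelkerke-R2 used to evaluate it, can be sketched in plain Python as follows. This is a didactic illustration and not the PLINK, R glm or fmsb implementation used in this study: it fits a logistic regression with a single predictor by Newton-Raphson, omits covariates, and assumes the scores are roughly standardised and do not perfectly separate cases from controls. All names are hypothetical.

```python
from math import exp, log


def polygenic_score(allele_counts, effect_sizes):
    """PRS for one individual: effect-allele counts weighted by SNP effect sizes."""
    return sum(count * beta for count, beta in zip(allele_counts, effect_sizes))


def _sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + exp(-eta))
    e = exp(eta)
    return e / (1.0 + e)


def fit_logistic(status, scores, max_iter=50, tol=1e-10):
    """Fit status ~ intercept + score by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for y, x in zip(status, scores):
            p = _sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


def nagelkerke_r2(status, scores):
    """Nagelkerke R2 of the logistic regression of case-control status on PRS."""
    n = len(status)
    n_cases = sum(status)
    p_case = n_cases / n
    ll_null = n_cases * log(p_case) + (n - n_cases) * log(1.0 - p_case)
    b0, b1 = fit_logistic(status, scores)
    ll_full = 0.0
    for y, x in zip(status, scores):
        p = _sigmoid(b0 + b1 * x)
        ll_full += y * log(p) + (1 - y) * log(1.0 - p)
    cox_snell = 1.0 - exp(2.0 * (ll_null - ll_full) / n)
    max_r2 = 1.0 - exp(2.0 * ll_null / n)
    return cox_snell / max_r2


# Toy data (made up): 1 = case, 0 = control.
toy_status = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [-1.2, -0.4, -0.3, 0.1, 0.2, 0.8, 1.0, 1.5]
toy_r2 = nagelkerke_r2(toy_status, toy_scores)
```

Dividing the Cox-Snell R2 by its maximum attainable value is what rescales the statistic to the range 0 to 1, which is why Nagelkerke-R2 values can be compared across predictors evaluated in the same sample.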

Out-of-sample prediction for a trait can be improved by using information from correlated traits19–21, with multiple-trait prediction implemented in the MTAG19 and SMTpred20 software. MTAG and SMTpred use similar methodologies to develop a multi-trait predictor. Here, we use MTAG to combine the basic SNP effects of ALS and correlated traits, generating a single effect size per SNP from which to generate a PRS. We use SMTpred (with the --blup option) to combine, for each individual, the single-trait scores generated by SBLUP (PRS_BLUP), LDPred-Funct (PRS_LDPF), and SBayesR (PRS_SBayesR), using the estimated genetic correlation and SNP-based heritability of each single trait.

2.4 Results

2.4.1 GWAS meta-analysis: Australian cohort and European ALS GWAS

The GWAS meta-analysis of the Australian ALS cohort and the European ALS GWAS identified an additional significant locus underlying ALS risk on chromosome 5 (nearest gene ERGIC1; P-value = 3.4 × 10^-8) (Figure 1 and Figure 2). ERGIC1 is annotated as an intermediate or transport protein that links the endoplasmic reticulum and the Golgi apparatus43. A second locus on chromosome 14 (nearest gene SCFD1; P-value = 6.4 × 10^-8) narrowly missed the genome-wide significance threshold (Figure 1, Figure 2), but is noted here because it has a similar functional annotation to ERGIC1: SCFD1 is also involved in protein metabolism, transporting proteins to the Golgi apparatus43. The other gene in this chromosome 14 locus is G2E3 (Figure 2). G2E3 is likewise a protein found in the Golgi apparatus; however, its main function relates to ubiquitin transfer43.

Figure 1. Manhattan plot of the European ALS GWAS (left) compared with the ALS GWAS meta-analysis of the European and Australian cohorts (right). Labelled loci are C9orf72, UNC13A, C21orf2, TBK1 and GPX3/TNIP1. In the meta-analysis there was a signal on chromosome 5 that passed the significance threshold (ERGIC1) and a marginally significant locus on chromosome 14 (SCFD1), both marked in red text.


Figure 2. The new ALS loci from the ALS meta-analysis of the European and Australian cohorts. A new significant locus on chromosome 5 lies close to the ERGIC1 gene (upper). There is a marginally significant locus on chromosome 14 (lower). Based on the LD pattern, this chromosome 14 locus contains two genes (SCFD1 and G2E3).

2.4.2 Selection of correlated traits

A genetic correlation analysis between ALS and more than 700 traits available in the LD Hub ver1.90 platform identified 85 traits that were nominally significantly correlated with ALS at P-value < 0.05 (Supplementary Table 1). After applying additional filters of SNP-based heritability estimate > 10% and a 5% False Discovery Rate (FDR) (nominal P-value < 5.6 × 10^-3), three traits remained, all related to cognition (Table 1). As larger GWAS sample sizes are available for both cognitive performance (CP) and educational attainment (EA)44, and the results from these studies show a significant rg with ALS of -0.28 and -0.24 respectively (Table 1), we took these GWAS summary statistics forward to improve statistical power in our prediction analysis. While the genetic correlation between ALS and schizophrenia45 was not significant after correction for multiple testing in our LDhub analysis (rg = 0.14, p = 1.2 × 10^-2), the genetic correlation estimated with the latest schizophrenia GWAS46 results was significant (Table 1). This observation, combined with a previous report of a genetic correlation between ALS and SCZ18, led us to take SCZ through to our prediction analysis.
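The trait selection filters described above can be sketched as a heritability filter followed by a Benjamini-Hochberg step-up procedure. The order of the two filters, and the use of Benjamini-Hochberg specifically, are assumptions made for illustration, since the text states only that a 5% FDR was applied; the record layout and function names are likewise hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected at the given FDR
    using the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


def select_traits(results, min_h2=0.10, fdr=0.05):
    """Keep traits with SNP-based heritability above min_h2 whose genetic
    correlation with the target disease passes the FDR threshold.

    results: list of dicts with keys 'trait', 'h2' and 'p'.
    Assumption: the heritability filter is applied before the FDR step.
    """
    heritable = [r for r in results if r["h2"] > min_h2]
    kept = benjamini_hochberg([r["p"] for r in heritable], fdr)
    return [heritable[i]["trait"] for i in kept]


# Toy records (made up, not results from this study).
toy_results = [
    {"trait": "trait_A", "h2": 0.20, "p": 0.001},
    {"trait": "trait_B", "h2": 0.05, "p": 0.0001},  # fails the heritability filter
    {"trait": "trait_C", "h2": 0.15, "p": 0.04},
    {"trait": "trait_D", "h2": 0.30, "p": 0.03},
    {"trait": "trait_E", "h2": 0.12, "p": 0.20},
]
selected = select_traits(toy_results)
```

With the toy records, only trait_A survives: trait_B is removed by the heritability filter despite its small P-value, and the remaining traits do not pass the step-up thresholds.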

Table 1. Genetic correlations between ALS and traits used in prediction analysis

Traits                      rG      SE     SNP-h2   P-value    N (Case + Control)
Cognitive Performance44     -0.28   0.06   0.20     1.11E-06   257,828
Educational Attainment44    -0.24   0.05   0.11     1.10E-06   766,345
Schizophrenia46              0.15   0.05   0.42     2.6E-03    105,318
SNP-h2: SNP-based heritability

2.4.3 Polygenic Risk Scores

Results from the single-trait PRS prediction into the Australian sample of 836 cases and 665 controls for each of the four selected traits (ALS, CP, EA, and SCZ) are summarised in Figure 3 and Supplementary Table 2. As expected, the ALS discovery sample (HM3 SNP set) gives the best single-trait prediction performance, with a Nagelkerke-R2 (NKR2) of 0.010 for standard PRS, 0.010 for SBLUP, 0.011 for LDPF, and 0.022 for SBayesR. The CP PRS explained a significant (P-value < 0.05) proportion of variance for all four methods, while the association between the EA predictor and ALS case-control status was nominally significant only for the LDPF method. Their regression coefficients had the expected negative sign, providing independent confirmation that ALS is negatively genetically correlated with these traits. The predictors calculated from SCZ GWAS statistics were also significantly associated with ALS case-control status, and the sign of the regression coefficient was positive, as expected. In each single-trait prediction, the highest variance explained was always from the SBLUP, LDPF, or SBayesR methods.

[Figure 3 plot: Nagelkerke R-sq (y-axis, 0.00 to 0.03) for each predictor (x-axis: ALS, CP, EA, SCZ), by predictor method (STD_PRS, SBLUP, LDPF, SBayesR). Significance markers: * P-value < 0.05, ** P-value < 0.01, *** P-value < 0.001.]

Figure 3. Prediction accuracy of single-trait predictors of ALS in the Australian cohort. Predictors constructed using GWAS summary statistics of CP, EA, and SCZ had small but significant predictive ability for ALS case-control status.

Combining these traits into multi-trait predictors of ALS generated higher NKR2 than the single-trait predictors (Figure 4 and Supplementary Table 3). A direct comparison of single-trait and multi-trait predictors is given by the PRS vs MTAG results, neither of which considers the LD structure between SNPs, and by the SBLUP vs SBLUP-SMTPRED results, which do account for LD structure. The LDPF and LDPF-SMTPRED results incorporate functional annotations into the SNP weights. The SBayesR and SBayesR-SMTPRED results demonstrate the utility of a flexible distribution of effect sizes (modelled as a mixture of normal distributions) rather than the single underlying normal distribution used by SBLUP. In all cases, the NKR2 of the multi-trait predictors was higher than that of the predictors using ALS alone.
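The contrast between a single normal distribution of SNP effects and a mixture of normal distributions can be illustrated by simulation. This is a sketch only: the mixture proportions and variances below are hypothetical and are not the values estimated or assumed in this analysis, and the function is not part of SBLUP or SBayesR.

```python
import random


def sample_effects(n_snps, proportions, variances, seed=1):
    """Draw SNP effects from a finite mixture of zero-mean normals.

    A variance of 0 denotes a point mass at zero (the SNP has no effect).
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_snps):
        # Pick the mixture component for this SNP, then draw its effect.
        var = rng.choices(variances, weights=proportions, k=1)[0]
        effects.append(rng.gauss(0.0, var ** 0.5) if var > 0 else 0.0)
    return effects


# Single normal: every SNP has a small non-zero effect.
single = sample_effects(10_000, [1.0], [1e-4])

# Mixture: most SNPs have zero effect, a few have larger effects.
mixture = sample_effects(
    10_000, [0.95, 0.03, 0.015, 0.005], [0.0, 1e-5, 1e-4, 1e-3]
)

zero_fraction = sum(b == 0.0 for b in mixture) / len(mixture)
print(f"fraction of zero effects in mixture: {zero_fraction:.3f}")
print(f"largest |effect|, single normal: {max(abs(b) for b in single):.4f}")
print(f"largest |effect|, mixture: {max(abs(b) for b in mixture):.4f}")
```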

Combining all correlated traits (ALS, CP, EA, and SCZ) gave the best predictor (NKR2 of 0.027) with the SBayesR method. For the best predictor, the odds ratio for those in the top 10% of estimated risk compared to those in the bottom 10% was 3.15 (95% CI 1.96 - 5.05).
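The decile odds ratio and its confidence interval can be illustrated from a 2x2 table of cases and controls in the top and bottom risk deciles. The sketch below assumes a Wald (Woolf) interval on the log odds ratio, which may differ from the procedure used to obtain the interval reported above; the counts are hypothetical and the function name is illustrative.

```python
import math


def odds_ratio_ci(cases_top, controls_top, cases_bottom, controls_bottom, z=1.96):
    """Odds ratio (top vs bottom decile) with a Wald confidence interval."""
    odds_ratio = (cases_top * controls_bottom) / (controls_top * cases_bottom)
    # Standard error of log(OR) from the four cell counts (Woolf's method).
    se_log = math.sqrt(
        1 / cases_top + 1 / controls_top + 1 / cases_bottom + 1 / controls_bottom
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts for two deciles of about 150 individuals each.
or_, lo, hi = odds_ratio_ci(
    cases_top=105, controls_top=45, cases_bottom=64, controls_bottom=86
)
print(f"decile OR = {or_:.2f} (95% CI {lo:.2f} - {hi:.2f})")
```

With only about 150 individuals per decile, the interval is wide even when the point estimate is well above 1, which is consistent with the broad intervals in Supplementary Tables 2 and 3.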

[Figure 4 plot: Nagelkerke R-sq (y-axis, 0.00 to 0.03) for each predictor (x-axis: ALS, ALS+CP, ALS+EA, ALS+SCZ, ALL), by predictor method (STD_PRS, SBLUP, LDPF, SBayesR, MTAG, SBLUP-SMTPRED_BLUP, LDPF-SMTPRED_BLUP, SBayesR-SMTPRED_BLUP). Significance markers: * P-value < 0.05, ** P-value < 0.01, *** P-value < 0.001.]

Figure 4. Prediction accuracy of multi-trait predictors compared to the ALS-only predictor. Predictors constructed using combined SNP effects (multi-trait) of ALS and correlated traits improved predictive ability for ALS case-control status.

2.5 Discussion

The meta-analysis of the Australian ALS cohort and the 2018 European ALS GWAS provided new insights into ALS. The newly significant locus (ERGIC1) and the marginally significant locus (SCFD1) in this meta-analysis are both involved in protein metabolism, especially in transport mechanisms involving the endoplasmic reticulum (ER) and the Golgi apparatus. This might support the existing hypothesis that ALS pathology involves ER stress3, and may reflect interaction with other ALS-associated genes involved in ER stress mechanisms, such as UBQLN2, SQSTM1, CHMP2B and VAPB47.

Given that ALS is a complex disease, understanding its genetic relationship with other traits provides some insight into this complexity. Analyses using summary statistics from GWAS allow the study of the genetic relationship between traits using independently collected samples. We found that ALS had a significant negative genetic correlation (after FDR correction for multiple testing) with cognition-related traits such as fluid intelligence, years of schooling, and university/college qualification measured in large cohorts sampled from the general population, for example, rg = -0.24 with educational attainment (Table 1). This observation supports the earlier US and Italian cohort studies10,11. It is notable that the rg with schizophrenia is positive (0.15) (Table 1), and that a negative rg is also found between schizophrenia and educational attainment (-0.17)48.

We also found that some physical activity traits, such as walking to work (SNP-based h2 = 2.2%), measured in the UK, have a significant negative genetic correlation with ALS (Supplementary Table 1), but other measures of exercise (including the duration and frequency of walking and vigorous or moderate exercise) did not show a significant genetic correlation with ALS (p > 0.05). Many studies provide support for an association between high levels of physical activity and increased risk of ALS. For example, an increased risk of ALS is reported for professional soccer and football players, and Gulf War veterans49–51. A comprehensive epidemiological study involving five different populations in the Netherlands, Ireland, and Italy found that vigorous and moderate activity are linearly correlated with the risk of ALS52. Previous epidemiological studies from Japan and the UK also found a similar correlation using smaller sample sizes53,54. These traits were not taken forward into our prediction analysis because the physical activity traits show low SNP-based heritability (< 10%), and so would not be expected to improve out-of-sample prediction. Similarly, we also found that smoking-related traits (including smoking status, age of smoking initiation, and exposure to smoking at home) have significant positive correlations with ALS, but these were not included due to low SNP-based heritability estimates (h2SNP = 1.0%-8.0%), and hence are less likely to help in prediction. However, the significant correlation between ALS and smoking is interesting, since smoking is inconsistently reported as an ALS risk factor in many studies55–59. In addition, the estimates for genetic correlation between these traits and ALS should be treated with caution, given that ALS also has a small SNP-based heritability estimate60.

Our goal was to provide independent evidence of the genetic correlation of ALS with schizophrenia and cognition-related traits, through out-of-sample prediction into an independent Australian ALS cohort. We show that there is significant out-of-sample prediction for ALS when using PRS built from EA, SCZ and especially CP SNP effect estimates, with the sign of the regression coefficients matching the sign of the rg estimates. These results provide independent validation of the genetic relationship between ALS, CP, EA and SCZ. As expected, out-of-sample prediction is maximised by combining all the traits to make a multi-trait predictor. We compared methods for generating PRS, and found that in this context SBayesR gave the highest out-of-sample prediction accuracy. Since the best methodology for PRS depends on the genetic architecture of the trait63, this conclusion may not hold in other disease applications. The SBayesR decile OR (odds ratio between the highest 10% and the lowest 10% of PRS) is 2.32 for single-trait prediction and 3.15 for the predictor combining all traits, which is a considerable increase. However, this decile OR is relatively small compared to other diseases such as Parkinson's disease (3.74-6.25)61, schizophrenia (7.7-15.0)45, and type 2 diabetes (4.52)62, with OR ranges reflecting estimates across different cohorts. While the out-of-sample prediction was found to be highly significant (smallest P-value = 4.8 × 10⁻⁸), the variance explained by the predictor was still small (maximum NKR2 = 0.027, maximum AUC = 0.580).

This study had several limitations.
We used a single cohort to test the out-of-sample prediction. Application of PRS prediction in other disorders, such as schizophrenia45 and major depressive disorder64, has found variability in results between cohorts. Hence, additional European-ancestry ALS cohorts would be useful to confirm this observation. The Australian cohort control participants did not have information on educational attainment or premorbid IQ, which means that case-control bias in education and cognitive effects cannot be evaluated. However, this bias might be minimised because the recruitment protocol for controls included spouses and close friends of the cases, which likely implies some degree of matching on these factors. In addition, the estimated SNP-based heritability of ALS using common SNPs (HM3) from the latest published GWAS29 was very low (1.76%, SE = 0.38%, assuming a lifetime risk of 0.3%), smaller than that from the previously published GWAS (h2SNP = 8%, SE = 0.52%)15. This low SNP-based heritability may reflect the genetic architecture of ALS, and previous analyses have suggested that low-frequency variants may be relatively more important in ALS than in other common diseases15.

In conclusion, we found that ALS had a significant negative genetic correlation with cognitive performance and educational attainment. These correlations were supported by the significant prediction of ALS when using the GWAS summary statistics for both traits, and prediction accuracy for ALS improved when these traits were included in a multi-trait predictor. However, there is still limited clinical utility in these ALS predictors due to the relatively small proportion of risk they capture. Larger GWAS for ALS are needed in order to provide a stronger baseline from which multi-trait predictors can be built. The GWAS meta-analysis of the European and Australian ALS cohorts found two new significant loci (ERGIC1 and SCFD1) that support the ER stress mechanism in ALS pathology.

2.6 Acknowledgements

We kindly thank all those who contributed to this research including the participants for providing a blood sample and clinical data, the research nurses and support staff for participant recruitment across clinic sites and the laboratory researchers for their care in generating the DNA data. We acknowledge funding from the National Health and Medical Research Council (NHMRC) (1078901, 1083187, 1113400, 1121962, 1405325, 1084417, and 1079583), a NHMRC/Australian Research Council Strategic Award (401162), the Motor Neurone Disease Research Institute Australia (MNDRIA) Ice Bucket Challenge Grant and the MNDRIA Bill Gole Postdoctoral Fellowship (FCG). The twin study (OATS) was facilitated through Twins Research Australia, a national resource in part supported by a NHMRC Centre for Research Excellence.

2.7 References

1. Brown, R. H. & Al-Chalabi, A. Amyotrophic Lateral Sclerosis. N. Engl. J. Med. 377, 162–172 (2017).
2. van Es, M. A. et al. Amyotrophic lateral sclerosis. Lancet (2017) doi:10.1016/S0140-6736(17)31287-4.
3. Taylor, J. P., Brown Jr, R. H. & Cleveland, D. W. Decoding ALS: from genes to mechanism. Nature 539, 197–206 (2016).
4. Byrne, S. et al. Rate of familial amyotrophic lateral sclerosis: a systematic review and meta-analysis. J. Neurol. Neurosurg. Psychiatr. 82, 623–627 (2011).
5. Longinetti, E. et al. Physical and cognitive fitness in young adulthood and risk of amyotrophic lateral sclerosis at an early age. European Journal of Neurology 24, 137–142 (2017).
6. Rippon, G. A. et al. An observational study of cognitive impairment in amyotrophic lateral sclerosis. Arch. Neurol. 63, 345–352 (2006).
7. Irwin, D., Lippa, C. F. & Swearer, J. M. Cognition and amyotrophic lateral sclerosis (ALS). Am J Alzheimers Dis Other Demen 22, 300–312 (2007).
8. Raaphorst, J., de Visser, M., Linssen, W. H. J. P., de Haan, R. J. & Schmand, B. The cognitive profile of amyotrophic lateral sclerosis: A meta-analysis. Amyotroph Lateral Scler 11, 27–37 (2010).
9. Ringholz, G. M. et al. Prevalence and patterns of cognitive impairment in sporadic ALS. Neurology 65, 586–590 (2005).
10. Yu, Y. et al. Environmental Risk Factors and Amyotrophic Lateral Sclerosis (ALS): A Case-Control Study of ALS in Michigan. PLOS ONE 9, e101186 (2014).
11. Montuschi, A. et al. Cognitive correlates in amyotrophic lateral sclerosis: a population-based study in Italy. J Neurol Neurosurg Psychiatry 86, 168–173 (2015).
12. Deary, I. J., Spinath, F. M. & Bates, T. C. Genetics of intelligence. European Journal of Human Genetics 14, 690–700 (2006).
13. Krapohl, E. et al. The high heritability of educational achievement reflects many genetically influenced traits, not just intelligence. PNAS 111, 15273–15278 (2014).
14. Wingo, T. S., Cutler, D. J., Yarab, N., Kelly, C. M. & Glass, J. D. The Heritability of Amyotrophic Lateral Sclerosis in a Clinically Ascertained United States Research Registry. PLoS One 6, (2011).
15. van Rheenen, W. et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nat Genet 48, 1043–1048 (2016).
16. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nature Genetics 47, 1236–1241 (2015).
17. Bandres-Ciga, S. et al. Shared polygenic risk and causal inferences in amyotrophic lateral sclerosis. Ann. Neurol. 85, 470–481 (2019).
18. McLaughlin, R. L. et al. Genetic correlation between amyotrophic lateral sclerosis and schizophrenia. Nat Commun 8, (2017).
19. Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet 50, 229–237 (2018).
20. Maier, R. M. et al. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nature Communications 9, 989 (2018).
21. Maier, R. et al. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am. J. Hum. Genet. 96, 283–294 (2015).
22. Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272–279 (2017).
23. Brooks, B. R., Miller, R. G., Swash, M., Munsat, T. L., & World Federation of Neurology Research Group on Motor Neuron Diseases. El Escorial revisited: revised criteria for the diagnosis of amyotrophic lateral sclerosis. Amyotroph. Lateral Scler. Other Motor Neuron Disord. 1, 293–299 (2000).
24. Sachdev, P. S. et al. A comprehensive neuropsychiatric study of elderly twins: the Older Australian Twins Study. Twin Res Hum Genet 12, 573–582 (2009).
25. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 1–16 (2015).
26. The Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet 48, 1279–1283 (2016).
27. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
28. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A Tool for Genome-wide Complex Trait Analysis. Am J Hum Genet 88, 76–82 (2011).
29. Nicolas, A. et al. Genome-wide Analyses Identify KIF5A as a Novel ALS Gene. Neuron 97, 1268-1283.e6 (2018).
30. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
31. Pruim, R. J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337 (2010).
32. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
33. The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
34. R: The R Project for Statistical Computing. https://www.r-project.org/.
35. Nakazawa, M. fmsb: Functions for Medical Statistics Book with some Demographic Data. (2019).
36. Balding, D. J., Moltke, I. & Marioni, J. Handbook of Statistical Genomics. (John Wiley & Sons, 2019).
37. Robinson, M. R. et al. Genetic evidence of assortative mating in humans. Nature Human Behaviour 1, 0016 (2017).
38. Sonnega, A. et al. Cohort Profile: the Health and Retirement Study (HRS). Int J Epidemiol 43, 576–585 (2014).
39. Marquez-Luna, C. et al. Modeling functional enrichment improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. bioRxiv 375337 (2018) doi:10.1101/375337.
40. Gazal, S. et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nature Genetics 49, 1421–1427 (2017).
41. Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun 10, 1–11 (2019).
42. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
43. Stelzer, G. et al. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Curr Protoc Bioinformatics 54, 1.30.1-1.30.33 (2016).
44. Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics 50, 1112 (2018).
45. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
46. Pardiñas, A. F. et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat Genet 50, 381–389 (2018).
47. Mejzini, R. et al. ALS Genetics, Mechanisms, and Therapeutics: Where Are We Now? Front Neurosci 13, (2019).
48. Trampush, J. W. et al. GWAS meta-analysis reveals novel loci and genetic correlates for general cognitive function: a report from the COGENT consortium. Molecular Psychiatry 22, 336–345 (2017).
49. Chio, A. et al. ALS in Italian professional soccer players: the risk is still present and could be soccer-specific. Amyotroph Lateral Scler 10, 205–209 (2009).
50. Lehman, E. J., Hein, M. J., Baron, S. L. & Gersic, C. M. Neurodegenerative causes of death among retired National Football League players. Neurology 79, 1970–1974 (2012).
51. Weisskopf, M. G. et al. Prospective study of military service and mortality from ALS. Neurology 64, 32–37 (2005).
52. Visser, A. E. et al. Multicentre, cross-cultural, population-based, case–control study of physical activity as risk factor for amyotrophic lateral sclerosis. J Neurol Neurosurg Psychiatry jnnp-2017-317724 (2018) doi:10.1136/jnnp-2017-317724.
53. Okamoto, K. et al. Lifestyle Factors and Risk of Amyotrophic Lateral Sclerosis: A Case-Control Study in Japan. Annals of Epidemiology 19, 359–364 (2009).
54. Harwood, C. A. et al. Long-term physical activity: an exogenous risk factor for sporadic amyotrophic lateral sclerosis? Amyotroph Lateral Scler Frontotemporal Degener 17, 377–384 (2016).
55. Armon, C. Smoking may be considered an established risk factor for sporadic ALS. Neurology 73, 1693–1698 (2009).
56. Wang, H. et al. Smoking and risk of amyotrophic lateral sclerosis: a pooled analysis of five prospective cohorts. Arch Neurol 68, 207–213 (2011).
57. Calvo, A. et al. Influence of cigarette smoking on ALS outcome: a population-based study. J. Neurol. Neurosurg. Psychiatr. 87, 1229–1233 (2016).
58. Opie-Martin, S. et al. UK case control study of smoking and risk of amyotrophic lateral sclerosis. Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration 21, 222–227 (2020).
59. Opie-Martin, S. et al. Relationship between smoking and ALS: Mendelian randomisation interrogation of causality. J Neurol Neurosurg Psychiatry 91, 1312–1315 (2020).
60. van Rheenen, W., Peyrot, W. J., Schork, A. J., Lee, S. H. & Wray, N. R. Genetic correlations of polygenic disease traits: from theory to practice. Nat Rev Genet 20, 567–581 (2019).
61. Nalls, M. A. et al. Identification of novel risk loci, causal insights, and heritable risk for Parkinson’s disease: a meta-analysis of genome-wide association studies. The Lancet Neurology 18, 1091–1102 (2019).
62. Udler, M. S., McCarthy, M. I., Florez, J. C. & Mahajan, A. Genetic Risk Scores for Diabetes Diagnosis and Precision Medicine. Endocrine Reviews 40, 1500–1520 (2019).
63. Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet 45, 400–405 (2013).
64. Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).

2.8 Supplementary materials

Supplementary Figure 1. The effects of different SNP selection thresholds and SNP set choices on ALS prediction accuracy. ALS PRS using the European ALS GWAS summary statistics were built using either all HRC imputed SNPs (5.6 million, “Full”, red) or HapMap 3 SNPs (1.2 million, “HM3”, blue). A SNP significance threshold of 5 × 10⁻⁶ provided the most accurate out-of-sample prediction, but selecting this threshold based on the accuracy in our test cohort would introduce bias into our baseline prediction model. We therefore took forward the PRS using all HM3 SNPs, without exclusion based on a significance threshold (see also Supplementary Table 4).

Supplementary Table 1. Significant ALS genetic correlations (P-value < 0.05) from the LDHub platform ver 1.9. Traits passing the FDR correction and with SNP-based heritability > 0.1 are highlighted.

Trait | rg | z | P-value | h2_obs
Amyotrophic lateral sclerosis (2016) | 0.75 | 10.45 | 1.4E-25 | 0.05
Fluid intelligence score | -0.34 | -5.14 | 2.7E-07 | 0.24
Types of transport used (excluding work): Walk | -0.40 | -4.82 | 1.5E-06 | 0.03
Qualifications: College or University degree | -0.25 | -4.76 | 2.0E-06 | 0.17
Qualifications: A levels/AS levels or equivalent | -0.26 | -4.41 | 1.1E-05 | 0.10
Types of transport used (excluding work): Public transport | -0.40 | -4.39 | 1.1E-05 | 0.02
Qualifications: None of the above | 0.26 | 4.39 | 1.1E-05 | 0.10
Light smokers_ at least 100 smokes in lifetime | 0.43 | 4.23 | 2.3E-05 | 0.08
Physical activity in last 4 weeks: Light DIY | -0.29 | -4.18 | 2.9E-05 | 0.04
Years of schooling 2016 | -0.23 | -3.82 | 1.0E-04 | 0.13
Number of incorrect matches in round | 0.23 | 3.88 | 1.0E-04 | 0.05
Qualifications: Other professional qualifications | -0.26 | -3.74 | 2.0E-04 | 0.05
Qualifications: O levels/GCSEs or equivalent | -0.24 | -3.52 | 4.0E-04 | 0.05
Physical activity in last 4 weeks: Walking for pleasure | -0.28 | -3.52 | 4.0E-04 | 0.04
Job involves mainly walking or standing | 0.22 | 3.41 | 6.0E-04 | 0.08
Age completed full time education | -0.23 | -3.42 | 6.0E-04 | 0.08
Exposure to tobacco smoke at home | 0.42 | 3.37 | 7.0E-04 | 0.01
Frequency of tenseness / restlessness in last 2 weeks | 0.23 | 3.34 | 8.0E-04 | 0.04
Fathers age at death | -0.47 | -3.30 | 1.0E-03 | 0.04
Duration of moderate activity | 0.28 | 3.29 | 1.0E-03 | 0.03
Age of first birth | -0.23 | -3.19 | 1.4E-03 | 0.06
Duration of vigorous activity | 0.29 | 3.16 | 1.6E-03 | 0.04
Age at first live birth | -0.22 | -3.15 | 1.6E-03 | 0.17
Diagnoses - main ICD10: I21 Acute myocardial infarction | -0.38 | -3.03 | 2.5E-03 | 0.01
Transport type for commuting to job workplace: Cycle | -0.29 | -3.02 | 2.6E-03 | 0.03
Intelligence | -0.22 | -2.97 | 2.9E-03 | 0.19
Overweight | 0.22 | 2.94 | 3.3E-03 | 0.11
Home area population density - urban or rural | -0.46 | -2.82 | 4.9E-03 | 0.01
Time spent using computer | -0.16 | -2.79 | 5.2E-03 | 0.10
Diagnoses - main ICD10: M54 Dorsalgia | 0.35 | 2.77 | 5.6E-03 | 0.01
Pain type(s) experienced in last month: None of the above | -0.16 | -2.74 | 6.1E-03 | 0.06
Average weekly champagne plus white wine intake | -0.22 | -2.74 | 6.2E-03 | 0.03
College completion | -0.27 | -2.73 | 6.3E-03 | 0.08
Time spent watching television (TV) | 0.16 | 2.69 | 7.1E-03 | 0.10
Illnesses of father: Chronic bronchitis/emphysema | 0.28 | 2.64 | 8.3E-03 | 0.01
Potassium in urine | -0.18 | -2.59 | 9.7E-03 | 0.04
Prospective memory result | 0.25 | 2.58 | 9.9E-03 | 0.06
Overall health rating | 0.13 | 2.53 | 1.1E-02 | 0.10
Alcohol usually taken with meals | -0.17 | -2.53 | 1.2E-02 | 0.11
Schizophrenia | 0.14 | 2.50 | 1.3E-02 | 0.46
Qualifications: CSEs or equivalent | 0.22 | 2.50 | 1.3E-02 | 0.02
Years of schooling 2013 | -0.21 | -2.45 | 1.4E-02 | 0.08
Smoking/smokers in household | 0.33 | 2.42 | 1.6E-02 | 0.01
Maternal smoking around birth | 0.17 | 2.42 | 1.6E-02 | 0.05
Seen a psychiatrist for anxiety_ tension or depression | 0.18 | 2.40 | 1.6E-02 | 0.03
Smoking status: Previous | 0.16 | 2.36 | 1.8E-02 | 0.05
Age started oral contraceptive pill | -0.23 | -2.35 | 1.9E-02 | 0.05
Frequency of unenthusiasm / disinterest in last 2 weeks | 0.16 | 2.33 | 2.0E-02 | 0.04
Years of schooling (proxy cognitive performance) | -0.19 | -2.30 | 2.2E-02 | 0.11
Non-cancer illness code_ self-reported: enlarged prostate | 0.36 | 2.29 | 2.2E-02 | 0.01
Smoking status: Current | 0.17 | 2.27 | 2.4E-02 | 0.05
Financial situation satisfaction | 0.22 | 2.23 | 2.6E-02 | 0.06
Duration of other exercises | 0.30 | 2.23 | 2.6E-02 | 0.02
Types of physical activity in last 4 weeks: Heavy DIY | -0.16 | -2.22 | 2.6E-02 | 0.03
Employment status: Looking after home and/or family | -0.34 | -2.20 | 2.8E-02 | 0.01
Mean Pallidum | -0.34 | -2.19 | 2.9E-02 | 0.16
Diagnoses - main ICD10: R10 Abdominal and pelvic pain | 0.24 | 2.19 | 2.9E-02 | 0.01
Fathers age at death | -0.22 | -2.15 | 3.1E-02 | 0.03
Non-cancer illness code : pernicious anaemia | -0.44 | -2.12 | 3.4E-02 | 0.00
Leg fat percentage (left) | 0.10 | 2.10 | 3.5E-02 | 0.21
Employment status: Doing unpaid or voluntary work | -0.30 | -2.10 | 3.6E-02 | 0.01
Heel bone mineral density (BMD) | 0.15 | 2.08 | 3.7E-02 | 0.30
Age of smoking initiation | 0.41 | 2.08 | 3.7E-02 | 0.06
Frequency of depressed mood in last 2 weeks | 0.15 | 2.08 | 3.7E-02 | 0.04
Physical activity in last 4 weeks: None of the above | 0.18 | 2.07 | 3.8E-02 | 0.02
Past tobacco smoking | -0.13 | -2.07 | 3.8E-02 | 0.09
Hearing difficulty/problems with background noise | -0.15 | -2.07 | 3.8E-02 | 0.05
Lung cancer | 0.22 | 2.07 | 3.8E-02 | 0.31
Job involves heavy manual or physical work | 0.13 | 2.06 | 4.0E-02 | 0.09
Alcohol drinker status: Never | -0.21 | -2.06 | 4.0E-02 | 0.01
Loud music exposure frequency | 0.27 | 2.03 | 4.3E-02 | 0.03
Current tobacco smoking | 0.15 | 2.02 | 4.4E-02 | 0.06
Parents age at death | -0.34 | -2.01 | 4.5E-02 | 0.03
Transport type for commuting to job workplace: Walk | -0.24 | -2.00 | 4.5E-02 | 0.02
Alcohol intake versus 10 years previously | 0.16 | 2.00 | 4.6E-02 | 0.03
Illnesses of mother: Chronic bronchitis/emphysema | 0.21 | 1.99 | 4.6E-02 | 0.01
Medication for pain relief_ constipation_ heartburn | 0.21 | 1.99 | 4.6E-02 | 0.01
Falls in the last year | 0.16 | 1.99 | 4.7E-02 | 0.03
Pain experienced in last month: Neck or shoulder pain | 0.15 | 1.98 | 4.7E-02 | 0.03
Body mass index | 0.13 | 1.98 | 4.8E-02 | 0.19
Reason for glasses/contact lenses: For short-sightedness | -0.16 | -1.98 | 4.8E-02 | 0.03
Tobacco smoking: Ex-smoker | 0.22 | 1.98 | 4.8E-02 | 0.08
Diagnoses - main ICD10: K29 Gastritis and duodenitis | 0.28 | 1.97 | 4.9E-02 | 0.01
Number of full brothers | 0.21 | 1.97 | 4.9E-02 | 0.02
Illness_ injury_ bereavement_ stress in last 2 years | 0.15 | 1.97 | 4.9E-02 | 0.03
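The FDR correction mentioned in the caption of Supplementary Table 1 can be sketched as follows, assuming the Benjamini-Hochberg step-up procedure (the exact procedure applied in the analysis is not restated here, so this choice is an assumption). Note that the correction must be applied to the P-values of all traits tested, not only to the nominally significant subset listed in the table; the P-values in the example are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the test is rejected at FDR alpha."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical P-values for ten traits tested against ALS.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
for p, keep in zip(example_p, benjamini_hochberg(example_p)):
    print(f"P = {p:<6} passes FDR 5%: {keep}")
```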

Supplementary Table 2. Single-trait prediction results into the Australian cohort of 836 ALS cases and 665 controls.

Predictor | Method | NKR2 | AUC | Decile OR (CI 95%) | P-value
ALS | STD_PRS | 0.010 | 0.544 | 2.18 (1.37 - 3.47) | 1.1E-03
ALS | SBLUP | 0.010 | 0.546 | 2.18 (1.37 - 3.46) | 7.8E-04
ALS | LDPF | 0.011 | 0.547 | 2.29 (1.44 - 3.65) | 6.0E-04
ALS | SBayesR | 0.022 | 0.572 | 2.32 (1.46 - 3.71) | 8.9E-07
CP | STD_PRS | 0.005 | 0.536 | 0.59 (0.38 - 0.94) | 1.8E-02
CP | SBLUP | 0.007 | 0.539 | 0.57 (0.36 - 0.91) | 6.5E-03
CP | LDPF | 0.006 | 0.538 | 0.63 (0.40 - 0.99) | 8.4E-03
CP | SBayesR | 0.005 | 0.535 | 0.56 (0.36 - 0.89) | 1.7E-02
EA | STD_PRS | 0.003 | 0.529 | 0.66 (0.42 - 1.04) | 5.3E-02
EA | SBLUP | 0.002 | 0.522 | 0.68 (0.43 - 1.07) | 9.7E-02
EA | LDPF | 0.004 | 0.527 | 0.63 (0.40 - 0.99) | 4.6E-02
EA | SBayesR | 0.002 | 0.519 | 0.78 (0.49 - 1.22) | 1.7E-01
SCZ | STD_PRS | 0.003 | 0.527 | 1.34 (0.85 - 2.14) | 5.5E-02
SCZ | SBLUP | 0.005 | 0.536 | 1.33 (0.84 - 2.09) | 1.5E-02
SCZ | LDPF | 0.006 | 0.538 | 1.45 (0.92 - 2.30) | 1.2E-02
SCZ | SBayesR | 0.006 | 0.542 | 1.36 (0.87 - 2.15) | 8.5E-03

*Odds ratio comparing individuals in the highest risk decile to the lowest.
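The AUC column can be read as the probability that a randomly chosen case has a higher polygenic score than a randomly chosen control. A minimal sketch of that definition is given below, using made-up scores; it is a direct pairwise count (equivalent to the Mann-Whitney statistic scaled to the 0 to 1 range) and not necessarily the routine used to produce the tabulated values.

```python
def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical standardised polygenic scores.
cases = [0.4, 1.1, -0.2, 0.9, 0.0]
controls = [-0.5, 0.3, -1.0, 0.1, 0.6]
print(f"AUC = {auc(cases, controls):.2f}")
```

An AUC of 0.5 corresponds to no discrimination between cases and controls, so values such as 0.52 to 0.58 indicate weak separation even when the association itself is statistically significant.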

Supplementary Table 3. Prediction results for the ALS single-trait predictors and for the multi-trait predictors combining ALS with correlated traits, in the Australian cohort of 836 ALS cases and 665 controls.

Predictor | Method | NKR2 | AUC | Decile OR (CI 95%) | P-value
ALS | STD_PRS | 0.010 | 0.544 | 2.18 (1.37 - 3.47) | 1.1E-03
ALS | SBLUP | 0.010 | 0.546 | 2.18 (1.37 - 3.46) | 7.8E-04
ALS | LDPF | 0.011 | 0.547 | 2.29 (1.44 - 3.65) | 6.0E-04
ALS | SBayesR | 0.022 | 0.572 | 2.32 (1.46 - 3.71) | 8.9E-07
ALS+CP | MTAG | 0.012 | 0.554 | 1.84 (1.16 - 2.90) | 2.9E-04
ALS+CP | SBLUP-SMTPRED | 0.014 | 0.557 | 2.02 (1.27 - 3.21) | 7.6E-05
ALS+CP | LDPF-SMTPRED | 0.015 | 0.561 | 1.90 (1.20 - 3.01) | 4.4E-05
ALS+CP | SBayesR-SMTPRED | 0.024 | 0.575 | 2.89 (1.80 - 4.63) | 2.4E-07
ALS+EA | MTAG | 0.010 | 0.551 | 1.60 (1.01 - 2.52) | 6.9E-04
ALS+EA | SBLUP-SMTPRED | 0.012 | 0.552 | 1.83 (1.16 - 2.89) | 2.5E-04
ALS+EA | LDPF-SMTPRED | 0.013 | 0.557 | 1.83 (1.16 - 2.89) | 1.7E-04
ALS+EA | SBayesR-SMTPRED | 0.023 | 0.575 | 2.53 (1.58 - 4.04) | 4.5E-07
ALS+SCZ | MTAG | 0.011 | 0.546 | 2.56 (1.59 - 4.11) | 4.0E-04
ALS+SCZ | SBLUP-SMTPRED | 0.014 | 0.552 | 1.94 (1.22 - 3.11) | 9.0E-05
ALS+SCZ | LDPF-SMTPRED | 0.015 | 0.554 | 2.29 (1.43 - 3.67) | 5.8E-05
ALS+SCZ | SBayesR-SMTPRED | 0.025 | 0.577 | 3.27 (2.03 - 5.28) | 1.5E-07
ALL | MTAG | 0.014 | 0.559 | 2.10 (1.33 - 3.34) | 8.3E-05
ALL | SBLUP-SMTPRED | 0.018 | 0.563 | 2.58 (1.61 - 4.11) | 1.1E-05
ALL | LDPF-SMTPRED | 0.018 | 0.568 | 2.26 (1.42 - 3.61) | 7.8E-06
ALL | SBayesR-SMTPRED | 0.027 | 0.580 | 3.15 (1.96 - 5.05) | 4.6E-08

*Odds ratio comparing individuals in the highest risk decile to the lowest.

Supplementary Table 4. ALS prediction results for clumped SNPs at various P-value thresholds (PT).

PT | SNP Set | NKR2 | AUC | Decile OR | P-value
all | HM3 | 0.010 | 0.555 | 2.181 | 1.05E-03
all | Full | 0.003 | 0.555 | 1.260 | 5.98E-02
5.0E-02 | HM3 | 0.003 | 0.526 | 1.212 | 6.57E-02
5.0E-02 | Full | 0.000 | 0.493 | 1.015 | 6.87E-01
5.0E-03 | HM3 | 0.005 | 0.534 | 1.540 | 2.37E-02
5.0E-03 | Full | 0.002 | 0.520 | 1.296 | 1.50E-01
5.0E-04 | HM3 | 0.006 | 0.539 | 1.451 | 1.18E-02
5.0E-04 | Full | 0.008 | 0.544 | 2.062 | 3.57E-03
5.0E-05 | HM3 | 0.010 | 0.549 | 1.718 | 7.68E-04
5.0E-05 | Full | 0.006 | 0.534 | 1.653 | 1.30E-02
5.0E-06 | HM3 | 0.017 | 0.559 | 2.822 | 1.96E-05
5.0E-06 | Full | 0.019 | 0.566 | 2.871 | 4.27E-06
5.0E-07 | HM3 | 0.013 | 0.553 | 1.931 | 1.91E-04
5.0E-07 | Full | 0.014 | 0.557 | NA* | 9.35E-05
5.0E-08 | HM3 | 0.003 | 0.527 | NA* | 5.26E-02
5.0E-08 | Full | 0.014 | 0.555 | NA* | 1.19E-04

These data correspond to Supplementary Figure 1.

*Decile odds ratios not calculated due to lack of separation between deciles.
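The P-value thresholding compared in Supplementary Table 4 can be sketched as a weighted allele count restricted to SNPs that pass a chosen threshold. The example assumes that the SNP list has already been clumped; the SNP identifiers, effect sizes, P-values and genotypes are hypothetical, and the function is illustrative rather than the software used in this analysis.

```python
def threshold_prs(genotypes, snps, p_threshold):
    """Polygenic score for one individual using SNPs with P <= p_threshold.

    genotypes: dict mapping SNP id -> effect-allele count (0, 1 or 2)
    snps: list of (snp_id, effect_size, p_value) from the discovery GWAS
    """
    score = 0.0
    for snp_id, beta, p_value in snps:
        if p_value <= p_threshold and snp_id in genotypes:
            score += beta * genotypes[snp_id]
    return score


# Hypothetical clumped discovery results and one individual's genotypes.
discovery = [
    ("rs_a", 0.10, 1e-9),
    ("rs_b", -0.20, 1e-3),
    ("rs_c", 0.05, 0.40),
]
individual = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

for pt in (5e-8, 5e-2, 1.0):
    print(f"PT = {pt:g}: score = {threshold_prs(individual, discovery, pt):.2f}")
```

Choosing the threshold that maximises accuracy in the target cohort would reuse the test data for model selection, which is the bias that motivated carrying forward the score built from all HM3 SNPs.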


Chapter 3: New insights from post-GWAS analysis of ALS: Genetic architecture, relevant cell-type, and causal genes

Unpublished: Post-GWAS analysis using the latest (November 2019) unpublished European ALS summary statistics

3.1 Introduction

In Chapter 2, I reported post-GWAS analyses of published ALS studies (20,806 ALS cases and 59,804 healthy controls of European ancestry). These analyses included ALS prediction strategies and genetic correlation. Subsequently, I was given access to the latest and largest GWAS meta-analysis for ALS (27,434 ALS cases and 112,018 controls), which is currently unpublished. With the considerably larger sample size, the latest ALS GWAS data potentially have more statistical power to investigate ALS aetiology. In this chapter, I repeated the ALS prediction analyses reported in Chapter 2 and used the latest ALS GWAS to conduct more detailed post-GWAS analyses that were previously not possible due to low statistical power.

The results section of Chapter 2 showed that ALS prediction using polygenic risk scores built from genome-wide significant variants (P-value threshold 5 × 10⁻⁸) from the ALS GWAS explained a smaller proportion of variance than scores built with a less stringent P-value threshold (5 × 10⁻⁶). This observation indicated that the current published ALS GWAS is still underpowered to detect all associated genetic variants, as variants with subtler effects have not passed the genome-wide significance threshold (P-value < 5 × 10⁻⁸). One possible solution to this problem is to increase the sample size of the ALS GWAS. This strategy has been successful in increasing the understanding of psychiatric disorders such as schizophrenia1,2 and major depressive disorder3. Increasing the GWAS sample size also means increased power to detect new ALS-associated genes in gene-based tests and more power for enrichment analysis using multi-omics methodologies. To date, GWAS analyses have successfully identified many genomic regions associated with ALS. Although these associated regions are not necessarily causal, fine-mapping strategies can be applied to them to identify possible causal genes.
Moreover, the GWAS data provide rich information about ALS, for example about genetic architecture, signatures of selection, and cell-type enrichment. The recent application of Bayesian methodology to GWAS summary statistics has allowed more flexibility in modelling genetic architecture and has increased the predictive ability of polygenic risk score (PRS) predictors (e.g., SBayesR)4. This approach also allows investigation of selection by examining the relationship between SNP effects and allele frequency (using SBayesS5, a new methodology from our lab). The combination of GWAS-associated regions with expression data from specific tissues allows associated regions to be linked to relevant cell types and possible causal mechanisms.

Currently, more than 50 genes spread across the human genome have been reported to be associated with ALS, but the putative causality of those genes remains unknown6,7. There are several reasons why it is hard to test the causality of ALS-associated genes. Firstly, ALS is a relatively uncommon disease with a lifetime risk of 1 in 300 people8. This rarity limits study sample sizes and hence the statistical power to infer the causality of genes through Mendelian randomisation9 and randomised clinical trials (if a medication targeting the gene already exists)10. Although this sample size problem can be overcome by international collaboration to collect multiple cohorts from around the world, the ethical constraints and logistics of sharing individual-level human data (genotyping/sequencing with demographic data) often make this impractical11,12. Secondly, the complex nature of ALS, which involves many genes, often limits the application of functional testing in animal models (especially mouse models). Animal models usually rely on comparing a single-gene mutation with the wild type13. While this approach is very effective for the study of Mendelian traits, its application to complex disorders is limited to validating possible pathways affected by a single-gene mutation. Current ALS animal models use worms, flies, zebrafish, and mice with single (or sometimes double) gene mutations (SOD1, C9orf72, TDP43, and FUS)14. While this approach can work for highly penetrant single-gene mutations in familial ALS research, it is unable to simulate the complex trait nature of sporadic ALS15. Moreover, there are physiological differences between animals and humans that potentially bias the results.

One possible method to prioritise putative ALS-associated genes is Summary-based Mendelian Randomisation (SMR)16. This method overcomes the first, logistical problem by relying on summary statistics, which are easily accessible and have fewer ethical constraints17.
Secondly, the GWAS and expression Quantitative Trait Loci (eQTL) summary statistics come from human genetic studies16,17, so the physiological bias between human and animal models is minimised. Moreover, GWAS and eQTL data with large sample sizes are available from various consortia, which boosts the statistical power for causal inference using the Mendelian Randomisation approach16. Much as correlated traits can be used to improve genetic prediction, I can gain insight into the shared genetic architecture between these traits and ALS by looking at the overlap of SMR results. To support SMR analysis, knowledge of the relevant cell-type/tissue is very important for selecting the eQTL summary statistics. Although it seems obvious that ALS is a motor neuron disease, the complex trait nature of sporadic ALS might mean that biological tissues/systems other than the nervous system are involved. To investigate whether the ALS-associated genomic regions are expressed in certain cell-types or tissues, Stratified Linkage Disequilibrium Score Regression (S-LDSR) methodology18 is applied here to ALS GWAS summary statistics with various cell-specific expression and functional annotations. Cell-type enrichment is based on the SNP-based heritability attributed to a cell-specific annotation, relative to the SNP-based heritability expected if heritability were spread evenly across SNPs, given the number of SNPs allocated to each annotation19. Therefore, I can detect the most enriched cell-types and make best use of the available data to help prioritise biological systems that are relevant to ALS. Here, I had access to the latest European ALS GWAS of 27,434 ALS cases and 112,018 healthy controls, which is the largest ALS GWAS to date (unpublished). This ALS GWAS is substantially larger than the previously published GWAS of 20,806 ALS cases and 59,804 healthy controls6. With the increase in sample size, the latest unpublished GWAS identified a new gene associated with ALS (RESP18). Using the increased statistical power, my aim is to add insight into ALS genetics by applying polygenic risk scores, Bayesian approaches to estimate genetic architecture and selection coefficient, cell-type enrichment analysis, and prioritisation of putatively associated genes using the SMR framework. The first genetic analysis is single-trait prediction using polygenic risk scores (similar to the analysis described in Chapter 2). I also explore genetic architecture using the Bayesian methods (SBayesR and SBayesS). For prioritisation of putative ALS causal genes, the cell-type enrichment of ALS-associated SNPs is tested using the stratified LD score regression approach.
The cell-type enrichment of the ALS-correlated traits (cognitive performance, education attainment, and schizophrenia), which could help explain their genetic correlation with ALS, is also considered. Then, SMR is used to test the possible causality of ALS-associated genes using the latest European ALS summary statistics (unpublished) and eQTL data from the tissues identified in the cell-type enrichment analysis. This chapter also reports the SMR genes that overlap between ALS and its genetically correlated traits, and the enrichment of ALS SMR signal among the significant SMR genes of those traits, which potentially adds insight into ALS aetiology.

3.2 Material and Method

3.2.1 ALS European GWAS 2019 dataset (unpublished)

The ALS 2019 European GWAS meta-analysis includes 113 ALS cohorts, totalling 139,452 individuals (27,434 ALS cases and 112,018 healthy controls). These 113 cohorts were genotyped using 9 different SNP arrays: IlluminaExome, Illumina317, Illumina370, Illumina550, Illumina610, Illumina660, IlluminaOmniExpress, Illumina2M, and IlluminaGSA. All samples in the study were classified into 7 strata based on the similarity of the genotyping array used (Supplementary Table 1). A common quality control (QC) process was applied to each of the 7 strata individually. SNPs were annotated, multi-allelic SNPs and SNPs with ambiguous alleles (AT/CG SNPs) were removed, and only SNPs that mapped to the HG19 genome reference according to dbSNP150 were retained. The initial QC removed poorly genotyped SNPs and individuals using PLINK1.9 (--geno 0.1 to remove SNPs missing in more than 10% of individuals, and --mind 0.1 to remove individuals missing more than 10% of their SNPs). Non-European ancestry outliers were removed based on the first 4 principal components (PCs) (>2.5 SD), using HapMap3-projected PCs calculated with EIGENSTRAT. A more stringent SNP QC then removed rare SNPs with minor allele frequency below 1% and SNPs deviating significantly from Hardy-Weinberg equilibrium (P-value < 5E-6). This was followed by a more stringent individual QC using PLINK1.9, which removed individuals with more than 2% SNP missingness and samples whose heterozygosity rate deviated by more than 0.2 from the average heterozygosity. For the individual phenotype QC, I removed individuals who failed the sex check or had unclear or missing phenotypes. The autosomal regions (chromosomes 1-22) were extracted. SNPs with differential missingness between cases and controls (--test-missing midp; P-value < 1e-4) were excluded. Duplicated individuals (or possibly twins) were removed by calculating relatedness with PLINK 1.9 (--genome) and applying a threshold of PI_HAT > 0.8.
Each stratum was imputed using the HRC reference panel (r1.1 2016)20 and phased using EAGLE 2.3 on the Michigan Imputation Server. SNPs with an INFO score < 0.6 or deviating from Hardy-Weinberg equilibrium (PLINK1.9 --hwe midp 1e-5) were excluded. After quality control, case-control GWAS analyses were conducted by fitting a logistic mixed model using SAIGE21, including the first 20 genetic PCs as covariates. The model was fitted on pruned imputed data (--indep-pairwise 50 25 0.1) for each stratum via a leave-one-chromosome-out procedure. SAIGE internally calculates a genetic relationship matrix to correct for relatedness and population structure. This means that SAIGE does not require removal of related samples, since genetic relatedness is modelled appropriately21–24. Summary statistics from each stratum were meta-analysed using METAL25 with inverse-variance weighting (based on the standard errors) and without genomic control correction. Two sets of meta-analysed summary statistics were created: the first was a meta-analysis of strata 1 to 6, which serves as the ALS discovery summary statistics (114,598 individuals); the second was a meta-analysis of all strata, which I call the ALS total summary statistics (139,452 individuals). The discovery summary statistics were used in the prediction analysis of stratum 7, while the total summary statistics were used for the rest of the post-GWAS analyses (genetic architecture, cell-type, and SMR).
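To illustrate the meta-analysis step, the following is a minimal sketch of a fixed-effect, standard-error-based (inverse-variance weighted) combination of per-stratum estimates for a single SNP. It is not the METAL implementation; the function name and the example effect sizes are hypothetical, and the sketch assumes that the per-stratum effects are log-odds ratios on the same allele coding.

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance weighted meta-analysis of one SNP.

    betas: per-stratum effect estimates (assumed to be log-odds ratios)
    ses:   per-stratum standard errors
    """
    weights = [1.0 / se ** 2 for se in ses]          # weight = 1 / variance
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)                    # SE of the pooled estimate
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-sided normal P-value
    return beta, se, z, p

# Hypothetical estimates from three strata (illustration only)
pooled_beta, pooled_se, pooled_z, pooled_p = ivw_meta(
    [0.10, 0.14, 0.08], [0.04, 0.05, 0.06]
)
```

Strata with smaller standard errors (typically the larger strata) contribute more to the pooled estimate, which is why no separate sample-size weighting is needed under this scheme.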

3.2.2 Polygenic risk scores

The same methods were used as in Chapter 2; they are repeated here for completeness. A polygenic risk score (PRS) was calculated for all individuals in the ALS case-control sample denoted stratum 7 (4,059 cases and 20,795 controls). The ALS GWAS discovery summary statistics (~114K individuals) were used to generate SNP weights. The SNPs used in the PRS calculations were limited to those found in HapMap 3 (HM3)26, as this SNP set was found to produce better performance in ALS prediction (see Figure 1), consistent with observations in Chapter 2. I only included SNPs available in all strata of the meta-analysed ALS GWAS discovery cohort (only SNPs with N~114K were included). PRSs were calculated for the individuals in stratum 7 using different methods to decide which SNPs were included and their effect sizes, but in each case the PRSs were calculated using PLINK 1.9 “--score” as the sum of risk alleles weighted by the given SNP effect sizes. The efficacy of each predictor was measured by the Nagelkerke-R2 of the logistic regression of case-control status on PRS (the R glm function27 for logistic regression and the fmsb package28 for the Nagelkerke-R2 calculation) and by R2 converted to the liability scale assuming a lifetime risk of 0.25%. In the basic PRS approach, SNPs with minor allele frequency (MAF) > 0.01 were clumped (PLINK --clump), which selects an independent SNP set by taking the most associated SNP in a genomic region and excluding any SNP with r2 > 0.01 with an already selected SNP. The SNPs were clumped with a range of P-value thresholds for selection into the PRS (see Figure 1), but the PRSs reported in the Results section are derived from all SNPs, either the full set or HapMap3 SNPs only; the latter is referred to as the standard PRS (STD_PRS). Including all SNPs in the prediction model, rather than selecting the P-value threshold based on results from the data, prevents the variance captured by the PRS from being biased by winner’s curse29, allowing fairer comparison across methods.

Since the clumping r2 threshold is arbitrary, I also used BLUP (Best Linear Unbiased Prediction) values of all SNPs to calculate a PRS_SBLUP, an approach that appropriately accounts for linkage disequilibrium (LD) between SNPs but assumes that SNP effects are normally distributed (a valid assumption for highly polygenic traits). Approximate BLUP estimates were derived from GWAS summary statistics using the SBLUP30 method implemented in the GCTA software. SBLUP is theoretically equivalent to LDPred-inf31 (in practice, implementation decisions can generate slightly different results). The Health and Retirement Study (HRS) cohort32 was used as the reference sample to calculate the LD structure. LDPred-Funct (LDPF)31 was also used to estimate SNP effects; this method includes functional annotation to weight SNP effects. The baseline-LD functional annotation provided by Gazal et al.33 and the HRS cohort as the LD reference were used to calculate LDPF-inf SNP weightings. Lastly, SNP effects were calculated using Summary-based BayesR (SBayesR)4, a method that models effect sizes using a mixture of normal distributions with different variances. This allows greater flexibility in modelling, potentially providing a better reflection of the underlying genetic architecture of ALS. The sparse LD matrix built from 10,000 unrelated UK Biobank34 individuals was used as the LD reference4.
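The two quantities at the core of this section can be sketched in a few lines. The sketch below is illustrative only and is not the PLINK or fmsb code used in the analysis: the function names are hypothetical, the PRS is written as a plain weighted sum of risk-allele counts, and Nagelkerke-R2 is computed from the log-likelihoods of a null logistic model and a model including the PRS, using the standard Cox-Snell rescaling.

```python
import math

def polygenic_score(allele_counts, weights):
    """PRS for one individual: risk-allele counts (0, 1 or 2) times SNP weights."""
    return sum(count * weight for count, weight in zip(allele_counts, weights))

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke R2 from the log-likelihoods of the null and PRS models.

    loglik_null:  log-likelihood of the intercept-only logistic model
    loglik_model: log-likelihood of the logistic model including the PRS
    n:            number of individuals
    """
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)  # upper bound of Cox-Snell R2
    return cox_snell / max_cox_snell

# Hypothetical individual with three SNPs (illustration only)
example_score = polygenic_score([0, 1, 2], [0.5, -0.2, 0.1])
```

Because the Cox-Snell R2 cannot reach 1 for a binary outcome, the division by its maximum is what puts Nagelkerke-R2 on a 0 to 1 scale; conversion to the liability scale is a separate step that additionally requires the assumed lifetime risk.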

3.2.3 Genetic architecture

The Summary-based BayesS (SBayesS)5 and SBayesR4 methods were used to investigate the genetic architecture of ALS from the distribution of SNP effect sizes. Both SBayesS and SBayesR estimate the polygenicity parameter (p), the proportion of (HapMap3) SNPs estimated to be causal. SBayesS also estimates the S parameter, which describes the relationship between effect size and MAF and indicates the strength of possible selection (over many generations). Given that only HapMap3 SNPs are used, the estimates are interpreted relative to the estimates for other diseases and traits. The full ALS GWAS summary statistics (~139K individuals) and the sparse LD matrix built from 10,000 unrelated UK Biobank34 individuals (as the LD reference) were used as input for SBayesS, which was run with default options. The option “--multi-chains 4” was used, which runs 4 chains with different starting values sampled at random to check the convergence of parameter estimates. Both SBayesS and SBayesR provided estimates of SNP-based heritability.
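For orientation, the role of the S parameter can be written as a prior on SNP effects. The expression below is a sketch of the BayesS-type prior as I understand it from the method's publication5, not a statement of the exact software parameterisation; the symbols f_j (minor allele frequency of SNP j) and sigma_beta^2 (common effect variance) are notation introduced here for illustration, and p is the polygenicity parameter described above.

```latex
\beta_j \;\sim\;
\begin{cases}
N\!\left(0,\; \left[\,2 f_j (1 - f_j)\,\right]^{S} \sigma_\beta^{2}\right) & \text{with probability } p, \\[4pt]
0 & \text{with probability } 1 - p .
\end{cases}
```

Under this form, S = 0 means that effect size variance does not depend on allele frequency, whereas S < 0 means that SNPs with lower minor allele frequency tend to have larger effects, the pattern expected under negative selection.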

3.2.4 Cell-type analysis

Cell-type analysis partitions SNP-based heritability based on the annotation of SNPs to genes, and of genes to the cell-types in which they are more highly expressed than in other cell-types (derived from gene expression studies). The partitioned SNP-based heritability is estimated using stratified LD score regression and expressed as enrichment of the SNP-based heritability relative to that expected from the proportion of SNPs in each annotation category18,19. The cell-type-specific expression annotations implemented as default in the LDSC software19,35 were from GTEx version 636 and the Franke Lab37. The cell-type enrichment significance threshold was set at P-value < 0.025 (-log(P-value) = 3.65) based on the False Discovery Rate, as suggested by the authors19. The P-values of the cell-type enrichments are plotted, with cell-types grouped into tissue categories as annotated in the same paper19. I used the total ALS GWAS summary statistics (~139K individuals) as the input for this analysis. I also compared the cell-type enrichment of ALS with that of the ALS-correlated traits identified in Chapter 2: cognitive performance (CP), education attainment (EA), and schizophrenia (SCZ).
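The enrichment definition above amounts to a ratio of two proportions, which the following minimal sketch makes explicit. It is not LDSC code; the function names and the example numbers are hypothetical. The second helper converts a P-value to the -log(P-value) scale using the natural logarithm, which reproduces the values tabulated in Table 1 (for example, 6.22E-04 gives 7.38).

```python
import math

def enrichment(h2_annot, h2_total, m_annot, m_total):
    """Proportion of SNP-based heritability divided by proportion of SNPs.

    h2_annot / h2_total: heritability attributed to the annotation vs. overall
    m_annot / m_total:   number of SNPs in the annotation vs. overall
    """
    return (h2_annot / h2_total) / (m_annot / m_total)

def neg_log_p(p_value):
    """-log(P-value) on the natural-log scale."""
    return -math.log(p_value)

# Hypothetical annotation holding 10% of SNPs but 20% of heritability
example_enrichment = enrichment(0.2, 1.0, 100_000, 1_000_000)
```

An enrichment above 1 means the annotation explains more heritability than expected from its share of SNPs; values near 1 indicate no enrichment.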

3.2.5 SMR analysis of ALS

The Summary data-based Mendelian Randomisation (SMR) method16 was used to provide evidence that SNP associations with ALS are mediated through gene expression. This method applies a Mendelian Randomisation framework to infer causality by integrating GWAS summary statistics and expression Quantitative Trait Loci (eQTL) summary statistics. While significant SMR test results support co-localisation of SNP-ALS associations and eQTLs, a significant association could also reflect linkage rather than causality. Application of the HEterogeneity In Dependent Instruments test (HEIDI test) distinguishes, where possible given the data, causality (or pleiotropy) from linkage16. The HEIDI test assesses whether the SMR effect sizes estimated from different SNPs in a region are heterogeneous, against the null hypothesis that the SMR effect sizes in the region simply reflect the LD between each SNP and the top eQTL SNP. In this case, a lack of heterogeneity (non-significant heterogeneity, or HEIDI P-value > 0.01) supports the hypothesis that the SNP-trait association is mediated through gene expression. If the HEIDI P-value is small, the gene may still be relevant to the trait, but the mediation from SNP to trait via gene expression may be more complex. The total ALS 2019 GWAS summary statistics (~139K individuals) and eQTL summary statistics from brain and blood tissues were used as inputs for SMR analysis. The eQTL summary data were downloaded from the SMR resource website (http://cnsgenomics.com/software/smr/#eQTLsummarydata). For this analysis, two tissues relevant to ALS (determined from the cell-type analysis) were selected, using eQTL summary statistics from the studies with the largest sample sizes. Two brain eQTL summary statistics data sets were used. The first was a meta-analysis of brain cis-eQTL data38 from GTEx version 739, the CommonMind Consortium40, and the ROSMAP project41, covering 10 human brain regions with an effective sample size of 1,194 individuals (called here “brain meta-eQTL”). The second was PsychENCODE eQTL data from the prefrontal cortex of 1,387 individuals (SCZ, Bipolar Disorder, and Autism Spectrum Disorder). The blood eQTL data were produced by the eQTLgen consortium from 31,684 individuals42 (called here “blood eQTLgen”). For all of these eQTL data, I excluded cis-eQTLs with MAF < 0.01. The MHC region was also excluded to avoid misinterpretation due to the complexity of the LD pattern in this region. Significance for SMR analysis was declared based on a stringent Bonferroni-corrected threshold of P < 0.05/number of probes in the eQTL summary statistics. The HEIDI test threshold was P-value > 0.01, which retains only results consistent with the SNP playing a causal role in disease mediated through gene expression.
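The SMR test statistic and the two filters described above can be sketched as follows. This is an illustration of the published SMR formulae16 as I understand them (effect estimate as the ratio of GWAS to eQTL effects at the top cis-eQTL, and an approximate chi-square test with 1 degree of freedom), not the SMR software itself; function and argument names are hypothetical, and the HEIDI P-value is taken as a given input because the HEIDI test requires LD information that is not modelled here.

```python
import math

def smr_test(b_gwas, se_gwas, b_eqtl, se_eqtl):
    """SMR effect estimate and approximate test for one probe at its top cis-eQTL."""
    z_gwas = b_gwas / se_gwas
    z_eqtl = b_eqtl / se_eqtl
    b_smr = b_gwas / b_eqtl                      # effect of expression on trait
    t_smr = (z_gwas ** 2 * z_eqtl ** 2) / (z_gwas ** 2 + z_eqtl ** 2)
    p_smr = math.erfc(math.sqrt(t_smr / 2.0))    # upper tail of chi-square, 1 df
    return b_smr, t_smr, p_smr

def passes_filters(p_smr, p_heidi, n_probes, alpha=0.05, heidi_min=0.01):
    """Bonferroni-corrected SMR threshold plus HEIDI P-value > 0.01."""
    return p_smr < alpha / n_probes and p_heidi > heidi_min

# Hypothetical summary statistics for one probe (illustration only)
b_smr, t_smr, p_smr = smr_test(b_gwas=0.1, se_gwas=0.02, b_eqtl=0.5, se_eqtl=0.05)
```

Because the test statistic is bounded by the smaller of the two squared z-scores, a probe can only reach SMR significance when both the GWAS and the eQTL association at the instrument are strong, which is one reason larger eQTL studies yield more detectable genes.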

3.2.6 SMR results comparison of ALS, CP, EA, and SCZ

Despite using the results from the largest GWAS for ALS to date, it is clear that the GWAS is still underpowered for genetic discovery. As a consequence, the current ALS GWAS also has limited capacity to detect causal genes in SMR analysis. To address this problem, I used traits that are genetically correlated with ALS based on SNP data (Chapter 2) to leverage ALS GWAS power, since these traits (CP, EA, SCZ) have very large, well-powered GWAS. I investigated whether power can be leveraged from these genetically correlated traits in analyses of ALS in two ways. First, I conducted SMR analyses for CP, EA, and SCZ using brain meta-eQTL and blood eQTLgen. Then, I extracted the significant genes (pSMR < 5x10-6) from the SMR analyses of those traits and compared them with the significant ALS SMR genes. Genes that overlap between these lists of significant SMR genes might indicate shared biological mechanisms between ALS and its correlated traits. Second, I further investigated the relationship between ALS and its correlated traits at a genome-wide scale (not only a list of genes). The ALS SMR summary statistics were subsetted based on the list of significant SMR gene probes from each correlated trait. I call this subset of ALS SMR probes “enriched-probes”. The enriched-probes P-values were sorted and plotted in a QQ-plot, and the lambda value quantifies the inflation of P-values relative to expectation. I set the lambda value of all ALS SMR probe P-values as the baseline of inflation and compared it with the lambda values of the enriched-probes of each ALS-correlated trait. A lambda value for the enriched-probes that is higher than the baseline lambda value indicates leveraged power to detect other ALS causal genes that have not yet reached significance in the ALS GWAS.
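The lambda value of a set of P-values can be computed as sketched below. This assumes the conventional definition (median observed chi-square statistic with 1 degree of freedom divided by its expected median under the null, approximately 0.455); the thesis does not state the exact formula used, so this is an assumption, and the function name and example P-values are hypothetical.

```python
from statistics import NormalDist, median

MEDIAN_CHI2_1DF = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_lambda(p_values):
    """Inflation factor: median observed chi-square (1 df) over its null median.

    Note: P-values below roughly 1e-16 would need a more careful conversion,
    because 1 - p/2 rounds to 1.0 in floating point.
    """
    norm = NormalDist()
    chi2 = [norm.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / MEDIAN_CHI2_1DF

# Hypothetical "enriched-probes" P-values skewed towards small values
example_lambda = inflation_lambda([0.001, 0.01, 0.05, 0.2, 0.6])
```

A subset of probes whose P-values follow the null gives a lambda close to 1, so comparing the lambda of the enriched-probes with the lambda of all ALS SMR probes indicates whether the subset carries more signal than the genome-wide background.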

3.3 Results

3.3.1 Polygenic risk scores

The results of the basic ALS PRS prediction into stratum 7 of the ALS European cohort (4,059 cases and 20,795 healthy controls) using the clumping approach are given in Figure 1. In general, the trend was similar to the results from the analysis of the smaller ALS GWAS meta-analysis presented in Chapter 2: the HM3 SNP set gave better prediction than the full set of SNPs, likely reflecting that this well-studied SNP set is better imputed.

Figure 1. Effect on ALS prediction accuracy of different SNP selection thresholds and SNP set choice. Either all imputed SNPs (9.4 million, “Full” - red) or HapMap 3 SNPs (0.7 million, “HM3” - blue) were used to build an ALS PRS using the European ALS GWAS discovery summary statistics. On the x-axis, “all” represents the whole SNP set without any clumping, “None” represents clumped SNPs without P-value thresholding, and the remaining x-axis labels (5e-1 etc.) represent clumped SNPs with the P-value threshold given on the label. While a SNP significance threshold of 5x10-2 provided the highest out-of-sample prediction, selecting this threshold based on accuracy in the test cohort would introduce a bias into the baseline prediction model. Instead, I took forward the PRS using all HM3 SNPs without exclusion based on significance threshold (“all” label on the x-axis), later described as the standard PRS (STD_PRS).


The results of the ALS PRS prediction into stratum 7 of the ALS European cohort (4,059 cases and 20,795 healthy controls) using four methods (STD_PRS, SBLUP, LDPF, and SBayesR) are summarised in Figure 2. Again, the trend was similar to the previous result in Chapter 2: SBayesR gave the best prediction performance with a Nagelkerke-R2 of 0.021, followed by LDPF at 0.017, SBLUP at 0.015, and STD_PRS at 0.014. On the liability scale, the corresponding R2 values were 0.010, 0.008, 0.007, and 0.006 for SBayesR, LDPF, SBLUP, and STD_PRS, respectively.

Figure 2. Prediction accuracy of ALS PRSs in the stratum 7 cohort, using PRSs derived from various methods. PRSs were constructed from the ALS GWAS discovery summary statistics using the methods shown on the x-axis. The performance of the predictors is measured using Nagelkerke-R2 (upper) and R2 on the liability scale (lower). The P-value of the logistic regression is shown on top of each bar.

3.3.2 Genetic architecture

The SNP-based heritability on the liability scale estimated from both SBayesS and SBayesR was ~0.027 (SD = 0.003). The polygenicity parameter (p) from SBayesS was 0.007 (SD = 0.001) (i.e. 0.7% of HapMap 3 SNPs are estimated to be causal) and the selection parameter (S-value) was -1.00 (SD = 0.09). These parameters are best interpreted relative to other diseases. The polygenicity parameter (p) of ALS was relatively low and the S-value of ALS was strongly negative compared to other diseases such as Type-2 Diabetes (p = 0.01, S-value = -0.30) and Depression (p = 0.08, S-value = -0.24)5. These comparisons indicate that ALS is relatively less polygenic, and that larger effect sizes are strongly associated with lower minor allele frequency, which provides relatively strong evidence for negative selection.

3.3.3 Cell-Type analysis

The cell-type analysis showed that ALS GWAS associations are enriched in more than one tissue category (Figure 3 & Table 1). Most of the significant cell-type enrichments (7 out of 9) were from the CNS (central nervous system) tissue category, with frontal lobe showing the strongest enrichment in ALS. There were also significant enrichments for dendritic cells (blood and immune tissue category) and epidermis (other tissue category). Comparing the cell-type analysis of ALS with that of its correlated traits (CP, EA, SCZ), I found that in general these traits were highly enriched in the CNS tissue category (Figure 4). All the CNS cell-types enriched in ALS were also strongly enriched in CP, EA, and SCZ, although their ranking by enrichment significance differed (Table 2). The CNS cell-types enriched in ALS were not significantly different from each other, most likely due to the limited power of the ALS GWAS. In contrast, in CP, EA, and SCZ the same CNS cell-types showed clear differences in enrichment significance. For example, from Table 2, the most enriched cell-type for ALS is frontal lobe, while in CP, EA, and SCZ frontal lobe was not the most strongly enriched cell-type. On the other hand, entorhinal cortex, which ranks second for ALS, had relatively strong enrichment (top 3) in CP, EA, and SCZ (Table 2).


Figure 3. Cell-type enrichment of the ALS GWAS total summary statistics (all strata). Cell-type analysis is based on stratified LD score regression using cell-specific enrichment from GTEx version 6 and Franke Lab single cell references. Each dot represents a specific cell-type and its colour represents its tissue group. The significance line (orange dotted line) is set at 3.65 (-log P-value of 0.025), as suggested by Finucane (2015). Bigger dots represent the cell-types significantly enriched in ALS, which are also listed in Table 1.

Table 1. Significantly enriched cell-types of ALS presented in Figure 3, sorted by -log(P-value)

Cell-type Name       Tissue Category   P-value     -logPval
Frontal Lobe         CNS               6.22E-04    7.38
Entorhinal Cortex    CNS               1.32E-03    6.63
Limbic System        CNS               2.45E-03    6.01
Visual Cortex        CNS               2.71E-03    5.91
Dendritic Cells      Blood/Immune      4.85E-03    5.33
Cerebral Cortex      CNS               5.23E-03    5.25
Hippocampus          CNS               1.07E-02    4.54
Epidermis            Other             1.79E-02    4.02
Mesencephalon        CNS               1.98E-02    3.92


Figure 4. Cell-type enrichment of the ALS GWAS total summary statistics (all strata) compared to the cell-type enrichment of its genetically correlated traits: Cognitive Performance (CP), Education Attainment (EA), and schizophrenia (SCZ). The specific cell-types that are enriched in ALS are shown as larger dots.

Table 2. Details of the Central Nervous System (CNS) cell-types significantly enriched in ALS, and their enrichment in the correlated traits (CP, EA, and SCZ), sorted by -log(P-value) within each trait

ALS                          CP                           EA                           SCZ
Cell-type name     -log(P)   Cell-type name     -log(P)   Cell-type name     -log(P)   Cell-type name     -log(P)
Frontal Lobe       7.4       Cerebral Cortex    33.6      Limbic System      43.5      Entorhinal Cortex  35.3
Entorhinal Cortex  6.6       Hippocampus        33.2      Cerebral Cortex    42.8      Cerebral Cortex    34.7
Limbic System      6.0       Entorhinal Cortex  33.0      Entorhinal Cortex  37.5      Limbic System      32.0
Visual Cortex      5.9       Limbic System      31.5      Hippocampus        36.8      Hippocampus        26.9
Cerebral Cortex    5.3       Visual Cortex      23.2      Frontal Lobe       21.9      Frontal Lobe       19.4
Hippocampus        4.5       Frontal Lobe       23.1      Visual Cortex      21.7      Visual Cortex      19.0
Mesencephalon      3.9       Mesencephalon      7.3       Mesencephalon      5.4       Mesencephalon      7.5

3.3.4 SMR analysis

The SMR analysis results for ALS that passed the HEIDI test and the SMR multiple-testing P-value threshold are shown in Table 3. The list of all significant SMR genes, regardless of whether they passed the HEIDI test, is provided in Supplementary Table 2. Some interesting patterns are found in Table 3. First, the SCFD1 gene on chromosome 14 was the most significant (smallest SMR P-value) across all eQTL summary statistics. Second, the SLC9A8 and RESP18 genes are significant in both brain eQTL data sets (brain meta-eQTL and prefrontal cortex PsychENCODE). Third, RESP18 is a newly associated gene that has not been reported before (Figure 6). Fourth, the blood eQTLgen analysis identified 4 significant genes (GGNBP2, DHRS11, ZNHIT3, MYO19) on chromosome 17 that might be causal for ALS, as previously reported43 (Figure 5). Fifth, the G2E3 gene is in close proximity to SCFD1 and the eQTL SNPs are in LD; however, it was only found significant in prefrontal cortex PsychENCODE, not in brain meta-eQTL, despite both eQTL data sets being from brain tissues (Figure 7). This difference may reflect power. Sixth, the brain meta-eQTL also confirms the MOBP gene as a possible causal gene, as found in previous ALS GWAS43,44.

Table 3. SMR analysis result across the eQTL summary statistics using ALS GWAS total summary statistics

eQTL Summary Statistics         Chromosome   Gene Name   SNP          GWAS P-value   SMR P-value   HEIDI P-value
Brain meta-eQTL                 2            RESP18      rs2385405    2.2E-06        2.6E-06       0.65
Brain meta-eQTL                 3            MOBP        rs631312     2.4E-10        4.7E-07       0.57
Brain meta-eQTL                 14           SCFD1       rs2070339    2.1E-12        1.9E-08       0.16
Brain meta-eQTL                 20           SLC9A8      rs6020069    8.5E-08        2.9E-06       0.15
Blood eQTLgen                   17           GGNBP2      rs11650008   1.2E-07        1.4E-07       0.13
Blood eQTLgen                   14           SCFD1       rs7144204    3.3E-12        1.5E-11       0.57
Blood eQTLgen                   17           DHRS11      rs11657469   1.9E-06        2.0E-06       0.30
Blood eQTLgen                   17           ZNHIT3      rs4796224    4.8E-07        7.1E-07       0.35
Blood eQTLgen                   17           MYO19       rs8882       4.1E-07        8.2E-07       0.69
Prefrontal cortex PsychENCODE   14           G2E3        rs2045180    7.5E-13        2.9E-10       0.06
Prefrontal cortex PsychENCODE   14           SCFD1       rs229176     1.1E-12        7.3E-11       0.03
Prefrontal cortex PsychENCODE   17           SARM1       rs35714695   1.0E-08        1.7E-07       0.59
Prefrontal cortex PsychENCODE   2            RESP18      rs6436136    2.3E-06        3.7E-06       0.99
Prefrontal cortex PsychENCODE   20           SLC9A8      rs6020193    1.3E-07        1.7E-06       0.07

SMR analysis using blood eQTLgen identified one region on chromosome 17 with four SMR-significant genes (GGNBP2, DHRS11, ZNHIT3, and MYO19), a region reported before in SMR analysis using a smaller ALS GWAS and blood eQTL data43 (Figure 5). Compared to the previously published result43, the top SMR signal for the GGNBP2 gene was more significant in the new analysis (previous pSMR = 4.6x10-6 compared to new pSMR = 1.4x10-7), and the other two genes reported before (ZNHIT3 and MYO19) were also significant. SMR analysis using brain meta-eQTL was also able to suggest possible causality in the ALS-associated region on chromosome 2 (Figure 6), by detecting RESP18 in a genomic region that is very dense with genes.

[Figure 5 plot: SMR locus plot for chromosome 17 (34.4 to 35.4 Mb). Upper panel: -log10 P (GWAS or SMR), with probes ENSG00000108278 (ZNHIT3), ENSG00000141140 (MYO19), ENSG00000005955 (GGNBP2), and ENSG00000108272 (DHRS11) labelled, annotated with pSMR = 3.27e-06. Lower panels: -log10 P (eQTL) for each of the four probes. Bottom track: genes in the region.]

Figure 5. Results of the SMR+HEIDI analysis that combines the data from GWAS and eQTL studies. In this locus on chromosome 17, there are 4 SMR-significant genes that also passed the HEIDI test in blood eQTLgen, 3 of which (GGNBP2, ZNHIT3, and MYO19) were reported before43.

[Figure 6 plot: SMR locus plot for chromosome 2 (219.8 to 220.8 Mb). Upper panel: -log10 P (GWAS or SMR), with probe ENSG00000182698 (RESP18) labelled, annotated with pSMR = 7.09e-06. Lower panel: -log10 P (eQTL) for ENSG00000182698 (RESP18). Bottom track: genes in the region, including RESP18.]

Figure 6. Results of the SMR+HEIDI analysis that combines the data from GWAS and eQTL studies. In this locus on chromosome 2, a new gene (RESP18) that passed the SMR and HEIDI tests in the brain meta-eQTL data is reported in a genomic region that is very dense with genes.


Figure 7. Results of the SMR+HEIDI analysis that combines the data from GWAS and eQTL studies. In this locus on chromosome 14, the G2E3 gene did not pass the SMR and HEIDI tests in the brain meta-eQTL data (upper) but is significant in the PsychENCODE data (lower).

Although SMR analysis of the ALS GWAS with brain and blood eQTL data was able to suggest some causal ALS genes, it is quite likely that the ALS GWAS is still underpowered for detecting all ALS causal genes. Therefore, I sought to leverage ALS GWAS power for detecting causal genes by using well-powered GWAS of traits genetically correlated with ALS (CP, EA, SCZ) in two ways. First, I compared the lists of SMR-significant genes that passed the SMR P-value threshold in ALS and its correlated traits (CP, EA, SCZ). The genes MYO19, GGNBP2, DHRS11, and ZNHIT3 were significant in both ALS and CP in blood eQTL. There were no overlapping significant genes with EA or SCZ. These observations suggest overlapping causal genes between ALS and CP, and a possible shared mechanism between the two. Second, I further investigated the relationship between ALS and its correlated traits at a genome-wide scale (not only a list of genes). The ALS SMR summary statistics were subsetted based on the list of significant SMR gene probes from each correlated trait (CP, EA, SCZ). I called this subset of ALS SMR probes “enriched-probes”; the number of enriched-probes for each trait is given in Table 4. The enriched-probes P-values were sorted and plotted in a QQ-plot, and the lambda value quantifies the inflation of P-values relative to expectation. The baseline P-value inflation for ALS is shown in Figures 8A and 8B, where the lambda values of the ALS SMR P-values from brain (brain meta-eQTL) and blood (eQTLgen) are both ~1.2. The CP enriched-probes showed the highest inflation among the correlated traits, with lambda values of 3.0 and 2.6 for brain and blood, respectively (Figures 8C and 8D). For the EA enriched-probes, inflation of ALS SMR P-values was weaker in brain, with a lambda value of 1.8, than in blood, at 2.5 (Figures 8E and 8F). The SCZ enriched-probes were not inflated (lambda values of 1.2 for brain and blood) relative to the ALS baseline (also 1.2) (Figures 8G and 8H).

Table 4. The number of enriched-probes for each ALS correlated trait in brain meta-eQTL and blood eQTLgen

Trait   Brain meta-eQTL   Blood eQTLgen
CP      74                161
EA      117               269
SCZ     31                113


Figure 8. QQ plots of ALS SMR P-values from (A) brain meta-eQTL and (B) blood eQTLgen, compared to QQ plots of ALS SMR P-values for the enriched-probes defined by cognitive performance (CP) significant probes in (C) brain and (D) blood, educational attainment (EA) significant probes in (E) brain and (F) blood, and schizophrenia (SCZ) significant probes in (G) brain and (H) blood.

3.4 Discussion

3.4.1 Prediction and genetic architecture parameter estimates

The results from the prediction and genetic architecture analyses were as expected and consistent with the single-trait prediction results presented in Chapter 2. The PRS built from SNP effect estimates from SBayesR, a prediction method that can account for non-infinitesimal genetic architectures, had higher prediction accuracy (Nagelkerke’s R2) for ALS than the other methods. The polygenicity parameter (p) estimate from SBayesS analysis was very low compared to other common diseases, which implies that the ALS genetic architecture is not highly polygenic. Despite this, the results of standard PRS prediction using clumping with various P-value thresholds showed that the PRS from genome-wide significant clumped SNPs (P-value lower than 5x10-8) had lower predictive ability than PRSs from clumped SNPs with less stringent P-value thresholds. This observation implies that many associated genetic variants have not yet reached genome-wide significance at the current sample size. The S parameter of SBayesS suggests that ALS-associated variants are under strong selection (S parameter ~ -1.0). This is a surprising observation, since ALS is unlikely to be directly under selection, given its typically late onset after reproductive age. It may be explained by selection on traits genetically correlated with ALS. Although the fitness traits directly under selection are most likely unknown, here I investigated the ALS genetically correlated traits CP, EA, and SCZ (Chapter 2). However, any interpretation needs to be treated with caution since the SNP-based heritability of ALS was very low (0.027), and therefore the contribution of common genetic effects to ALS pathology is relatively small. One possible explanation for the low SNP-based heritability of ALS is disease heterogeneity. As reviewed in Chapter 1, ALS disease manifestation, onset, and progression are very heterogeneous. Genetic analyses assume that the ALS diagnosis results from a single, albeit complex, biological process.
If, in fact, the ALS diagnosis can be given to symptoms which result from different biological processes, then by considering these heterogeneous ALS cases as a single case set could lead to low estimates of SNP-based heritability45. Another explanation for the low SNP-based heritability estimates might be the less than ideal study design of the ALS GWAS meta-analysis. As reviewed in chapter 1, ALS is a relatively uncommon disease with relatively fast lethal progression, therefore recruiting large numbers of participants (especially for ALS cases) for GWAS study takes a long time and is only achieved through international collaboration. Many of the contributing ALS cohorts

112 genotyped only cases, often in small batches and, and many different genotyping chips were used. The latest GWAS meta-analysis used in this chapter comprised 113 cohort of various European ancestries, which were genotyped using 9 different SNP-arrays. This different experimental design across cohorts is likely to introduce unintended batch effects and population stratification especially from the need to match the case samples with control samples that had been genotyped independently. The current solution to deal with this difference by grouping the cohort based on the cohort genotyping chip, most likely solved only some of the batch effects, but confounding of batch effects with case control status could mean that real genetic signal is removed when using genetic PCs to correct for population stratification. Despite the low SNP-based heritability estimate, the strong negative selection (S parameter of SBayesS of ALS ~ -1.0) still indicate the tendency of stronger ALS SNP-effects with lower minor allele-frequencies. This observation supports an ALS genetic architecture that involved rare-variants44. To investigate rare variants in ALS, whole genome sequencing is needed46,47.
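As a rough illustration of what an S parameter of about -1.0 implies, the sketch below assumes the commonly used parameterisation in which the variance of a SNP effect is proportional to [2f(1-f)]^S, where f is the minor allele frequency (written f here to avoid confusion with the polygenicity parameter p). The function name and the allele frequencies are hypothetical examples, not estimates from this chapter.

```python
def relative_effect_variance(maf: float, s: float) -> float:
    """Expected SNP effect variance, up to a constant, under var(beta) ∝ [2f(1-f)]^S."""
    if not 0.0 < maf <= 0.5:
        raise ValueError("maf must be in (0, 0.5]")
    heterozygosity = 2.0 * maf * (1.0 - maf)
    return heterozygosity ** s


# Illustrative comparison (assumed values): a rare SNP versus a common SNP at S = -1.
rare = relative_effect_variance(0.01, s=-1.0)
common = relative_effect_variance(0.50, s=-1.0)
ratio = rare / common
print(f"Effect variance ratio, MAF 0.01 vs 0.50: {ratio:.1f}")
```

Under this assumed model, S = 0 gives the same expected effect variance at every allele frequency, whereas a more negative S assigns progressively larger expected effects to rarer variants, which is the pattern interpreted above as a signature of negative selection.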

3.4.2 Cell-type analysis of ALS and its correlated traits

In the cell-type analysis, the ALS GWAS associations showed enrichment in gene signatures of three tissue-type classes (CNS, blood/immune, and “other”). The two most noticeable enrichments were for frontal lobe (the strongest enrichment in CNS) and dendritic cells (DC). The frontal lobe enrichment was expected, since ALS primarily affects motor neurons7, which are concentrated in the motor cortex, a part of the frontal lobe of the brain48. The DC enrichment is potentially interesting. DC are “antigen-presenting cells”, cells that present surface antigens of unrecognised cells to T-cells, and so act as messengers between the innate and adaptive immune systems49. Some studies link DC with neuroinflammation in animal models of multiple sclerosis, stroke, brain tumours, Alzheimer's disease, Parkinson's disease, and epilepsy50. There were other, weaker enrichments for skin-related tissues (epidermis), which are likely correlated with the DC enrichment. DC are found at high concentration in tissues that are in contact with the external environment, such as the epidermis of the nose, lungs, stomach and intestines51. Since the DC and epidermis enrichments in the cell-type analysis are possibly highly correlated, the epidermis enrichment (in the “other” tissue category) might be confounded with, or driven by, the DC enrichment. Despite this possible confounding, I cannot exclude the possibility that ALS might involve other cell types that would be classified in the “other” tissue category.

Cell-type analysis of CP, EA, and SCZ showed that the three traits had strong enrichment for CNS-related tissues. All the significantly enriched CNS-related cell types found for ALS were also enriched in CP, EA, and SCZ. In general, this similar pattern of CNS tissue enrichment might explain part of the genetic correlation between ALS and these traits. Exploring the specific cell types, I found that the frontal cortex, which was the most significant in ALS, was not the most significantly enriched CNS cell type in the other traits (CP, EA, SCZ). Instead, the entorhinal cortex, which was the second most significant in ALS, was consistently highly enriched in CP, EA, and SCZ. The entorhinal cortex is a part of the temporal lobe that functions as a hub for memory formation, recall, spatial navigation, and the perception of time52. It is the main interface connecting the hippocampus and the neocortex for accessing memory53. Dementia as a consequence of a dysfunctional entorhinal cortex is well known in patients with Alzheimer’s disease, traumatic brain injury, and stroke54. While these memory-related roles are obviously relevant to CP, EA, and SCZ (schizophrenia cognitive dysfunction55), they are also interesting for ALS in relation to frontotemporal dementia (FTD). As the mechanism of the ALS-FTD spectrum is still unclear56, this overlapping enrichment in the entorhinal cortex adds some weight to this direction for future ALS-FTD research.

3.4.3 SMR results of ALS and their overlap with CP, EA, SCZ

The number of significant results from the SMR analyses was approximately the same for integration with the brain eQTL data (brain meta-eQTL and PsychENCODE) and the blood eQTL data. Notably, 31K samples contributed to the blood eQTLgen data set, and hence 15,719 eQTLs passed the level of genome-wide significance (P-value < 5x10-8) for inclusion in the SMR analyses. In contrast, the smaller brain meta-eQTL (1.1K samples) and PsychENCODE (1.3K samples) data sets had only 7,735 and 11,120 eQTLs, respectively, that passed genome-wide significance and were used in the SMR analyses. Since fewer eQTLs entered the SMR analyses for the brain data sets, the P-value threshold for declaring significance of the SMR test was higher (less stringent) for the brain data sets (i.e., significance thresholds of 7.0x10-6, 4.7x10-6, and 3.2x10-6 for brain meta-eQTL, PsychENCODE and blood eQTLgen, respectively). Hence, the higher power of the blood eQTL data set is partly offset in the SMR analyses by the more stringent Bonferroni correction for the number of tests conducted with the blood eQTLgen data. Another factor contributing to the similar number of SMR associations for the brain eQTL data sets, despite their considerably smaller sample sizes, is that ALS is a motor neuron disease in which most of the pathology occurs in nerve-related cells. This explanation is consistent with the results of the cell-type analyses, where most of the significantly enriched cell types were CNS related.

The most significant SMR results for ALS across all eQTL data sets (brain meta-eQTL, PsychENCODE, and blood eQTLgen) were for the SCFD1 gene. This gene had the most significant P-value for the SMR test and a large HEIDI P-value, which provides robust evidence in support of the SNP influencing variation in gene expression, which in turn influences liability to ALS. SCFD1 is relatively well annotated in gene and protein databases; it has a crucial role in protein metabolism and in vesicle-mediated protein transport between the Golgi and the endoplasmic reticulum57,58. Its function therefore fits well with known ALS pathology through the mitochondrial endoplasmic reticulum (ER) stress mechanism reviewed in Chapter 1 (1.2.2 Cell-biology of ALS known genes). Interestingly, SCFD1 (SNP variant rs10139154) was also found to be associated with ALS in a Chinese population (1074 cases and 927 controls) using PCR-RFLP analysis58. In the same genomic region as SCFD1, SMR analysis found the gene G2E3 to be significant using the PsychENCODE prefrontal cortex eQTL data, but not using the brain meta-eQTL data, despite both sets of eQTL summary statistics being derived from brain tissues. This observation might be caused by the larger variation within the brain meta-eQTL summary statistics, which are a meta-analysis of eQTL summary statistics from multiple brain regions38, whereas the PsychENCODE gene expression data were strictly from the prefrontal cortex.
G2E3 is annotated as a ligase enzyme (an enzyme that joins two large molecules by forming a new chemical bond) that helps the attachment of ubiquitin (a small protein that helps to regulate protein activity, mostly functioning as a signal for protein transport or protein degradation)59. This gene function is also consistent with the protein transport or protein debris accumulation pathology of ALS. The other genes found significant in the SMR analyses using brain eQTLs (both brain meta-eQTL and PsychENCODE), SLC9A8 and the new ALS-associated gene RESP18, also have functions relevant to ALS in brain tissues. SLC9A8 is not well studied and has limited annotation in the NCBI genes database. It is a member of the SLC (SoLute Carrier) protein family and is an intermembrane protein that regulates the exchange of cellular compounds, which could be relevant to the debris-accumulation hypothesis of ALS aetiology60. RESP18 is annotated as an endocrine-regulated specific protein found in the ER61, and according to the Protein Atlas database62 this gene is almost exclusively expressed in brain-related tissues. Some studies of Parkinson’s disease have linked RESP18 gene mutations with the ER stress mechanism in neuronal cell death61,63,64, which might be relevant to ALS ER stress pathology; RESP18 could therefore potentially interact with SCFD1, which is also involved in ER stress.

The SMR analysis using blood eQTLgen identified a set of four genes in the same region of chromosome 17 (GGNBP2, DHRS11, ZNHIT3 and MYO19). Similar SMR results were reported by Benyamin et al. (2017)43 using the 2016 ALS GWAS (van Rheenen et al. 2016, ~36K individuals44) and blood eQTL data (Westra et al. 2014, N=3511 individuals65). The results from the 2019 ALS GWAS suggest that SMR analysis cannot pinpoint which gene is specifically relevant based on the ALS-associated SNPs on chromosome 17. From the gene annotation point of view, all four genes have functions that might be considered relevant to ALS pathology. In the GWAS database, GGNBP2 and DHRS11 are associated with body mass index66 and cognitive performance67 (a trait correlated with ALS). GGNBP2 variants are also associated with schizophrenia68,69 and multiple sclerosis70. According to the NCBI genes database and the GeneCards database, both genes (GGNBP2 and DHRS11) are annotated as expressed in the ER and therefore might also contribute to the ER stress mechanism of ALS. ZNHIT3 is involved in zinc metabolism in mitochondria. This might be of interest for ALS pathology, since abnormal zinc homeostasis has been reported to be a prognostic factor associated with worse survival of ALS patients71,72. MYO19 encodes an actin-based motor molecule that is heavily involved in mitochondrial transport and positioning. This function seems very relevant to ALS aetiology. One of the unique features of motor neurons is their long and asymmetrical shape, which makes the cells heavily dependent on active mitochondrial transport for the energy they need to survive. It is possible that all these genes (GGNBP2, DHRS11, ZNHIT3 and MYO19) are involved in ALS, as it is known that genes with related functions are often co-located73.

The last gene in the chromosome 17 region implicated by SMR analysis is SARM1, which is well annotated as being involved in Wallerian-like nerve degeneration (injury-induced programmed degeneration of a nerve cell, typically starting at the axon of the injured cell)74. Although Wallerian-like degeneration is triggered by intracellular metabolic stress or physical injury of cells, the degeneration process depends on the presence of the SARM1 protein75. A study of the interaction of SARM1 with the ALS-associated gene TDP-43 implies that nerve injury progressed significantly less towards cell death after SARM1 deletion74. The SMR analysis of this latest ALS GWAS also identified MOBP as a possible causal gene. MOBP had already been identified as an ALS-associated gene in the published ALS GWAS, but was not found significant in the previous SMR analysis43. MOBP is annotated as a protein that stabilizes myelin, which supports the cytoskeletal defects pathology.

In comparing the overlap of SMR results between ALS and CP, I observed that MYO19 was significantly associated with both traits, which could indicate a functional basis for the previously discussed genetic correlation results (see Chapter 2). When I focused on the subset of ALS SMR results that used only the significant SNPs for CP (there were 74 CP-enriched SNPs; Table 4), I observed a massive inflation of the test statistics (lambda = 3.0) compared with the inflation of the ALS-only test statistics (lambda = 1.2). This seems to suggest that many genes with SMR evidence for association with CP are also involved in ALS, more than expected by chance. Although most of these associations, with the exception of MYO19, did not reach the significance threshold in the ALS SMR results, I leveraged the power of the CP analysis (225 independent genome-wide significant SNPs67, 74 significant SMR associations) to indicate that there is association signal in the ALS results. I also detected an inflated lambda for the ALS SMR results when they were subsetted to the EA significant SMR results (117 significant SMR SNPs in brain and 269 significant SMR SNPs in blood), but the inflation in brain (lambda = 1.8) was not as great as for the CP analysis (lambda = 3.0).
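The quantities discussed in this section can be illustrated with a short sketch. It assumes the approximate SMR test statistic described by Zhu et al.16, T_SMR = z_GWAS^2 z_eQTL^2 / (z_GWAS^2 + z_eQTL^2), compared against a chi-squared distribution with one degree of freedom; a Bonferroni threshold of 0.05 divided by the number of tests; and the usual definition of lambda as the median chi-squared statistic divided by its expected median under the null (about 0.455). The function names and the example z-scores are hypothetical, and because the exact number of SMR tests in each analysis may differ from the eQTL counts quoted above, the computed threshold is only indicative.

```python
import math
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_NULL_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail P-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def smr_test(z_gwas: float, z_eqtl: float) -> tuple[float, float]:
    """Approximate SMR chi-squared statistic and its P-value from two z-scores."""
    stat = (z_gwas**2 * z_eqtl**2) / (z_gwas**2 + z_eqtl**2)
    return stat, chi2_1df_pvalue(stat)


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def genomic_lambda(p_values: list[float]) -> float:
    """Median chi-squared (1 df) implied by the P-values, divided by the null median."""
    norm = NormalDist()
    stats = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(stats) / CHI2_1DF_NULL_MEDIAN


# Hypothetical example: one gene with assumed GWAS and eQTL z-scores.
smr_stat, smr_p = smr_test(z_gwas=5.0, z_eqtl=10.0)
threshold = bonferroni_threshold(15719)  # eQTL count quoted for blood eQTLgen
print(f"T_SMR = {smr_stat:.1f}, P = {smr_p:.2e}, significant: {smr_p < threshold}")
```

Applying `genomic_lambda` to the SMR P-values of a subset of genes (for example, those significant for CP) and to all ALS SMR P-values gives the kind of comparison reported above, where a value well above 1 in the subset indicates enrichment of association signal.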

3.4.4 Study limitations

This study has several limitations. First, the latest ALS GWAS is still underpowered to detect genetic variants with modest effects on ALS. Second, the ALS GWAS study design (small cohort sizes, cases and controls genotyped independently) could introduce technical artefacts that reduce the effective power for the given sample size. Third, I relied on GWAS summary statistics, which only consider associations with common variants. There is some evidence to suggest that rare variants may be important in ALS, relatively more so than for other common diseases. The SNP-based heritability estimated from the GWAS summary statistics is only ~2%, while the heritability estimated from family data is ~40%, another indication that rare variants may be particularly important in ALS aetiology, although very large sample sizes are needed to detect these in GWAS case-control samples. Fourth, for the SMR analyses, currently only brain and blood eQTL data sets have sample sizes large enough for eQTL detection, which limited this study to these 2 tissues. From the cell-type analysis, I know that ALS genetic risk could involve more than just neuronal tissue. Despite these limitations, SMR analyses with brain and blood eQTL data were able to fine-map possible causal genes in the associated genomic regions detected by the ALS GWAS, an important outcome as most ALS-associated genomic regions have dense or complex gene structure (Figures 5-7).

3.5 Conclusions

The largest European ALS GWAS to date offers some interesting insights into the genetic architecture and possible aetiology of ALS. I found that the genetic architecture of ALS is not highly polygenic and that rare variants are very likely to explain a large proportion of ALS heritability. The aetiological insights come from analyses of cell-type-specific enrichment. I found that ALS genetic risk factors could implicate more than just motor neuron dysfunction. The results suggest possible involvement of the immune system in ALS and, from the comparison of the cell-type enrichment of ALS with that of correlated traits (CP, EA, SCZ), that entorhinal cortex enrichment might provide a future direction for ALS-FTD research. I suggest further investigation of the expression of the genes with significant SMR results, since they are not widely studied, in contrast to the well-known ALS genes SOD1 and C9orf72. I was able to identify novel putative ALS causal genes from SMR analyses (RESP18 and MYO19) using blood and brain eQTL summary statistics. I showed that leveraging GWAS results from well-powered studies of traits genetically correlated with ALS, such as CP and EA, can increase the effective power for discovery of ALS-associated genes.

3.6 References

1. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
2. Pardiñas, A. F. et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat Genet 50, 381–389 (2018).
3. Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet 50, 668–681 (2018).
4. Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun 10, 1–11 (2019).
5. Zeng, J. et al. Bayesian analysis of GWAS summary data reveals differential signatures of natural selection across human complex traits and functional genomic categories. bioRxiv 752527 (2019) doi:10.1101/752527.
6. Nicolas, A. et al. Genome-wide Analyses Identify KIF5A as a Novel ALS Gene. Neuron 97, 1268-1283.e6 (2018).
7. Brown, R. H. & Al-Chalabi, A. Amyotrophic Lateral Sclerosis. http://www.nejm.org/doi/10.1056/NEJMra1603471 (2017) doi:10.1056/NEJMra1603471.
8. Ryan, M., Heverin, M., McLaughlin, R. L. & Hardiman, O. Lifetime Risk and Heritability of Amyotrophic Lateral Sclerosis. JAMA Neurol (2019) doi:10.1001/jamaneurol.2019.2044.
9. Davies, N. M., Holmes, M. V. & Smith, G. D. Reading Mendelian randomisation studies: a guide, glossary, and checklist for clinicians. BMJ 362, k601 (2018).
10. Kendall, J. M. Designing a research project: randomised controlled trials and their principles. Emergency Medicine Journal 20, 164–168 (2003).
11. van Schaik, T. A., Kovalevskaya, N. V., Protopapas, E., Wahid, H. & Nielsen, F. G. G. The need to redefine genomic data sharing: A focus on data accessibility. Applied & Translational Genomics 3, 100–104 (2014).
12. Knoppers, B. M. & Joly, Y. Introduction: the why and whither of genomic data sharing. Hum Genet 137, 569–574 (2018).
13. Hardouin, S. N. & Nagy, A. Mouse models for human disease. Clinical Genetics 57, 237–244 (2000).
14. Morrice, J. R., Gregory-Evans, C. Y. & Shaw, C. A. Animal models of amyotrophic lateral sclerosis: a comparison of model validity. Neural Regen Res 13, 2050–2054 (2018).
15. Damme, P. V., Robberecht, W. & Bosch, L. V. D. Modelling amyotrophic lateral sclerosis: progress and possibilities. Disease Models & Mechanisms 10, 537–549 (2017).
16. Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet 48, 481–487 (2016).
17. Thelwall, M. A. et al. Is useful research data usually shared? An investigation of genome-wide association study summary statistics. http://biorxiv.org/lookup/doi/10.1101/622795 (2019) doi:10.1101/622795.
18. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics 47, 1228–1235 (2015).
19. Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics 50, 621–629 (2018).
20. The Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet 48, 1279–1283 (2016).
21. Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet 50, 1335–1341 (2018).
22. Zhang, Y. & Pan, W. Principal Component Regression and Linear Mixed Model in Association Analysis of Structured Samples: Competitors or Complements? Genet Epidemiol 39, 149–155 (2015).
23. Chen, H. et al. Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models. Am J Hum Genet 98, 653–666 (2016).
24. Conomos, M. P., Reiner, A. P., McPeek, M. S. & Thornton, T. A. Genome-Wide Control of Population Structure and Relatedness in Genetic Association Studies via Linear Mixed Models with Orthogonally Partitioned Structure. bioRxiv 409953 (2018) doi:10.1101/409953.
25. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
26. The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
27. R: The R Project for Statistical Computing. https://www.r-project.org/.
28. Nakazawa, M. fmsb: Functions for Medical Statistics Book with some Demographic Data. (2019).
29. Balding, D. J., Moltke, I. & Marioni, J. Handbook of Statistical Genomics. (John Wiley & Sons, 2019).
30. Robinson, M. R. et al. Genetic evidence of assortative mating in humans. Nature Human Behaviour 1, 0016 (2017).
31. Marquez-Luna, C. et al. Modeling functional enrichment improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. bioRxiv 375337 (2018) doi:10.1101/375337.
32. Sonnega, A. et al. Cohort Profile: the Health and Retirement Study (HRS). Int J Epidemiol 43, 576–585 (2014).
33. Gazal, S. et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nature Genetics 49, 1421–1427 (2017).
34. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
35. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet 47, 291–295 (2015).
36. The GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 348, 648–660 (2015).
37. Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat Commun 6, 5890 (2015).
38. Qi, T. et al. Identifying gene targets for brain-related traits using transcriptomic and methylomic data from blood. Nat Commun 9, (2018).
39. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
40. Fromer, M. et al. Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nature Neuroscience 19, 1442–1453 (2016).
41. Ng, B. et al. An xQTL map integrates the genetic architecture of the human brain’s transcriptome and epigenome. Nature Neuroscience 20, 1418–1426 (2017).
42. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis. bioRxiv 447367 (2018) doi:10.1101/447367.
43. Benyamin, B. et al. Cross-ethnic meta-analysis identifies association of the GPX3-TNIP1 locus with amyotrophic lateral sclerosis. Nature Communications 8, 611 (2017).
44. van Rheenen, W. et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nat Genet 48, 1043–1048 (2016).
45. Wray, N. R. & Maier, R. Genetic Basis of Complex Genetic Disease: The Contribution of Disease Heterogeneity to Missing Heritability. Curr Epidemiol Rep 1, 220–227 (2014).
46. Project MinE: study design and pilot analyses of a large-scale whole-genome sequencing study in amyotrophic lateral sclerosis. Eur J Hum Genet 26, 1537–1546 (2018).
47. Turro, E. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 1–9 (2020) doi:10.1038/s41586-020-2434-2.
48. Bear, M. F., Connors, B. W. & Paradiso, M. A. Neuroscience: Exploring the Brain, 3rd Edition. (Lippincott Williams and Wilkins, 2006).
49. Dhodapkar, M., Mackall, C. L. & Steinman, R. M. Dendritic Cells and Adaptive Immunity. in Williams Hematology (eds. Kaushansky, K. et al.) (McGraw-Hill Education, 2015).
50. Ludewig, P. et al. Dendritic cells in brain diseases. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 1862, 352–367 (2016).
51. Haniffa, M., Gunawan, M. & Jardine, L. Human skin dendritic cells in health and disease. J Dermatol Sci 77, 85–92 (2015).
52. Schultz, H., Sommer, T. & Peters, J. The Role of the Human Entorhinal Cortex in a Representational Account of Memory. Front Hum Neurosci 9, (2015).
53. Kitamura, T., Macdonald, C. J. & Tonegawa, S. Entorhinal–hippocampal neuronal circuits bridge temporally discontiguous events. Learn Mem 22, 438–443 (2015).
54. Davis, A. E., Gimenez, A. M. & Therrien, B. Effects of entorhinal cortex lesions on sensory integration and spatial learning. Nurs Res 50, 77–85 (2001).
55. Bowie, C. R. & Harvey, P. D. Cognitive deficits and functional outcome in schizophrenia. Neuropsychiatr Dis Treat 2, 531–536 (2006).
56. Ahmed, R. M. et al. Amyotrophic lateral sclerosis and frontotemporal dementia: distinct and overlapping changes in eating behaviour and metabolism. Lancet Neurol 15, 332–342 (2016).
57. Hou, N., Yang, Y., Scott, I. C. & Lou, X. The Sec domain protein Scfd1 facilitates trafficking of ECM components during chondrogenesis. Developmental Biology 421, 8–15 (2017).
58. Chen, Y. et al. An association study between SCFD1 rs10139154 variant and amyotrophic lateral sclerosis in a Chinese cohort. Amyotroph Lateral Scler Frontotemporal Degener 19, 413–418 (2018).
59. Brooks, W. S. et al. G2E3 Is a Dual Function Ubiquitin Ligase Required for Early Embryonic Development. J Biol Chem 283, 22304–22315 (2008).
60. Taylor, J. P., Brown Jr, R. H. & Cleveland, D. W. Decoding ALS: from genes to mechanism. Nature 539, 197–206 (2016).
61. Sosa, L. et al. Biochemical, biophysical, and functional properties of ICA512/IA-2 RESP18 homology domain. Biochim Biophys Acta 1864, 511–522 (2016).
62. Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
63. Jiang, M. et al. Downregulation of miR-384-5p attenuates rotenone-induced neurotoxicity in dopaminergic SH-SY5Y cells through inhibiting endoplasmic reticulum stress. Am J Physiol Cell Physiol 310, C755-763 (2016).
64. Huang, Y. et al. RESP18 is involved in the cytotoxicity of dopaminergic neurotoxins in MN9D cells. Neurotox Res 24, 164–175 (2013).
65. Westra, H.-J. & Franke, L. From genome to function by studying eQTLs. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 1842, 1896–1902 (2014).
66. Zhu, Z. et al. Shared genetic and experimental links between obesity-related traits and asthma subtypes in UK Biobank. J Allergy Clin Immunol 145, 537–549 (2020).
67. Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics 50, 1112 (2018).
68. Ikeda, M. et al. Genome-Wide Association Study Detected Novel Susceptibility Genes for Schizophrenia and Shared Trans-Populations/Diseases Genetic Effect. Schizophr Bull 45, 824–834 (2019).
69. Autism Spectrum Disorders Working Group of The Psychiatric Genomics Consortium. Meta-analysis of GWAS of over 16,000 individuals with autism spectrum disorder highlights a novel locus at 10q24.32 and a significant overlap with schizophrenia. Mol Autism 8, 21 (2017).
70. Wang, Y. et al. Genetic overlap between multiple sclerosis and several cardiovascular disease risk factors. Mult Scler 22, 1783–1793 (2016).
71. Lopes da Silva, H. F. et al. Dietary intake and zinc status in amyotrophic lateral sclerosis patients. Nutr Hosp 34, 1361–1367 (2017).
72. Smith, A. P. & Lee, N. M. Role of zinc in ALS. Amyotroph Lateral Scler 8, 131–143 (2007).
73. Thévenin, A., Ein-Dor, L., Ozery-Flato, M. & Shamir, R. Functional gene groups are concentrated within chromosomes, among chromosomes and in the nuclear space of the human genome. Nucleic Acids Res 42, 9854–9861 (2014).
74. Loring, H. S. & Thompson, P. R. Emergence of SARM1 as a Potential Therapeutic Target for Wallerian-type Diseases. Cell Chemical Biology 27, 1–13 (2020).
75. Conforti, L., Gilley, J. & Coleman, M. P. Wallerian degeneration: an emerging axon death pathway linking injury and disease. Nature Reviews Neuroscience 15, 394–409 (2014).


3.7 Supplementary materials

Supplementary Table 1. Genotyping chips used and sample numbers for each stratum

| Strata | SNP platforms | Total | Cases | Controls | Male | Female | SNPs post-imputation | Usage |
|---|---|---|---|---|---|---|---|---|
| 1 | Exome | 13762 | 2257 | 11505 | 6380 | 7382 | 11967771 | Discovery |
| 2 | Illumina317 | 3551 | 1481 | 2070 | 2329 | 1222 | 11875619 | Discovery |
| 3 | Illumina370 | 4309 | 1724 | 2585 | 2127 | 2182 | 11885573 | Discovery |
| 4 | Illumina550, Illumina610, Illumina660 | 46220 | 3406 | 42814 | 20849 | 25371 | 11475328 | Discovery |
| 5 | IlluminaOmniExpress, Illumina2M | 46756 | 14507 | 32249 | 25061 | 21695 | 11333419 | Discovery |
| 6 | Cancelled | | | | | | | Replication |
| 7 | GSA | 24854 | 4059 | 20795 | 11271 | 12213 | | Replication |
| TOTAL | | 139452 | 27434 | 112018 | 68017 | 70065 | | |


Supplementary Table 2. Significant genes from SMR analysis (without HEIDI test).

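| eQTL summary statistics | Chromosome | Gene name | SNP | GWAS P-value | SMR P-value | HEIDI P-value |
|---|---|---|---|---|---|---|
| Brain meta-eQTL | 2 | RESP18 | rs2385405 | 2.2E-06 | 2.6E-06 | 6.5E-01 |
| Brain meta-eQTL | 3 | MOBP | rs631312 | 2.4E-10 | 4.7E-07 | 5.7E-01 |
| Brain meta-eQTL | 9 | C9orf72* | rs2282240 | 1.1E-06 | 1.4E-06 | 7.5E-09 |
| Brain meta-eQTL | 14 | SCFD1 | rs2070339 | 2.1E-12 | 1.9E-08 | 1.6E-01 |
| Brain meta-eQTL | 20 | SLC9A8 | rs6020069 | 8.5E-08 | 2.9E-06 | 1.5E-01 |
| Blood eQTLgen | 5 | TNIP1* | rs12518386 | 4.5E-08 | 7.6E-08 | 1.0E-04 |
| Blood eQTLgen | 5 | GPX3* | rs12518386 | 4.5E-08 | 7.0E-08 | 6.0E-05 |
| Blood eQTLgen | 9 | C9orf72* | rs10967981 | 2.8E-07 | 3.3E-07 | 2.5E-22 |
| Blood eQTLgen | 12 | TBK1* | rs61931525 | 1.3E-07 | 2.1E-07 | 8.7E-03 |
| Blood eQTLgen | 14 | SCFD1 | rs7144204 | 3.3E-12 | 1.5E-11 | 5.7E-01 |
| Blood eQTLgen | 17 | GGNBP2 | rs11650008 | 1.2E-07 | 1.4E-07 | 1.3E-01 |
| Blood eQTLgen | 17 | DHRS11 | rs11657469 | 1.9E-06 | 2.0E-06 | 3.0E-01 |
| Blood eQTLgen | 17 | ZNHIT3 | rs4796224 | 4.8E-07 | 7.1E-07 | 3.5E-01 |
| Blood eQTLgen | 17 | MYO19 | rs8882 | 4.1E-07 | 8.2E-07 | 6.9E-01 |
| Pre-frontal cortex PsychENCODE | 2 | RESP18 | rs6436136 | 2.3E-06 | 3.7E-06 | 0.99 |
| Pre-frontal cortex PsychENCODE | 14 | G2E3 | rs2045180 | 7.5E-13 | 2.9E-10 | 0.06 |
| Pre-frontal cortex PsychENCODE | 14 | SCFD1 | rs229176 | 1.1E-12 | 7.3E-11 | 0.03 |
| Pre-frontal cortex PsychENCODE | 17 | SARM1 | rs35714695 | 1.0E-08 | 1.7E-07 | 0.59 |
| Pre-frontal cortex PsychENCODE | 20 | SLC9A8 | rs6020193 | 1.3E-07 | 1.7E-06 | 0.07 |

Genes that did not pass the HEIDI test (pass criterion: HEIDI P-value > 0.01) are marked with an asterisk.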


Chapter 4: MND gut microbiome study:

From standard case-control study to survival analysis

Published as:

Progression and survival of patients with motor neuron disease relative to their faecal microbiota

Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, DOI: 10.1080/21678421.2020.1772825 2020

Contribution details at “List of publication that included in this thesis” number 3 (page 8).

With part of literature review (introduction part 4.1.1 and 4.1.2) from published work:

Gut microbiota in ALS: possible role in pathogenesis?

Expert Review of Neurotherapeutics, 19:9, 785-805, DOI: 10.1080/14737175.2019.1623026

Contribution details at “List of publication that included in this thesis” number 2 (page 7).

4.1 Introduction

Amyotrophic lateral sclerosis (ALS) is the most common form of motor neuron disease (MND), a late-onset and progressive neurodegenerative disease characterised by the destruction of motor neurons1. Acknowledging that ~90% of MND cases are ALS, and for consistency with the clinical terminology in our data, we use MND to refer to ALS in this chapter. Recent MND research has identified many possible mechanisms that might contribute to motor neuron breakdown2–4. These include accumulation of misfolded protein, systemic changes in the immune system that lead to excessive neuroinflammation, and energy deficiency due to metabolic dysfunction1,4–6. Current research suggests that these mechanisms of motor neuron breakdown can be affected by gut microbiota dysbiosis7–9. There are 100 trillion bacteria forming complex microbiome communities in the human gastrointestinal tract, which play important roles in human metabolism, immunity, and the gut-brain axis10. These bacteria support human metabolism by synthesising vitamins and by fermenting complex molecules in food, making them easier to digest and absorb, which makes the bacteria a significant regulator of human energy metabolism11,12. This mechanism is potentially interesting, since metabolic change and dysregulation have been observed in a proportion of MND patients13,14. There is also evidence that links the gut microbiome with the immune system. Erny et al. reported that a change in gut bacteria could affect the development of the immune system and microglial maturation through short-chain fatty acid production15. Moreover, microglia, the brain’s immune cells, are crucial for the course of neurodegeneration16,17. Mouse model studies support the view that gut microbiome composition can affect brain microglia activation15,18. From all of this evidence, we hypothesise that the gut microbiome might be associated with the development of MND.
To investigate the association between gut microbiota and MND, we used a case-control design with marker-gene (16S rRNA) methodology. To justify our chosen method, we begin this chapter by reviewing existing methodologies for investigating the gut microbiota. We also review the published MND microbiota studies to give a general overview of current MND microbiome research. We then describe how our chosen methodology and approach can fill the gaps in the existing studies.

4.1.1 Methods for detection and analysis of the gut microbiota

The human gut microbiome is a complex community of microorganisms10,19,20. The classic method for detecting living bacteria relies on culturing microorganisms from samples, and is therefore culture dependent10. Given the complexity of the human gut microbiome community, this method has several limitations. First, the process of culturing microorganisms on media may introduce biases towards certain types of bacteria, for example, eliminating those that are slow-growing or unable to grow in the chosen media10,20. Second, the morphological and physiological tests used to identify the cultured bacteria often fail to detect the existence of new, unannotated bacteria21. To deal with these limitations, culture-independent methods have been developed that take advantage of rapidly improving sequencing technologies, and these have become the preferred methods for studying the gut microbiome community10,20. Currently, two main sequencing methods are used: the marker-gene and metagenome methods. Each method has its own strengths and weaknesses, and the choice between them depends on the scientific questions and analysis goals22.

4.1.1.1 Marker-gene methodology

The marker-gene method provides an overview of the microbial composition in the samples studied22,23. Marker-gene analyses use Polymerase Chain Reaction (PCR) to amplify a genomic region selected for specific properties22,24. The region of interest usually includes highly variable sequence segments flanked by conserved segments. The highly variable regions serve as barcodes that are unique to specific classes of microorganism, and the flanking conserved regions serve as binding sites for PCR primers that are common across classes of microorganism. The genes commonly used in marker-gene analyses are 16S rRNA for identifying bacteria and archaea, and 18S rRNA and the internal transcribed spacer (ITS) for fungi24. A key variable in marker-gene data is the sequencing technology used and the length of DNA reads generated25. For example, Illumina NextSeq 500 is able to generate 2x250 paired-end reads that can effectively cover 400-500 bp (i.e., two variable regions of the 16S rRNA gene). The marker-gene approach is well-tested, fast, and has relatively low cost22–24. It is also reliable since the specific marker-gene amplification reduces the possibility of contamination from host DNA24. Despite the distinct advantages of this method, there are limitations. The coverage is at the level of genus, and therefore can miss the true complexity of the microbial community in the sample. Moreover, PCR amplification bias can provide an inaccurate view of the species distribution22,23,26.

Post-sequencing quality control (QC) of marker-gene sequencing data includes a threshold for minimum read counts (usually >10,000 per sample) and other base-calling quality criteria to minimise biases in microbiome composition and identity22. The reads that pass the QC process are clustered according to their similarity and each cluster is compared to a database (as implemented in the QIIME pipeline27, now updated to QIIME228). There are several methods to cluster these reads. The first approach allocates reads to operational taxonomic units (OTUs), grouping sequence reads by similarity (usually 97%) into single features and identifying each OTU by comparison to database reference sequences27,29. A closed-reference approach only takes account of known and annotated species, losing information from sequence reads that cannot be aligned to the database reference. An open-reference approach extends the closed-reference approach by assigning unannotated species to the known ones with the closest genomic similarity. A de novo clustering approach does not use reference sequence information but uses similarities within the data to cluster reads. The motivation is to be unbiased in assessing the microbial community, but merging reads into OTUs can mask subtle but biologically meaningful sequence variation22,30. The current preferred approach is oligotyping (the default in QIIME228), in which reads are concatenated using only the highly informative (i.e., most variable between taxonomic classes) sequence segments. The resulting oligotypes provide more precise taxonomic clustering by accounting for nucleotide variation in the context of sequence that is otherwise similar30–32. This approach is implemented in the Deblur32 and DADA230 algorithms.
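As a minimal illustration of similarity-based clustering, the sketch below groups reads greedily at a 97% identity threshold. It is a toy under stated assumptions, not the QIIME implementation: reads are assumed to be pre-aligned and of equal length, identity is a simple per-position match fraction, and the function names and example sequences are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity assumes equal-length, pre-aligned reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clusters(reads, threshold=0.97):
    """Assign each read to the first centroid it matches, else start a new OTU."""
    centroids = []  # one representative sequence per OTU
    clusters = []   # reads assigned to each OTU, parallel to `centroids`
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[i].append(read)
                break
        else:
            # No existing centroid is similar enough: this read seeds a new OTU.
            centroids.append(read)
            clusters.append([read])
    return centroids, clusters


# Hypothetical 100 bp reads: a base sequence, a 2-mismatch variant, a divergent read.
base = "ACGT" * 25
variant = "TT" + base[2:]          # 98% identical to `base`
divergent = "T" * 10 + base[10:]   # 92% identical to `base`
centroids, clusters = greedy_otu_clusters([base, variant, divergent])
```

In this example the variant is merged into the same OTU as the base sequence, which shows how a 97% threshold hides small sequence differences that denoising approaches aim to retain.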
The databases used as the reference to identify cluster classes include specialised, curated marker-gene databases such as the Ribosomal Database Project (RDP)33, GreenGenes34 and SILVA35. One key decision is whether to accept all DNA reads and normalise the data to account for differing numbers of reads between samples, or to perform rarefaction, in which reads are discarded from some samples to generate a similar number of reads across samples. This decision has been a subject of debate36, but rarefaction is on balance favoured since analyses are based on proportions of taxonomic classes in the sample. Even though marker-gene analysis does not directly provide functional information about the taxonomic classes identified, functional information can be predicted22,37. The Human Microbiome Project has provided a database of bacterial genome information38, which can be used to predict functional annotations of each OTU detected in marker-gene analysis22. Current functional prediction methods (like PICRUSt37) are usually supported by evolutionary models to provide confidence levels on functional annotation. Under the assumption that more closely related microorganisms tend to have similar gene content, the functional annotation of each OTU can be predicted from its closest reference in the Human Microbiome Project database37.

In summary, the marker-gene method is well-tested, fast, and has relatively low cost22–24. It is reliable since the specific marker-gene amplification reduces the possibility of contamination from host DNA24. Despite the distinct advantages of this method, there are limitations. First, the coverage is at the level of genus, and therefore can miss the true complexity of the microbial community in the sample. Second, PCR amplification bias can provide an inaccurate view of the species distribution22,23,26. Third, functional insight into the gut microbiome community is predicted from known studies22,23,37. It is not possible to detect putatively new functional activity, and this could be a limiting factor in studies designed to understand the aetiology of complex disease23,39, such as MND.
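The choice between rarefaction and normalisation discussed above can be illustrated with a short sketch. It assumes each sample is a simple taxon-to-count mapping; the function names, taxa and counts are hypothetical and chosen only to show the behaviour.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a {taxon: count} sample.

    Returns None when the sample has fewer than `depth` reads (it would be dropped).
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand counts into one label per read, then draw without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(pool, depth)
    rarefied = {taxon: 0 for taxon in counts}
    for taxon in drawn:
        rarefied[taxon] += 1
    return rarefied


def relative_abundance(counts):
    """Total-sum scaling: keep every read and convert counts to proportions."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical samples with very different sequencing depths.
deep = {"Bacteroides": 6000, "Faecalibacterium": 3000, "Dorea": 1000}
shallow = {"Bacteroides": 300, "Faecalibacterium": 150, "Dorea": 50}

deep_rarefied = rarefy(deep, depth=500)        # 9,500 reads are discarded
shallow_rarefied = rarefy(shallow, depth=500)  # all 500 reads are kept
too_shallow = rarefy(shallow, depth=4000)      # sample would be removed
```

Rarefaction equalises depth at the cost of discarding reads (and whole samples below the chosen depth), whereas total-sum scaling keeps all reads but leaves the samples with unequal sampling effort.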

4.1.1.2 Whole genome metagenomic methodology

The whole genome metagenomic method offers a more complete picture of the functional properties of the gut microbial community by sequencing all microorganism DNA isolated from the gut22,39. Sequencing all DNA in the samples gives several advantages over marker-gene analysis. First, a finer taxonomic resolution can be achieved22,23,40. Second, it provides information about all genes present in gut communities, which should contribute to an improved understanding of the functional and metabolic activities of the communities39,40. Third, in principle, when all DNA sequence information is available, taxonomic identification is not limited to a specific class of microorganism, but can identify bacteria, fungi, protozoa, and viruses if their DNA is present in the samples22,40. However, this method requires deeper sequencing, which makes it much more expensive40 than the marker-gene approach. It also requires laborious and complicated sample preparation, and the sample includes significant quantities of host DNA40,41, which then requires more total reads (and hence higher cost) to ensure sufficient reads for the microbial DNA.

Whole metagenome data require more extensive QC and analysis protocols41. The key difference from marker-gene analysis is that an extensive host DNA removal process is required. For a human host, DNA removal can be assisted by mapping the reads to the human reference genome or by using specialised tools like BMTagger42 or BBmap43. The reads that pass QC can be analysed in two ways, using either unassembled or assembled read methods22,41. The unassembled method compares the reads directly to a reference sequence database, while the assembled method first assembles genomes from overlapping reads before comparing these to the database references. The assembled method requires deeper sequencing (> 300 million reads/sample)44, but allows exploration beyond taxonomic and gene annotations (such as secondary/multi-gene biosynthetic pathways45,46 and metabolic reconstructions46,47). The unassembled method is more accommodating of limited sequencing depth and remains scalable for large and complex datasets44. The unassembled method treats each read independently when comparing to the reference sequences, so the choice of database is crucial22. For well-studied environments like the human gut, curated databases like RefSeq48 or MetaHit49 (a specialised human gut database) or protein databases like Pfam50 or UniRef51 are preferred. Tools for OTU profiling and functional analysis are usually available in separate packages (except for MEGAN52); for example, the specialised OTU profiling tools Kraken53, MetaPhlAn54, and TIPP55 are widely used, while HUMAnN256 was developed for functional annotation. In contrast, the assembled methods gather short DNA sequences into longer sequences (contigs), before binning them according to their similarity to assemble partial or full microorganism genomes. For the assembled method, metaSPAdes57, metaVelvet58, and MEGAHIT59 are recommended as metagenomic read assembly tools, while MaxBin260 and CONCOCT61 are recommended for binning. The comparison of microbial community representations between samples requires normalisation, and the software edgeR62 and DESeq263 are recommended for these analyses.
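The idea behind the unassembled approach, in which each read is compared independently to reference sequences, can be sketched with exact k-mer matching. This is a simplified illustration only and does not reproduce the algorithm of any of the tools named above; the reference sequences, taxon labels and reads are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, reference_kmers, k):
    """Return the reference sharing the most k-mers with the read, or None if none match."""
    read_kmers = kmers(read, k)
    hits = Counter({name: len(read_kmers & ref) for name, ref in reference_kmers.items()})
    name, shared = hits.most_common(1)[0]
    return name if shared > 0 else None


K = 5
# Hypothetical reference sequences standing in for a curated database.
references = {
    "taxon_A": "ACGTTGCATGCCATGA",
    "taxon_B": "GGGTTTAAACCCGGGT",
}
reference_kmers = {name: kmers(seq, K) for name, seq in references.items()}

# Each read is treated independently, so the result depends entirely on the database.
reads = ["TTGCATGCC", "TTAAACCCG", "CACACACAC"]
assignments = [classify_read(read, reference_kmers, K) for read in reads]
profile = Counter(a for a in assignments if a is not None)
```

The third read matches nothing in the toy database and is left unassigned, which is the practical consequence of the database dependence noted above.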

4.1.1.3 Making sense of microbiome diversity and composition

For both marker-gene and metagenome approaches, an overview of microbial community variation is usually given by alpha and beta diversity statistics22,40. Alpha diversity is a measure of the taxonomic diversity within individual samples that can be compared across sample groups. There are several measures of alpha diversity. Microbial taxonomic richness can be measured by the Chao1 index, which uses an extrapolation method to account for differences in the numbers of reads between samples. However, this approach can be rather sensitive to biases that result from technical differences in read numbers between samples22. Another measure of alpha diversity is the Shannon index, which also takes into consideration the evenness of OTU counts within samples and is less affected by the read-difference bias22. In contrast, beta diversity measures the dissimilarity between each pair of samples to generate a distance matrix between all pairs of samples. Beta diversity can be measured by quantitative and qualitative metrics22,64. The quantitative metrics, like Bray-Curtis65 and weighted UniFrac66, use the OTU read counts in feature tables for the calculations. Qualitative metrics, like binary Jaccard67 and unweighted UniFrac68, calculate the distance based only on the presence/absence of each OTU. To test beta diversity differences between groups statistically, non-parametric permutation tests like PERMANOVA and ANOSIM are used22. To visualise beta diversity clustering, Principal Component Analysis (PCA) or Principal Coordinates Analysis (PCoA) is usually applied22,29. Most microbiome analysis tools include alpha and beta diversity calculations69. Some analytical tools like QIIME27, Mothur70, and Vegan71 provide extensive statistical support to further analyse differences in alpha and beta diversity across groups and in longitudinal studies22,69.

Microbiome data are usually presented as a matrix that contains the abundance of taxonomic or gene features for each sample in a given batch/cohort22,72. This output might look simple, but it has several complicating properties. Microbiome data are highly dimensional, often containing thousands of different microorganism features, and very sparse (many zeros in the matrix), which requires careful statistical consideration in analysis22,73. At the overview level of microbial community abundance, alpha and beta diversity provide meaningful summary statistics. However, identification of the microbial taxa that differ statistically in abundance among phenotypic groups is particularly challenging. Other than being highly dimensional and sparse, the abundance information is compositional (i.e. the taxon abundances are proportions) and so rarely conforms to common distribution patterns, which makes most classical parametric statistical tests less likely to be applicable36,74. The currently recommended approach to overcome the problem of zero-abundance taxa is to add a pseudo-count and to apply a transformation (e.g., Analysis of Composition of Microbiomes (ANCOM)74, Isometric Log Ratio transform75) so that the comparison is made between log-ratios of the abundance of each taxon and of all other taxa.
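The diversity measures and the pseudo-count idea described above can be written out in a few lines. The sketch below uses the standard textbook formulas (Shannon entropy with natural logarithm, bias-corrected Chao1, Bray-Curtis and binary Jaccard dissimilarities). The centred log-ratio (CLR) transform is used here as a simpler stand-in for the log-ratio idea; it is neither ANCOM nor the isometric log-ratio transform. The pseudo-count of 1 and the example counts are assumptions made for illustration.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness: observed + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def bray_curtis(x, y):
    """Quantitative dissimilarity: sum|x - y| / sum(x + y)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))


def binary_jaccard(x, y):
    """Presence/absence dissimilarity: 1 - shared taxa / taxa present in either sample."""
    present_x = {i for i, a in enumerate(x) if a > 0}
    present_y = {i for i, b in enumerate(y) if b > 0}
    return 1 - len(present_x & present_y) / len(present_x | present_y)


def clr(counts, pseudo_count=1):
    """Centred log-ratio transform after adding a pseudo-count to every taxon."""
    logs = [math.log(c + pseudo_count) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical OTU counts for two samples over the same six OTUs.
sample_a = [10, 10, 10, 10, 0, 0]   # even community, four OTUs
sample_b = [35, 1, 1, 2, 1, 0]      # one dominant OTU, several rare ones

alpha_a = (shannon(sample_a), chao1(sample_a))
alpha_b = (shannon(sample_b), chao1(sample_b))
beta_quantitative = bray_curtis(sample_a, sample_b)
beta_qualitative = binary_jaccard(sample_a, sample_b)
```

The two samples illustrate why the measures can disagree: the second sample has the higher estimated richness (Chao1) because of its rare OTUs, but the lower Shannon index because one OTU dominates.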

4.1.1.4 New and emerging methodologies

The most recent biomolecular tools (like next generation sequencing, mass spectrometry, and liquid chromatography) that support multi-omics analyses allow deeper investigation of gut microbiome communities. The methods described in sections 4.1.1.1 and 4.1.1.2 rely only on DNA information, so their applications are limited to detecting the microbiome composition and its potential function. However, the community activity/metabolism of the microbiota at a given time could be different from the potential function predicted by a metagenomic approach76. Metatranscriptomic and metaproteomic approaches may provide better descriptions of the functional activity of the community. Moreover, DNA information alone cannot differentiate between live microorganisms and transient DNA, a distinction that can be investigated through culturomics77.

Metatranscriptomics is the large-scale profiling of mRNA transcripts from microbial communities using next generation RNA-seq technology78. Sequencing the mRNA quantifies the genes transcribed at a given time, and from these it can be possible to infer the microorganisms generating them by comparison to a reference database76,78. Metatranscriptomic approaches give a more accurate assessment of microbiome activity at a given time than genomic approaches79. However, even with current technology, mRNA extraction from bacterial communities remains challenging. Firstly, RNA is notoriously hard to isolate in extraction protocols since it is unstable compared to DNA76,80. Secondly, a huge proportion of successfully isolated RNA is ribosomal RNA (rRNA), which decreases the amount of microbial mRNA that can be isolated; the sample therefore needs deeper sequencing to ensure enough data are obtained to detect differentially expressed mRNA76,81. Thirdly, it is very challenging to differentiate microbial and host RNA76,78,81.

The metaproteomic approach investigates microbial community activity at the protein level. This approach uses mass spectrometry (MS) or liquid chromatography (LC) to identify and quantify proteins, providing a protein expression profile of the sample microbial community at a given time82. Metaproteomics may be preferable to metatranscriptomics because protein is generally more stable than mRNA82,83. However, as a newly emerging method, fewer applications of metaproteomics have been published than of metatranscriptomics, mainly due to a lack of consistent protocols for sample preparation, limited availability of bioinformatics tools, and limited database annotation84,85. In addition, the complexity of the gut microbiome means that metaproteomic data analyses have high requirements for computational resources84,86.

Advances in genomic technologies have led to a resurgence of culture-dependent methods on a large scale, hence culturomics77. This method uses multiple culture conditions and longer incubation times to overcome the weaknesses associated with standard culture methods87.

The current culturomics implementation uses MALDI-TOF mass spectrometry for detailed OTU detection77,87. This approach quantifies the viability of microbiota, the environmental requirements for them to grow, and the growth dynamics of each OTU77,87,88. This information is very important for the production of probiotics that are often difficult to grow in culture88 and provides a deeper understanding of the contribution of microbial dynamics to disease pathology77,89. However, culturing in multiple conditions can be very difficult and requires specialised tools. For example, culturing obligate anaerobes is a challenging task, requiring specific bacteriological equipment to provide an oxygen-free environment77.

To date, these new approaches have not been implemented to study MND, perhaps in part because initial studies using marker genes (16S) have not provided clear conclusions on the involvement of the gut microbiome in MND disease pathology. However, acknowledging the limitations of marker-gene methods and metagenomics for studying complex diseases like MND, these new methods may add insights into MND pathology.

4.1.2 Current MND gut microbiome studies

Given the challenges of analysing gut microbiome data, it is clear that large sample sizes and replication studies are needed to draw strong conclusions about the relationship between the gut microbiome and disease status and progression. These challenges add to the practical difficulty of collecting faecal samples unbiased by diet and medication in the context of a disease like MND. To date, published studies have been small (Table 1).

Table 1. Published studies of MND/ALS gut dysbiosis in humans

| No. of ALS/MND subjects | No. of controls | Technique for identifying gut microbes | Taxonomic level that was compared | Result | Reference |
|---|---|---|---|---|---|
| 5 | 96 | DNA extraction from faecal samples followed by rRNA-targeted polymerase chain reaction (PCR), using primers for 16S rRNA. | Phylum and class level. | All patients had low Firmicutes/Bacteroidetes (F/B) ratio, an indicator of dysbiosis. 3 of 5 patients showed inflammatory biomarkers in stool analysis: elevated faecal secretory IgA, calprotectin, eosinophilic protein. 3 of 5 patients showed low Ruminococcus spp; 1 patient showed high Bacteroides-Prevotella, Odoribacter spp, Barnesiella spp, and Bacteroides vulgatus; 1 patient showed high Bacteroides vulgatus. 2 of 5 patients showed low levels of short chain fatty acids (SCFA). | Rowin et al (2017)90 |
| 6 | 5 | DNA extraction from faecal samples followed by 16S rRNA amplicon sequencing analysis. | Phylum, class, order, family and genus level. | Low F/B ratio in ALS patients, indicating dysbiosis. Significantly increased harmful microorganisms (genus Dorea) and significantly reduced beneficial microorganisms (genus Oscillibacter, Anaerostipes, Lachnospiraceae) in ALS patients. | Fang et al (2016)91 |
| 25 | 32 | DNA extraction from faecal samples; quantification of 16S rDNA copy number by qRT-PCR; 16S rRNA gene sequencing analysis was performed to assess diversity and distribution of bacterial taxa. | Genus level. | Quantity of intestinal microbiota (measured by 16S rRNA gene copy) did not differ between controls and ALS patients. 16S rRNA gene sequencing showed no significant difference in diversity and relative abundance of faecal microbiota between controls and ALS patients. | Brenner et al (2018)92 |
| 50 | 50 | DNA extraction from faecal samples; PCR-denaturing gradient gel electrophoresis to investigate eubacteria and yeast population. | Family and genus level. | Higher abundance of E. coli and enterobacteria and low abundance of yeast in ALS patients. PCR-denaturing gradient gel electrophoresis demonstrated a cluster distinction between bacterial profiles of ALS patients and controls. | Mazzini et al (2018)93 |

As Table 1 shows, one of the foremost concerns regarding MND microbiome studies is that sample sizes might be too small to achieve enough statistical power to differentiate the microbiome compositions of MND patients and healthy controls. The human microbiome studies of MND published to date (2016-2018) give conflicting results90–93, which might reflect the high variability of the human gut microbiome and reproducibility issues in this field of study. Many case-control microbiome studies use animal models18,94,95, which have less within-group variation and are studied in a controlled environment. Sample sizes referenced from animal studies are therefore likely to underestimate the sample sizes needed to study real gut microbiome variation in humans. Given the variability of human gut microbiome composition, this sample size problem must be resolved to ensure reproducibility of MND gut microbiome studies. Another problem in current studies is the strong assumptions underlying analyses. For example, Mazzini et al. (2018) had a reasonable sample size compared to previous studies (Table 1), but the study design was limited to a small number of bacteria and yeasts abundant in the human gut93. This design assumes that MND is driven by a massive dysbiosis that can be detected from the most abundant species in the gut, and is likely to underestimate the true variation in the human microbiome, especially in relation to a complex disease like MND. Considering that MND is a complex disease with a wide spectrum of risk factors, including environmental factors, gut microbiome studies of MND must be supported by comprehensive clinical and environmental phenotypic data. The phenotypic data of current MND studies are mostly limited to case-control status, which underestimates the complexity of MND and the effects of environmental factors (e.g. diet) on the gut microbiome.
One of the simplest examples of MND complexity that is often not included in analyses is the patient’s site of onset. Bulbar-onset MND means that the patient has difficulty swallowing food, which is likely to shift dietary intake towards soup-like food. Given the strong influence of diet on microbiome composition96, this change of diet most likely changes the patient’s gut microbiome. Gut microbiome changes that are not a consequence of MND, but are confounded with disease status through diet change, complicate meaningful interpretation of MND microbiome results. Similarly, the MND stage of progression and dietary intake might confound analyses. Therefore, these factors should be included in MND microbiome analyses, and results should be treated with caution in acknowledgement of non-recorded confounding factors.

4.1.3 Study rationale

Considering the possible role of the gut microbiome, through the gut-brain axis, in other neurodegenerative diseases like Parkinson’s Disease (PD)94,97,98 and Alzheimer’s Disease (AD)99, it is worthwhile to investigate the gut microbiome in MND to gain a better understanding of MND pathology. The current MND gut microbiome studies are limited in terms of sample size and phenotypic data. Here, we incorporate more comprehensive clinical data and metabolic indices, together with demographically matched cases and controls, to get a better estimate of the role of the microbiome in MND development. We chose the marker-gene (16S rRNA) method to provide an overview of the possible role of the microbiome in MND. The results could be a basis for a more comprehensive metagenomic study in the future.

4.2 Material and Methods

4.2.1 Sample collection

This research was conducted as part of existing MND studies at the Royal Brisbane and Women’s Hospital (RBWH) and the Centre for Clinical Research (CCR) within The University of Queensland (UQ), Australia. Participants were enrolled from 12 May 2016 to 13 April 2018, and follow-up assessment concluded in September 2019. Fifty-five patients diagnosed with MND who met the revised El Escorial criteria (for probable or definite MND)100 and 65 healthy controls (matched for age and sex) were recruited from RBWH to provide faecal samples. All participants confirmed that they were of European descent, had no history of diabetes, and had not used antibiotics or probiotics in the last 3 months.

4.2.2 Anthropometric, metabolic, and clinical measures

The measurement of whole-body composition and resting energy expenditure (REE) for this study was described in previous publications by Steyn et al.14 and Ioannides et al.101. Body composition was assessed by air-displacement plethysmography using the BodPod (Cosmed USA Inc.)102. REE was assessed by indirect calorimetry using a Quark RM respirometer (Cosmed USA Inc.)14. For MND patients, the ALSFRS-R and King’s stage were assessed on the same day. Respiratory function was assessed within 1 month of the initial visit and FVC (% predicted) was recorded. A summary of the clinical data and the metabolic assessment measures used in analysis is given in Table 2. The faecal samples were collected using the take-home collection kit Copan 4N6FLOQSwabs. The collected samples were sent to the Australian Centre for Ecogenomics (ACE) sequencing facility.

Table 2. Clinical data and metabolic assessment results used in analysis

| Category | Indices | Explanation |
|---|---|---|
| Clinical | Age at collection | Age of participants at the time of faecal collection |
| Clinical | Diagnostic delay* | Time difference between diagnosis and onset of symptoms (months) |
| Clinical | ALSFRS-R* | ALS Functional Rating Scale Revised; disease progression index |
| Clinical | Delta FRS* | Speed of disease progression; decrease in ALSFRS per month |
| Clinical | FVC* | Forced vital capacity; total air exhaled during breath (cm3) |
| Clinical | Riluzole medication* | Taking Riluzole medication (yes/no) |
| Metabolic | Fasting glucose | Blood sugar level after overnight fasting (mmol/L) |
| Metabolic | Fed glucose | Blood sugar level after eating (mmol/L) |
| Metabolic | Glucose response | Difference between fed and fasting glucose levels (% of fasting) |
| Metabolic | BMI | Body Mass Index |
| Metabolic | Fat mass | Total body fat (kg) |
| Metabolic | Fat Free mass | Lean body mass, total body mass except fat (kg) |
| Metabolic | Measured Eekc | Measured energy expenditure (kcal) |
| Diet | Smoking | Smoking status (non-smoker/ex-smoker/active smoker) |
| Diet | Alcohol intake | Alcohol consumption (units/week) |

*MND case only
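Two of the derived indices in Table 2 follow directly from their stated definitions and can be written as simple functions. This is a sketch with hypothetical function names and example values; it is not the code used to produce the study data.

```python
def glucose_response(fasting_mmol_l, fed_mmol_l):
    """Difference between fed and fasting glucose, as a percentage of fasting glucose."""
    return (fed_mmol_l - fasting_mmol_l) / fasting_mmol_l * 100


def fat_free_mass(total_mass_kg, fat_mass_kg):
    """Fat-free mass: total body mass minus fat mass (kg)."""
    return total_mass_kg - fat_mass_kg


# Hypothetical participant values, for illustration only.
response = glucose_response(fasting_mmol_l=5.0, fed_mmol_l=6.5)
lean_mass = fat_free_mass(total_mass_kg=80.0, fat_mass_kg=20.0)
```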

4.2.3 DNA extraction

DNA was extracted from 50-100 mg of faecal sample with an initial mechanical disruption of the cells, performed on the Powerlyzer 24 (Qiagen #13155) with 0.1 mm glass beads at 2000 RPM for 5 minutes. The resulting lysate was processed using the QIAcube HT automated DNA extraction system (Qiagen #9001793) following the manufacturer’s protocol for the QIAamp 96 PowerFecal QIAcube HT Kit (Qiagen #51531). DNA concentration was measured using a Qubit assay (Life Technologies) and was adjusted to a concentration of 5 ng/µL.

4.2.4 PCR amplification and sequencing

The region of the 16S rRNA gene encompassing V6 to V8 was amplified using the 926F (5’-AAACTYAAAKGAATTGRCGG-3’) and 1392R (5’-ACGGGCGGTGWGTRC-3’) primers103. This universal primer pair amplifies both the small-subunit 18S rRNA gene of eukaryotes and the 16S rRNA gene of prokaryotes. The 16S library was prepared following the Illumina microbiome workflow. The PCR amplicons were purified using Agencourt AMPure XP beads and indexed with 8 bp barcodes using the Illumina Nextera XT 384 sample index kit. The resulting purified indexed amplicons were pooled in equimolar concentrations and sequenced on the MiSeq Sequencing System (Illumina) using paired-end (2 x 300 bp) sequencing with V3 chemistry at the Australian Centre for Ecogenomics according to the manufacturer’s protocol. A minimum of 10,000 raw reads per sample was required as a basic quality control (QC) metric, as well as an overall Q30 of 70% of raw reads.
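The primers above contain degenerate bases (Y, K, R, W), which is what allows one primer to bind the conserved region across many taxa. The sketch below expands the standard IUPAC nucleotide codes into a regular expression and checks whether a sequence is a valid instance of a primer. The primer sequences are those given above; the helper names and the example reads are hypothetical.

```python
import re

# IUPAC nucleotide codes: each degenerate symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}

PRIMER_926F = "AAACTYAAAKGAATTGRCGG"
PRIMER_1392R = "ACGGGCGGTGWGTRC"


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


forward = primer_regex(PRIMER_926F)
reverse = primer_regex(PRIMER_1392R)

# Hypothetical reads: the first begins with a valid instance of 926F (Y=C, K=T, R=A),
# the second has an A at the Y position, which Y (C or T) does not allow.
read_with_primer = "AAACTCAAATGAATTGACGG" + "TTTT"
read_with_mismatch = "AAACTAAAATGAATTGACGG" + "TTTT"

has_primer = forward.match(read_with_primer) is not None
has_mismatch = forward.match(read_with_mismatch) is not None
```

On the forward strand of an amplicon, the reverse primer appears as its reverse complement, which is why `reverse_complement` is included.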

4.2.5 Reads quality control

Raw read QC was undertaken using Trimmomatic104. The sequencing adaptors in raw reads were removed using the HEADCROP option. Reads with low quality (below Q20), even at the local level, were removed by checking read quality using a sliding-window approach (4 bp).
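The two QC operations described above can be sketched as follows. This is an illustration of the general idea (cropping a fixed number of leading bases, and cutting a read where the mean quality of a 4 bp window drops below Q20), not a re-implementation of Trimmomatic. Phred+33 quality encoding is assumed, and the function names and example read are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def headcrop(sequence, quality_string, n_bases):
    """Remove a fixed number of bases from the start of a read."""
    return sequence[n_bases:], quality_string[n_bases:]


def sliding_window_trim(sequence, quality_string, window=4, min_mean_quality=20):
    """Cut the read at the start of the first window whose mean quality is below the threshold."""
    scores = phred_scores(quality_string)
    for start in range(0, len(scores) - window + 1):
        mean_quality = sum(scores[start:start + window]) / window
        if mean_quality < min_mean_quality:
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


# Hypothetical read: eight high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2).
seq = "ACGTACGTACGT"
qual = "IIIIIIII####"

cropped_seq, cropped_qual = headcrop(seq, qual, n_bases=2)
trimmed_seq, trimmed_qual = sliding_window_trim(seq, qual)
```

Because the window mean is used, a single poor base inside an otherwise good window does not trigger a cut; the read is cut only once low quality dominates the window.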

4.2.6 Microbiome analysis using QIIME 2

Reads that passed the basic QC were converted to the QIIME 2 artefact file format for processing by the QIIME 2 pipeline27,28. The Divisive Amplicon Denoising Algorithm version 2 (DADA2)30 was used to de-noise (reduce sequencing errors) and cluster the QC-passed reads. Rarefaction at 4000 reads was chosen, based on the rarefaction curve (the deepest sampling depth with the fewest samples sacrificed), to remove read-count bias in downstream analysis. Samples with fewer than 4000 reads were removed (of 115 samples, 15 were removed, leaving 49 cases and 51 controls). The alpha diversity of each sample was measured using the Shannon index and Chao1 index. The Wilcoxon test was used to test case-control differences for both indices. To visualise the microbial community structure in cases and controls, we calculated beta diversity using weighted UniFrac66 between all pairs of samples and plotted the first two Principal Coordinates. We tested the difference between groups in terms of distance using PERMANOVA105 and in terms of within-group dispersion using PERMDISP105. To check for differences in taxa abundance between cases and controls, we used the SILVA database35 to annotate the clustered reads and tested the group difference using Analysis of Composition of Microbiomes (ANCOM)74. The detailed workflow is shown in Figure 1.

[Figure 1 flow chart: sequencing output; pre-processing (remove primers, demultiplex, quality filter); paired-end de-noising with Divisive Amplicon Denoising Algorithm 2 (DADA2), giving a feature table of clean clustered reads and a table of representative sequences for each cluster; phylogenetic tree construction (FastTree) describing the evolutionary relationship between clustered sequences; rarefaction (subsampling reads of all samples to a similar depth), giving an analysis-ready feature table; alpha diversity analysis (e.g. phylogenetic diversity, Shannon diversity index), beta diversity analysis (weighted/unweighted UniFrac) and taxonomy analysis (taxonomic information from SILVA, OTU differential analysis); interactive visualisation and interpretation (PCoA plots, distance histograms, taxonomy charts, rarefaction plots, network visualisation, clustering, etc.).]

Figure 1. Flow chart of the microbiome analysis workflow using the QIIME 2 pipeline.
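The group comparison on the distance matrix can be illustrated with a minimal PERMANOVA sketch: a pseudo-F statistic computed from squared pairwise distances, with a p-value obtained by permuting group labels. This is a simplified one-factor version written for illustration, not the QIIME 2 implementation. The distances here are absolute differences between hypothetical one-dimensional points rather than weighted UniFrac, which would require a phylogenetic tree.

```python
import random
from itertools import combinations


def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F statistic from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for group in groups:
        members = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=999, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Hypothetical data: four cases and four controls placed on a line.
points = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]
labels = ["case"] * 4 + ["control"] * 4
dist = [[abs(a - b) for b in points] for a in points]

f_statistic, p_value = permanova(dist, labels)
```

A complementary test of within-group dispersion (as done with PERMDISP in the workflow) is needed to check that a significant pseudo-F reflects a difference in location rather than in spread; that test is not sketched here.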

4.2.7 Metagenomic prediction using PICRUSt

The rarefied DADA2 output from QIIME 2 was used as the input for PICRUSt2106, a software tool for functional prediction from 16S microbiome data. We predicted metagenomic functional annotation from the 16S rRNA data using PICRUSt2 with the Kyoto Encyclopedia of Genes and Genomes (KEGG) reference107. Overall functional differences between cases and controls were tested using the Wilcoxon test.
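Conceptually, the last step of this kind of functional prediction is a weighted sum: read counts per taxon are corrected for the number of 16S copies per genome and then multiplied by the predicted number of copies of each gene family. The sketch below shows only that arithmetic. It is not the PICRUSt2 code and omits the phylogenetic placement and gene-content inference that produce the copy numbers; all taxa, function labels and values are hypothetical.

```python
def predict_function_abundance(taxon_counts, marker_copies, gene_copies):
    """Predicted abundance per function:
    sum over taxa of (reads / 16S copies per genome) * predicted gene copies per genome."""
    predicted = {}
    for taxon, reads in taxon_counts.items():
        organisms = reads / marker_copies[taxon]  # correct for multiple 16S copies
        for function, copies in gene_copies[taxon].items():
            predicted[function] = predicted.get(function, 0.0) + organisms * copies
    return predicted


# Hypothetical inputs for one sample.
taxon_counts = {"taxon_A": 400, "taxon_B": 100}           # rarefied read counts
marker_copies = {"taxon_A": 4, "taxon_B": 1}              # 16S copies per genome
gene_copies = {
    "taxon_A": {"function_1": 1, "function_2": 2},        # predicted gene content
    "taxon_B": {"function_2": 1},
}

predicted = predict_function_abundance(taxon_counts, marker_copies, gene_copies)
```

The copy-number correction matters: without it, taxon_A would appear four times more abundant than taxon_B, although both contribute the same number of genomes in this example.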

4.2.8 Microbiome case-control regression analysis

The biome table at Phylum and Class level was prepared by collapsing the OTUs to the Phylum and Class taxonomic levels to ensure sufficient OTU counts and variation in each annotation. For each taxonomic level, we analysed the biome table in several steps. First, a log transformation was applied to the OTU counts and the distribution was plotted for each OTU. The OTUs that showed variation and conformed to a normal distribution were taken forward to the next stages of analysis. Second, to analyse the association between case-control status and OTU counts, logistic regression was applied, with case-control status as the dependent variable and the OTU counts as the independent variable. To check whether OTU counts were also associated with case-control status when adjusted for metabolic covariates (from patient clinical data), we conducted linear regression with OTU counts as dependent variables and case-control status with all covariates as independent variables. The best-fitting combination of covariates was chosen using backward selection. Significant results from this stage were then rechecked by running permutation linear regression with the same model. This step guards against false positives driven by non-normal distributions.
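The permutation check in the final step can be sketched for the simplest case of a single predictor. The example regresses log-transformed counts on case-control status and obtains a p-value by shuffling the status labels, so it does not rely on normally distributed residuals. Covariate adjustment and backward selection are omitted, the `log(count + 1)` transform is an assumption (the text does not state how zeros were handled), and the counts are hypothetical.

```python
import math
import random


def slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(status, log_counts, permutations=999, seed=0):
    """Two-sided permutation p-value for the slope of log counts on case status."""
    rng = random.Random(seed)
    observed = abs(slope(status, log_counts))
    shuffled = list(status)
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if abs(slope(shuffled, log_counts)) >= observed:
            extreme += 1
    return (extreme + 1) / (permutations + 1)


# Hypothetical counts of one taxon in six cases (status 1) and six controls (status 0).
status = [1] * 6 + [0] * 6
counts = [120, 150, 130, 160, 140, 155, 40, 55, 35, 60, 45, 50]
log_counts = [math.log(c + 1) for c in counts]  # assumed pseudo-count of 1

effect = slope(status, log_counts)
p_value = permutation_p_value(status, log_counts)
```

With a binary predictor the slope equals the difference in mean log count between cases and controls, which makes the permuted statistic easy to interpret.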

4.2.9 MND cases only statistical analysis

To investigate the effects of Riluzole medication and disease progression on the patients' microbiome composition, an analysis of MND cases only was performed in a similar way to the case-control analysis. For this analysis, we conducted linear regression with OTU counts as the dependent variable and MND patient-specific clinical indices (Riluzole medication, ALSFRS-R, FVC, and the time between diagnosis and assessment (sample collection)) as independent variables.

To investigate possible correlations between the patients' clinical and metabolic details and overall microbiota diversity, we used Spearman correlation to assess the association between the metabolic and clinical indices collected at the time of metabolic assessment and three alpha diversity measures that capture different levels of shift in bacterial composition. The first alpha diversity measure was the Shannon index, which measures OTU richness (the number of identified OTUs) and heterogeneity (the variation in abundance among OTUs). The second was the Chao1 index, which measures only OTU species richness. The third was the logarithm of the Firmicutes to Bacteroidetes ratio (log(F/B)), which captures large shifts in the relative abundance of the two most abundant bacterial phyla in the gut. We also assessed the longitudinal change in weight, BMI and ALSFRS-R relative to the Shannon, Chao1, and log(F/B) indices. The linear model was:

$$\frac{\Delta Y}{\text{month}} \sim \text{sex} + \text{age} + \text{index}$$

where ΔY/month is the change in Y, which represents ALSFRS-R (disease progression), BMI, or weight (weight loss), calculated from the first and last measurements and expressed in per-month units. The "index" refers to one of the alpha diversity measures (Shannon index, Chao1 index, or log(F/B)). To investigate the effect of the microbiome on patient survival, we used a Cox proportional hazards model (adjusted for age and sex) to estimate the overall effect of the Shannon index, Chao1 index, or log(F/B) on survival time measured from symptom onset or from first visit to death. There were 24 deaths recorded. Disease duration (time since symptom onset) for deceased and surviving patients was 43.89±20.22 vs. 73.46±26.86 months, respectively (p<0.01).
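The three diversity measures and the per-month change defined above can be illustrated with a short sketch. It is not the QIIME 2 implementation: the logarithm base is not stated in the text, so it is exposed as a parameter (natural log by default, an assumption); Chao1 is shown in its commonly used bias-corrected form; and all function names and counts are hypothetical.

```python
import math

def shannon_index(counts, base=math.e):
    """Shannon index H = -sum(p_i * log(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

def chao1_index(counts):
    """Bias-corrected Chao1: observed OTUs plus a term for unseen OTUs,
    estimated from the number of singletons and doubletons."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

def log_fb_ratio(firmicutes, bacteroidetes, base=math.e):
    """Logarithm of the Firmicutes to Bacteroidetes count ratio."""
    return math.log(firmicutes / bacteroidetes, base)

def change_per_month(first, last, months):
    """Change in a measure (e.g. ALSFRS-R, BMI, weight) per month."""
    return (last - first) / months

# Hypothetical OTU counts for one sample.
otu_counts = [5, 3, 1, 1, 2]
print(shannon_index(otu_counts), chao1_index(otu_counts))
print(log_fb_ratio(200, 100))
print(change_per_month(first=40, last=34, months=12))
```

The per-month change returned by `change_per_month` corresponds to the left-hand side of the linear model, with the chosen diversity index entering as a predictor alongside sex and age.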

4.3 Results

4.3.1 Phenotype summary of participants with samples that passed QC

Of the 120 recruited samples, 100 (49 MND cases and 51 controls) passed the QC steps and were used for analysis (as detailed below). The cases and controls were matched for potential confounders such as age, BMI, and alcohol consumption (Table 3). Patient-specific indices, namely bulbar onset, Riluzole medication, time from MND symptom onset (patient self-reported) to assessment, time from the start of diagnosis at hospital to assessment, ALS Functional Rating Scale Revised (ALSFRS-R), mean loss of ALSFRS-R per month (Delta-FRS), and Forced Vital Capacity (FVC), are also shown as an overview of MND progression within the case group (Table 3). As Table 3 shows, there was no significant difference between cases and controls in terms of age, BMI, or alcohol consumption. Bulbar onset and Riluzole medication are shown as patient demographics and as potential covariates for adjusting associations in the analysis. The time differences (onset to assessment and diagnosis to assessment), ALSFRS-R, FVC, and Delta-FRS indicated that most MND cases were in the middle of their disease progression.

Table 3. Demographics of participants whose samples passed QC

                                     Control             MND                 P-value
Number of participants               51                  49

Participant demographics
Male : Female                        30:21 (58% male)    34:15 (69% male)
Age at assessment                    58.1 ± 1.4          60.9 ± 1.3          0.12
BMI                                  26.8 ± 0.6          26.3 ± 0.7          0.62
Alcohol consumption                  13.7 ± 2.4          11.8 ± 2.8          0.61

MND patient specific
Bulbar onset                         -                   14 (28%)
Onset to assessment (months)         -                   28.3 ± 2.8
Diagnosis to assessment (months)     -                   13.1 ± 1.9
ALSFRS-R                             -                   37.5 ± 0.7
Delta FRS                            -                   -0.49 ± 0.05
FVC (% of predicted)                 -                   87.1 ± 3.3
Riluzole medication                  -                   27 (55%)

4.3.2 Sequencing read quality control

The initial quality control using Trimmomatic removed bases with a base-calling quality score below 20 at any point in a read, to minimise incorrect base calls across all reads. This QC step removed ~20% of reads per sample on average, which was an acceptable loss to minimise false-positive OTUs in later stages of analysis. The quality score distribution of all reads that passed the initial Trimmomatic QC is shown in Figure 2. The reverse reads have lower quality scores than the forward reads. This is commonly seen; however, the overall quality scores are still above 20, which is acceptable.

[Figure 2: two panels, Forward Reads (top) and Reverse Reads (bottom); x-axis: read length (0 to 280); y-axis: quality score (0 to 40).]

Figure 2. Quality score distributions for sequencing reads that passed the initial Trimmomatic quality control. The x-axis is read length and the y-axis is the quality score of reads at that position. The forward reads (above) generally have better quality than the reverse reads (below), especially near the ends of the reads.

The reads that passed Trimmomatic QC were de-noised and clustered using the DADA2 algorithm. The de-noised read depths per sample are plotted in rarefaction graphs (Figure 3). The three graphs (Figure 3A, B, and C) show the number of samples or the diversity values (y-axis) at various sequencing depths (x-axis). In Figure 3B, the Shannon diversity index is already stable at 1000 reads per sample, meaning that the proportional OTU composition stabilises by 1000 reads. Although more OTUs might still be detected at increasing depth (Figure 3C), the new OTUs are expected to be proportionally small and so would not change the overall composition. However, when deciding the rarefaction cut-off, it is important to preserve as many observed OTUs and samples as possible (Figure 3A). We decided to use 4000 reads as the cut-off because the Shannon index had stabilised and the number of observed OTUs was no longer changing substantially at this depth. At this threshold, we were able to keep 100 (51 controls and 49 cases) of the original 120 samples for further analysis.

[Figure 3: (A) number of samples, (B) Shannon index, and (C) observed OTU, each plotted against sequence depth (0 to 10000) for the Control and ALS groups.]

Figure 3. Rarefaction graphs. (A) Effect of rarefaction on the number of samples. (B) Effect of rarefaction on alpha diversity (Shannon index). (C) Effect of rarefaction on observed OTUs. We set the cut-off at 4000 reads because the observed OTU and Shannon index curves were already flattening at this depth, while maximising the number of samples retained.
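The effect of a rarefaction cut-off can be sketched in a few lines. This is an illustrative stand-in for the QIIME 2 rarefaction step, not its implementation: samples with fewer reads than the chosen depth are dropped, and the remaining samples are subsampled without replacement to exactly that depth. The table contents, sample names and function names are hypothetical.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample.

    Returns None when the sample has fewer than `depth` reads, which is
    how samples are lost at a given cut-off.
    """
    if sum(counts) < depth:
        return None
    # Expand the counts into one OTU label per read, then draw reads.
    reads = [otu for otu, count in enumerate(counts) for _ in range(count)]
    rarefied = [0] * len(counts)
    for otu in rng.sample(reads, depth):
        rarefied[otu] += 1
    return rarefied

def rarefy_table(table, depth, seed=1):
    """Rarefy every sample in a {sample_id: counts} table, dropping shallow ones."""
    rng = random.Random(seed)
    kept = {}
    for sample_id, counts in table.items():
        rarefied = rarefy(counts, depth, rng)
        if rarefied is not None:
            kept[sample_id] = rarefied
    return kept

# Hypothetical table: one deep sample and one shallow sample.
table = {"sample_1": [3000, 2000, 1000], "sample_2": [100, 50, 10]}
rarefied_table = rarefy_table(table, depth=4000)
print(rarefied_table)
```

Raising the depth keeps more reads (and potentially more rare OTUs) per sample but drops more samples, which is the trade-off shown across the three panels of Figure 3.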

4.3.3 Alpha and Beta diversity

Alpha diversity calculated using the Shannon index showed no significant difference between cases and controls (Figure 4A and B; Wilcoxon test P-value = 0.51). A similar trend was observed for the Chao1 index, with no significant difference between cases and controls (Figure 4C and D).


Figure 4. Alpha diversity comparison between controls and MND patients. Distribution comparison based on the Shannon index, illustrated with a scatter plot (A) and a density plot (B). Distribution comparison based on the Chao1 index, illustrated with a scatter plot (C) and a density plot (D). No significant difference was observed.

A principal coordinates analysis (PCoA) plot of beta diversity calculated using weighted UniFrac (Figure 5) shows that the beta diversity profiles of MND cases were generally more widely dispersed than those of controls. All 4 outliers were MND cases and, considering the alpha diversity of these outliers, it is unlikely that their outlying beta diversity profiles were driven by differences in alpha diversity (Shannon index). The difference in distances between controls and MND cases was tested using Permutational Multivariate Analysis of Variance (PERMANOVA) and was significant (P-value = 0.01). However, we found that this significance was driven by the 4 case outliers: the PERMANOVA test without these 4 outliers was not significant (P-value = 0.46). We also tested the difference in dispersion between cases and controls using the homogeneity of dispersion test (PERMDISP), and the result was significant (P-value = 0.004). Again, the significant result was driven by the 4 case outliers; the PERMDISP result without them was not significant (P-value = 0.16).


Figure 5. Beta diversity PCoA plot for MND patients (red) and controls (green). The size of each dot represents alpha diversity measured by the Shannon index. The 4 MND patient outliers are marked with red sample names. We observed a significant difference between cases and controls using PERMANOVA and PERMDISP, with the significance driven by the 4 outliers.
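The logic of the PERMANOVA test reported above can be sketched as a pseudo-F statistic computed from a distance matrix, with a P-value obtained by permuting the group labels. This is a minimal one-factor illustration, not the QIIME 2 implementation; the distances below are hypothetical one-dimensional distances rather than weighted UniFrac, and the function names are illustrative.

```python
import random

def pseudo_f(dist, labels):
    """One-factor PERMANOVA pseudo-F from a square distance matrix."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares, group by group.
    ss_within = 0.0
    for group in groups:
        members = [i for i in range(n) if labels[i] == group]
        pairs = ((i, j) for a, i in enumerate(members) for j in members[a + 1:])
        ss_within += sum(dist[i][j] ** 2 for i, j in pairs) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permanova(dist, labels, n_perm=999, seed=1):
    """Return the observed pseudo-F and its permutation P-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical samples placed on a line; distance = absolute difference.
positions = [0.0, 0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2, 2.3, 2.4]
labels = ["control"] * 5 + ["MND"] * 5
dist = [[abs(a - b) for b in positions] for a in positions]

f_stat, p_value = permanova(dist, labels)
print(f_stat, p_value)
```

Because the statistic depends on squared distances, a few samples lying far from everyone else can dominate both sums of squares, which is consistent with the result changing when the 4 outliers are removed.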

4.3.4 Taxonomy analysis

In the taxonomy analysis at the phylum level (Figure 6), we found no significant case-control difference in the proportional abundance of the two major bacterial phyla (Wilcoxon test P-value = 0.28 for Firmicutes and P-value = 0.74 for Bacteroidetes). Proteobacteria showed a small but non-significant difference (Wilcoxon test P-value = 0.79).

[Figure 6: stacked bar charts for the Control and MND groups. (A) OTU (proportion of total counts): Bacteroidetes, Firmicutes, Actinobacteria, Proteobacteria, Other. (B) OTU (proportion of counts not including OTUs assigned to Actinobacteria, Proteobacteria, Bacteroidetes and Firmicutes): Elusimicrobia, Candidate division TM7, Spirochaetae, Opisthokonta, Lentisphaerae, Synergistetes, SHA 109, Fusobacteria, Bacteria_other, Cyanobacteria, SAR, Verrucomicrobia, Tenericutes, Unassigned, Euryarchaeota.]

Figure 6. OTU taxonomy histogram at the phylum level for MND patients and controls. The overall phylum distribution of each group (A) and a zoomed-in view of the less abundant phyla (B). There was no significant difference observed between cases and controls.

The taxonomy analysis also showed the taxonomic composition of the four outliers detected in the beta diversity analysis (Figure 7). For all four individuals, the differences in distribution were due to an increased count of unassigned reads. For one patient (R2368), we observed a greater prevalence of OTUs assigned to Verrucomicrobia and Euryarchaeota compared to the OTU composition profile of other cases. We did not observe any outlying anthropometric, metabolic, or clinical measures for these 4 patients (supplementary table 1).


Figure 7. OTU composition at the phylum level for cases and controls (left box) compared with the OTU composition of the four beta-diversity outliers (right box). The outlying profiles are most likely attributable to an increased prevalence of unassigned reads.

4.3.5 Metagenomic prediction

The metagenomic prediction analysis used PICRUSt2 to predict the functional genes corresponding to each bacterial (OTU) count. The predicted genes were categorised based on KEGG functional annotations. There was no significant difference in functional annotation between cases and controls (Figure 8).

[Figure 8: panels (C) and (D) show functional annotation (proportion of identified functional annotations) for the Control and MND groups. Legend entries include Metabolism, Cellular Processes, Human Diseases, Genetic/Environmental Information Processing, and Unclassified categories.]

Figure 8. PICRUSt2 prediction results with KEGG functional enrichment. No significant difference was found between case and control groups.

4.3.6 MND case-control regression analysis

We used linear regression analysis to investigate the extent to which the relative abundance of taxonomic groups (at both phylum and class level) depends on known clinical and metabolic indices. In general, we found no significant differences in microbiome composition between cases and controls. The most significant nominal difference (permutation linear regression P-value = 0.026) was for phylum Firmicutes after removing possible effects of glucose-related factors (glucose response, fed glucose, and fasting glucose). At the class level, Clostridia (which belongs to phylum Firmicutes) showed a nominal difference (permutation linear regression P-value = 0.031) after adjusting for fed glucose. These differences are not significant at a Bonferroni threshold that accounts for the 80 tests performed.
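As a worked illustration of the multiple-testing comparison above, the sketch below computes a Bonferroni threshold for 80 tests and compares the two nominal P-values against it. The family-wise significance level of 0.05 is an assumption, since the level used is not stated in the text; the function name is illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# Assumed family-wise alpha of 0.05; 80 tests as reported in the text.
threshold = bonferroni_threshold(0.05, 80)

# Nominal permutation P-values reported in the text.
nominal_p_values = {"Firmicutes": 0.026, "Clostridia": 0.031}
significant = {name: p < threshold for name, p in nominal_p_values.items()}

print(threshold)
print(significant)
```

Under this assumption the per-test threshold is 0.05 / 80 = 0.000625, so neither nominal P-value passes the correction.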

4.3.7 MND cases specific regression analysis

We used linear regression to assess the relationship between faecal microbiome composition and variables recorded in cases. We did not find any significant association between case-only indices and OTU counts at the phylum level. In the class-level analysis, we found nominal associations of three OTU classes (Negativicutes, Clostridia, and Alphaproteobacteria) with diagnostic delay; however, these associations were not significant given the level of multiple testing.

4.3.7 Association of gut microbiota composition with baseline clinical features

The correlation analysis between indices of microbiome diversity and patient clinical characteristics showed that the alpha diversity measures (Shannon and Chao1 indices) were not associated with the site of disease onset (Figure 9A and D), severity of disease on the day of metabolic assessment (ALSFRS-R, Table 4), or King's stage (Figure 9B and E). There was no correlation between alpha diversity and any anthropometric or metabolic measure (Table 4). We also found no difference in alpha diversity (Shannon and Chao1 indices) between normometabolic and hypermetabolic patients with MND (Figure 9C and F). There was no correlation between log(F/B) and the clinical or anthropometric measures of patients with MND. We found no significant difference in log(F/B) between patients with spinal and bulbar onset disease (Figure 9G). A similar trend was observed for the progression indicators, namely ALSFRS-R scores (Table 5) and King's stage (Figure 9H). Moreover, there was no correlation between log(F/B) and anthropometric measures, REE, or metabolic index (Table 5), or the metabolic status of patients with MND (Figure 9I).


Figure 9. (A-C) Shannon index, (D-F) Chao1 index, and (G-I) logFirmicutes/Bacteroidetes (F/B) ratio relative to (A, D and G) site of onset, (B, E and H) King’s stage, or (C, F and I) metabolic status of patients with MND.


Table 4. Correlations between Shannon and Chao1 indices and patient information collected at the time of metabolic assessment

Measurement                            Shannon index                        Chao1 index
                                       Spearman correlation     P-value     Spearman correlation     P-value
Demographic
Age (years)                            0.14 (-0.16 to 0.41)     0.34        -0.14 (-0.41 to 0.16)    0.36
Anthropometric measures
Weight (kg)                            -0.08 (-0.36 to 0.22)    0.6         -0.17 (-0.44 to 0.13)    0.25
BMI (kg/m2)                            -0.01 (-0.31 to 0.29)    0.94        -0.52 (-0.38 to 0.21)    0.52
Hip circumference (cm)                 0.08 (-0.22 to 0.38)     0.59        -0.07 (-0.36 to 0.24)    0.64
Waist circumference (cm)               -0.01 (-0.31 to 0.29)    0.94        -0.12 (-0.41 to 0.19)    0.43
Fat-free mass (kg)                     -0.08 (-0.37 to 0.23)    0.62        -0.10 (-0.39 to 0.20)    0.51
Fat mass (%)                           0.16 (-0.14 to 0.43)     0.28        0.03 (-0.27 to 0.32)     0.85
Metabolic measures
Resting energy expenditure (kcal/day)  -0.08 (-0.37 to 0.22)    0.6         -0.10 (-0.39 to 0.21)    0.51
Metabolic index (%)                    0.05 (-0.26 to 0.35)     0.75        0.03 (-0.28 to 0.33)     0.85
Clinical metrics
Time since symptom onset (months)      -0.19 (-0.46 to 0.10)    0.19        -0.12 (-0.39 to 0.18)    0.43
Diagnostic delay (months)              -0.18 (-0.44 to 0.12)    0.22        -0.08 (-0.36 to 0.21)    0.57
ALSFRS-R                               0.13 (-0.17 to 0.40)     0.39        0.18 (-0.11 to 0.45)     0.21
ALSFRS-R, bulbar subscores             0.10 (-0.20 to 0.38)     0.5         0.11 (-0.18 to 0.39)     0.43
ALSFRS-R, resp subscores               0.15 (-0.15 to 0.42)     0.31        0.19 (-0.11 to 0.45)     0.2
ΔFRS                                   -0.12 (-0.40 to 0.17)    0.41        -0.01 (-0.30 to 0.28)    0.95
FVC (% predicted)                      -0.00 (-0.30 to 0.30)    0.99        -0.00 (-0.30 to 0.30)    1


Table 5. Correlations between Firmicutes and Bacteroidetes (logF/B) ratio and patient information collected at the time of metabolic assessment

Measurement                            Spearman correlation     P-value
Demographics
Age (years)                            0.08 (-0.22 to 0.36)     0.59
Anthropometric measures
Weight (kg)                            0.06 (-0.24 to 0.35)     0.7
BMI (kg/m2)                            0.10 (-0.20 to 0.38)     0.5
Hip circumference (cm)                 0.10 (-0.21 to 0.39)     0.53
Waist circumference (cm)               0.00 (-0.30 to 0.31)     0.98
Fat-free mass (kg)                     -0.03 (-0.32 to 0.27)    0.85
Fat mass (%)                           0.12 (-0.18 to 0.41)     0.41
Metabolic measures
Resting energy expenditure (kcal/day)  0.06 (-0.24 to 0.35)     0.69
Metabolic index (%)                    0.11 (-0.20 to 0.40)     0.46
Clinical metrics
Time since symptom onset (months)      -0.25 (-0.50 to 0.04)    0.08
Diagnostic delay (months)              -0.15 (-0.42 to 0.14)    0.3
ALSFRS-R                               -0.02 (-0.30 to 0.28)    0.92
ALSFRS-R, bulbar subscores             -0.07 (-0.35 to 0.22)    0.64
ALSFRS-R, resp subscores               0.03 (-0.26 to 0.32)     0.82
ΔFRS                                   -0.23 (-0.48 to 0.07)    0.12
FVC (% predicted)                      -0.21 (-0.48 to 0.10)    0.17
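The Spearman correlations in Tables 4 and 5 are rank-based, so they measure monotonic rather than strictly linear association. A minimal sketch of the calculation follows: each variable is converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is taken. The function names and data are hypothetical, and the confidence intervals shown in the tables are not reproduced here.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical Shannon indices and BMI values for five patients.
shannon = [3.1, 4.2, 3.8, 4.9, 2.7]
bmi = [24.0, 26.1, 27.5, 30.2, 22.3]
rho = spearman(shannon, bmi)
print(rho)
```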

4.3.8 Association of gut microbiota composition with progression of disease

Disease progression, measured as the per-month change in ALSFRS-R (regression estimate b=0.03, p=0.36), weight (b=0.05, p=0.87), and BMI (b=0.01, p=0.72), showed no significant association with the diversity of the MND patient microbiome (measured by the Shannon index) in a regression analysis adjusted for age and sex. A similar trend was observed for the Chao1 index relative to weight (b=-6.9×10^-6, p=0.90), BMI (b=-3.6×10^-6, p=0.74), and ALSFRS-R (b=1.4×10^-5, p=0.36), and for log(F/B) versus weight (b=0.17, p=0.76), BMI (b=0.06, p=0.61), and ALSFRS-R (b=0.09, p=0.33).

4.3.9 Association of gut microbiota composition with survival

We performed a Cox proportional hazards survival analysis within cases (MND patients) to test the effect of the microbiome diversity indices. We found that patients with greater richness and heterogeneity of the microbiome (measured by the Shannon index) had worse survival from the time of symptom onset (hazard ratio=2.19, CI=1.06 to 4.78, LRT p=0.03), which means that every unit increase in the Shannon index was associated with a 119% increase in the risk of death (Figure 10A). A similar but less significant result was found when we used survival time from sample collection (hazard ratio=1.95, CI=0.93 to 4.13, LRT p=0.06), corresponding to a 95% increase in the risk of death for every unit increase in the Shannon index. On the other hand, we found no association between the Chao1 index (which only considers richness) and survival from the time of symptom onset (hazard ratio=1.01, CI=0.99 to 1.01, LRT p=0.19; Figure 10B). Similarly, the Chao1 index was not associated with survival from the time of sample collection (hazard ratio=1.01, CI=0.99 to 1.02, LRT p=0.33). A higher log(F/B) was associated with worse survival from the time of symptom onset (hazard ratio=2.89, CI=1.21 to 6.88, LRT p=0.02), which means that every unit increase in log(F/B) was associated with a 189% increase in the risk of death (Figure 10C). A similar but weaker effect of log(F/B) was observed when considering the time from sample collection (hazard ratio=2.33, CI=1.01 to 5.36, LRT p=0.046).


Figure 10. Survival probability of MND patients relative to the (A) Shannon index, (B) Chao1 index, and (C) logFirmicutes/Bacteroidetes (F/B) ratio. Crude Kaplan-Meier curves represent time since symptom onset. A higher Shannon index (HR = 2.19, p=0.03) and a higher logF/B ratio (HR = 2.89, p=0.02) are associated with worse survival relative to time since symptom onset. Q1, lower quartile; Q2-3, quartile 2-3; Q4, upper quartile.
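The percentage increases in risk quoted for the Cox models follow directly from the hazard ratios: a hazard ratio HR per unit of a predictor corresponds to a (HR - 1) × 100% change in the hazard, and HR equals the exponential of the Cox regression coefficient. The short sketch below reproduces that arithmetic for the hazard ratios reported in the text; the function names are illustrative.

```python
import math

def percent_change_in_hazard(hazard_ratio):
    """Percentage change in the hazard per unit increase in the predictor."""
    return (hazard_ratio - 1.0) * 100.0

def cox_coefficient(hazard_ratio):
    """Cox regression coefficient (log hazard ratio) implied by a hazard ratio."""
    return math.log(hazard_ratio)

# Hazard ratios reported in the text.
reported = {
    "Shannon index, from symptom onset": 2.19,
    "Shannon index, from sample collection": 1.95,
    "log(F/B), from symptom onset": 2.89,
}
for label, hr in reported.items():
    print(f"{label}: HR={hr}, change in hazard={percent_change_in_hazard(hr):.0f}%, "
          f"coefficient={cox_coefficient(hr):.2f}")
```

Because the model is multiplicative, a two-unit increase in a predictor corresponds to HR squared rather than to twice the percentage change.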

4.4 Discussion

In a series of analyses, we found no evidence that faecal microbiome composition differs between MND cases and controls, or that it depends on measured case-only variables. Matching healthy controls to cases on sex and age is important to ensure that microbial or diversity measures associated with MND are not confounded by other known factors. We recruited matched healthy controls with close attention to previously reported sources of confounding, such as age, BMI, and alcohol consumption [108] (Table 3). For cases, we have detailed clinical data (Table 3), such as site of onset, time from onset to assessment, time from diagnosis to assessment, ALSFRS-R, Delta FRS, Riluzole medication, and forced vital capacity (FVC). These measures provide an overview of the progression stages of our MND cases and are important to consider, since some clinical indices and progression stages could affect the gut microbiome. For example, individuals with bulbar onset have difficulty swallowing, which directly affects food intake and might therefore be expected to affect gut microbiome composition. Although the gut microbiome composition of MND patients at different progression stages has not yet been studied longitudinally, it is likely that disease progression could change the composition of the gut microbiome, since progression of MND also affects the peristaltic nerve of the gut [109]. The key point here is the importance of longitudinal study, since we did not observe an association between microbial alpha diversity and site of onset at the time of sampling (Figure 9A, D, and G).

In general, our microbiome data were of high quality, as evidenced by the number of samples that passed stringent QC (100 out of 120). We still lost 20% of reads (on average) at the Trimmomatic read QC step, but this read loss was within the range expected for good sequencing quality.
One feature to note is that the quality scores of reads from the reverse direction are worse than those from the forward direction. Currently, there is no systematic review addressing this problem. Likely explanations are overclustering, which affects the second round of bridge amplification in Illumina sequencing technology, or greater degradation of the read 2 (reverse) primer than of the forward primer.

Although we found no significant difference in microbiome alpha diversity measures between case and control groups, the beta diversity profile (Figure 5) suggested that the MND group has more variation, indicating a more diverse microbiome profile within the group. This more dispersed beta diversity profile, especially of the 4 case outliers, might be driven by the way MND patients cope with limb or bulbar impairment, and by dietary adjustments that are prescribed to MND patients to slow disease progression. The observation of more variable beta diversity without differences in alpha diversity indicates a subtle compositional shift in some low-count OTUs. Even though these low-count OTUs account for a small number of the total reads (and thus are not strong enough to be detected by alpha diversity), they could still be relevant to disease pathology. We also note that the difference in the beta diversity measure was driven by 4 outliers among the MND cases. Taxonomic analysis showed that these four outliers had a greater proportion of unassigned reads, Verrucomicrobia, and Euryarchaeota, suggesting dysbiosis compared to other patients and matched controls (Figure 7). This observation could explain why some other MND microbiome studies with small sample sizes (<10 samples) [90,91] reported a significant difference between MND patients and controls. It is likely that selection bias within these smaller studies drove the significantly different results. For example, Rowin et al. (2017) recruited MND patients who had gastrointestinal symptoms, which could account for a distinct microbiome profile or even dysbiosis (without necessarily being driven by MND development).

To detect OTUs with different proportional representation between case and control groups, we conducted a differential taxonomy analysis. We did not find any statistically different OTU proportions, but Figure 6 suggests that Proteobacteria is depleted in the MND group (although not statistically significantly). This observation is counter-intuitive for the MND group, since most studies link enriched Proteobacteria with increased risk of inflammation and various diseases, including inflammatory bowel disease (IBD), obesity, type 2 diabetes, and bowel cancer [110,111].
This seemingly healthier gut in the MND group might be attributable to prescribed diets, which most neurologists would suggest to patients diagnosed with MND, since a change of diet is one of the treatments used to prolong patient survival [5,13,112].

From the survival analysis, we observed a suggestive trend that patients with a greater Shannon index and log(F/B) ratio had an increased risk of death when considering time from symptom onset. Although these observations can only be considered suggestive given our relatively small sample size (49 cases with 24 deaths), they challenge the expectation that higher diversity and richness are associated with better health. A similar trend was found by Burberry et al. (2020), in which a series of antibiotic treatments in C9orf72 mouse models reduced gut bacterial diversity and improved survival [113]. However, the use of antibiotics to reduce microbiome diversity and prolong MND patient survival is not always applicable in human clinical settings. A study by Gordon et al. (2007) showed that Minocycline antibiotic treatment in a clinical trial was harmful to MND patients [114].

Given our interest in MND and metabolism, we might link this finding with obesity. In MND, obesity was associated with slower disease progression [115,116] and was linked with low microbiome diversity and lower F/B ratio [108,117]. Thus, we speculate that the association of a higher risk of death with a greater Shannon index and log(F/B) ratio might be secondary to dietary intake that leads to weight gain, or to a prescribed high-calorie diet (part of MND treatment). Given that we did not assess the dietary intake of our study population, further study is needed to confirm this.

In relation to the protective effects of obesity (higher weight and BMI) in MND, one question remains: whether metabolic change (mostly toward hypermetabolism) in MND is driven by microbiome change. According to a previous metabolic study, the REE of obese and normal-weight individuals was affected by the proportion of Firmicutes and Bacteroidetes (F/B ratio) [118]. However, we did not find this association in our data. We tested the association between log(F/B) and REE and found only a weak association; the same trend was found when we tested the difference in log(F/B) between hypermetabolic and normometabolic MND patients, suggesting that whole-body metabolic changes in MND are unlikely to be associated with altered microbiome composition. In line with this observation, we also found no significant difference in alpha or beta diversity between hypermetabolic and normometabolic MND patients.

Weight loss in MND patients has been an accurate predictor of disease progression. Recent studies suggested that loss of appetite is one of the contributing factors to weight loss [119,120]. However, considering that the gut microbiome has a significant role in nutrient breakdown and absorption, it is possible that a change in microbiome composition is a driver of this weight loss. Studies by Ley et al. (2006) and Turnbaugh et al. (2006) reported that the proportion of Bacteroidetes was positively correlated with weight loss [121,122]. However, we found no association between microbial composition (log(F/B), Shannon index, Chao1 index) and the weight or BMI of MND patients. A similar trend was observed when we tested the association between microbial composition and the change in weight and BMI over time. Therefore, our data suggest that weight loss in MND patients (in our study) was unlikely to be driven by microbial composition or by an increased Bacteroidetes proportion as previously reported [121,122].

However, we still observed nominal effects for one of the most abundant phyla in the human gut (Firmicutes) in our population. Even though we did not collect dietary information from MND patients, we have metabolic information from blood for glucose metabolism indicators (fasting glucose, fed glucose, and glucose resistance). Our regression analyses suggested that adjusting for glucose-related factors might be important for detecting differences between cases and controls in Firmicutes counts. The regression direction was negative (permutation linear regression estimate = -0.17; linear regression P-value = 0.03), which suggests that the Firmicutes count in MND cases was lower than in controls after removing glucose-related factors. This finding was consistent with the result from the class-level regression analysis: Clostridia (also part of phylum Firmicutes) was found to be nominally significantly lower in MND cases (permutation linear regression estimate = -0.19; linear regression P-value = 0.05) after adjusting for fed glucose. This observation is consistent with the earlier observation for Proteobacteria, indicating that the gut in MND cases was healthier than in controls. However, interpretation is difficult, as this observation could also be driven by the intervention of a healthier diet or by impaired glucose metabolism in MND cases. Our work provides hypotheses for follow-up and testing in future, larger studies.

The main limitation of our regression analysis was that we could only study common gut microbes; only 4 phyla (Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria) and 8 classes (Coriobacteriia, Bacteroidia, Clostridia, Erysipelotrichia, Negativicutes, Betaproteobacteria, Deltaproteobacteria) had enough usable variation between people to be included in the analysis (sample size of 100 individuals). With larger sample sizes and deeper sequencing, it will be possible to study more subtle changes at lower taxonomic levels (order, family, genus). Compared with other studies of the MND microbiome in humans that used whole-genome metagenomics, our results were similar in showing Roseburia depletion in MND patients [123], although the difference in Roseburia in our sample was not statistically significant.
In the human gut, Roseburia are obligate anaerobic, Gram-positive bacteria that metabolise dietary components to produce short-chain fatty acids, including butyrate124. They are implicated in colonic motility and the maintenance of immunity, and are thought to have anti-inflammatory properties125. As reported by Van Es et al. (2017), the understanding of the role of inflammatory processes in the pathophysiology of MND is still evolving126, and this observation might contribute to these efforts.

The importance of our study, compared to published studies, is that we used the largest cohort of MND patients with matched controls. We also used the latest 16S rRNA microbiome analysis pipeline for more robust OTU calling, rigorous quality control, and a stringent statistical approach that always controlled false positives using multiple testing corrections. Our analysis includes the standard QIIME2 analyses: case-control taxonomic comparison using ANCOM, alpha diversity (Shannon index and Chao1), beta diversity (weighted UniFrac), log(F/B), and regression analysis. From a methodological point of view, our study took advantage of rich clinical measures (ALSFRS-R, King's stage, riluzole medication, site of onset, FVC) and metabolic measures (REE, weight, BMI, glucose-related measurements), using regression analysis to investigate the relationship between the gut microbiome of MND patients and all of these variables. Therefore, our study helps to fill the gap in explaining the relationship between the gut microbiome and metabolic factors in MND patients.

Our study has several limitations. The first limitation is that we only used 16S rRNA marker-gene methods, which might not be detailed enough to analyse different OTUs at a finer taxonomic level (such as species). This method is also unable to provide functional information about the microbial communities. Even though functional enrichment can be predicted using PICRUSt2, it still relies on existing databases that are built from gut microbiome metagenomic data of healthy populations. Without more precise OTU identification (such as species or even subspecies), it is hard to obtain enough functional detail for meaningful interpretation. A second limitation is that our cohort is still relatively small to capture the variation in the MND microbial profile, especially considering that MND has heterogeneous manifestations, which can amplify changes in the microbiome through differences in patients' dietary habits due to dysphagia or loss of appetite. Finally, dietary intake information for our population would help to disentangle the effects of diet from the MND-related metabolic effects on the gut microbiome profile.
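The composition summaries used throughout this chapter (Shannon index, Chao1, and log(F/B)) can be illustrated with a minimal sketch. This is not the QIIME2 implementation: the function names are hypothetical, the natural logarithm is an assumption (the log base used in the analyses is not restated here), and Chao1 is shown in its bias-corrected form.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts.
    Natural log is an assumption; other tools may default to log base 2."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton taxa."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def log_fb_ratio(firmicutes, bacteroidetes):
    """Log ratio of Firmicutes to Bacteroidetes counts (natural log assumed)."""
    return math.log(firmicutes / bacteroidetes)

# Hypothetical per-sample taxon counts, for illustration only.
example_counts = [5, 1, 1, 2, 0]
print(shannon_index(example_counts), chao1(example_counts))
print(log_fb_ratio(200, 100))
```

A perfectly even community of four taxa gives a Shannon index of ln(4), which is a convenient sanity check for any implementation.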

4.5 Conclusion

We have conducted a comprehensive study of the gut microbiome in MND. We found no evidence of a difference in microbial profile between MND patients and the healthy control group, except in the beta diversity analysis, where the difference was driven by four outlying MND patients. We tested the difference in beta diversity using permutation analyses (PERMANOVA and PERMDISP), which suggested that the difference was mainly driven by the larger variation in the profiles of MND cases, especially the four outliers, and was unlikely to have arisen by chance. We also found suggestive evidence that higher diversity and richness, and the log(F/B) ratio, were associated with earlier death. This relationship appears to be independent of any metabolic and clinical features of MND. Through a series of regression and correlation analyses, we demonstrated the utility of the metabolic and clinical data of our population, both for checking potential confounding factors and for adjusting the regressions. We were able to detect a nominal association of Firmicutes with case-control status after adjusting for glucose-related factors, which could be interesting from a functional perspective. We suggest that larger longitudinal studies that take into account the complex interactions between microbiota, dietary habits, and the clinical characteristics of patients are needed to clarify the role of the gut microbiome in MND.
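The permutation logic behind the PERMANOVA test mentioned above can be sketched as follows. This is a minimal pure-Python illustration of the pseudo-F statistic with label permutation, not the QIIME2 or vegan implementation; the function names, the toy distance matrix, and the group labels are all hypothetical.

```python
import random
from itertools import combinations

def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F from a symmetric distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, one term per group.
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permanova(dist, labels, n_perm=999, seed=0):
    """Return the observed pseudo-F and a permutation P-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute group labels, keep distances fixed
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy one-dimensional "samples"; real analyses would use weighted UniFrac distances.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(a - b) for b in points] for a in points]
toy_labels = ["case", "case", "control", "control"]
print(permanova(toy_dist, toy_labels, n_perm=199, seed=1))
```

Because PERMANOVA is sensitive to differences in within-group dispersion as well as in location, a dispersion test such as PERMDISP is a natural companion, which is the reason both tests are reported together in this chapter.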

4.6 References

1. Brown, R. H. & Al-Chalabi, A. Amyotrophic Lateral Sclerosis. http://dx.doi.org/10.1056/NEJMra1603471 http://www.nejm.org/doi/10.1056/NEJMra1603471 (2017) doi:10.1056/NEJMra1603471.
2. Armon, C. Environmental Risk Factors for Amyotrophic Lateral Sclerosis. NED 20, 2–6 (2001).
3. Al-Chalabi, A. et al. Analysis of amyotrophic lateral sclerosis as a multistep process: a population-based modelling study. Lancet Neurol 13, 1108–1113 (2014).
4. Eykens, C. & Robberecht, W. The genetic basis of amyotrophic lateral sclerosis: recent breakthroughs. Advances in Genomics and Genetics https://www.dovepress.com/the-genetic-basis-of-amyotrophic-lateral-sclerosis-recent-breakthrough-peer-reviewed-article-AGG (2015) doi:10.2147/AGG.S57397.
5. Dupuis, L., Pradat, P.-F., Ludolph, A. C. & Loeffler, J.-P. Energy metabolism in amyotrophic lateral sclerosis. Lancet Neurol 10, 75–82 (2011).
6. Taylor, J. P., Brown Jr, R. H. & Cleveland, D. W. Decoding ALS: from genes to mechanism. Nature 539, 197–206 (2016).
7. Westfall, S. et al. Microbiome, probiotics and neurodegenerative diseases: deciphering the gut brain axis. Cell. Mol. Life Sci. 74, 3769–3787 (2017).
8. Ghaisas, S., Maher, J. & Kanthasamy, A. Gut microbiome in health and disease: linking the microbiome-gut-brain axis and environmental factors in the pathogenesis of systemic and neurodegenerative diseases. Pharmacol Ther 158, 52–62 (2016).
9. Marizzoni, M., Provasi, S., Cattaneo, A. & Frisoni, G. B. Microbiota and neurodegenerative diseases. https://www.ingentaconnect.com/content/wk/wco/2017/00000030/00000006/art00012 (2017) doi:10.1097/WCO.0000000000000496.
10. Thursby, E. & Juge, N. Introduction to the human gut microbiota. Biochem J 474, 1823–1836 (2017).
11. Mazidi, M., Rezaie, P., Kengne, A. P., Mobarhan, M. G. & Ferns, G. A. Gut microbiome and metabolic syndrome. Diabetes & Metabolic Syndrome: Clinical Research & Reviews 10, S150–S157 (2016).

12. Sanz, Y., Olivares, M., Moya-Pérez, Á. & Agostoni, C. Understanding the role of gut microbiome in metabolic disease risk. Pediatr Res 77, 236–244 (2015).
13. Ngo, S. T. & Steyn, F. J. The interplay between metabolic homeostasis and neurodegeneration: insights into the neurometabolic nature of amyotrophic lateral sclerosis. Cell Regeneration 4, 5 (2015).
14. Steyn, F. J. et al. Hypermetabolism in ALS is associated with greater functional decline and shorter survival. J. Neurol. Neurosurg. Psychiatry 89, 1016–1023 (2018).
15. Erny, D. et al. Host microbiota constantly control maturation and function of microglia in the CNS. Nat. Neurosci. 18, 965–977 (2015).
16. Erny, D. & Prinz, M. Microbiology: Gut microbes augment neurodegeneration. Nature 544, 304–305 (2017).
17. Jäkel, S. & Dimou, L. Glial Cells and Their Function in the Adult Brain: A Journey through the History of Their Ablation. Front Cell Neurosci 11, 24 (2017).
18. Thion, M. S. et al. Microbiome Influences Prenatal and Adult Microglia in a Sex-Specific Manner. Cell 172, 500-516.e16 (2018).
19. Heintz-Buschart, A. & Wilmes, P. Human Gut Microbiome: Function Matters. Trends in Microbiology 26, 563–574 (2018).
20. Selber-Hnatiw, S. et al. Human Gut Microbiota: Toward an Ecology of Disease. Front. Microbiol. 8, (2017).
21. Manaka, A., Tokue, Y. & Murakami, M. Comparison of 16S ribosomal RNA gene sequence analysis and conventional culture in the environmental survey of a hospital. J Pharm Health Care Sci 3, (2017).
22. Knight, R. et al. Best practices for analysing microbiomes. Nature Reviews Microbiology 1 (2018) doi:10.1038/s41579-018-0029-9.
23. Poretsky, R., Rodriguez-R, L. M., Luo, C., Tsementzi, D. & Konstantinidis, K. T. Strengths and Limitations of 16S rRNA Gene Amplicon Sequencing in Revealing Temporal Microbial Community Dynamics. PLOS ONE 9, e93827 (2014).
24. Gohl, D. M. et al. An optimized protocol for high-throughput amplicon-based microbiome profiling. (2016).
25. Vollmers, J., Wiegand, S. & Kaster, A.-K. Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist’s Perspective - Not Only Size Matters! PLOS ONE 12, e0169662 (2017).

26. Brooks, J. P. et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiology 15, 66 (2015).
27. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7, 335–336 (2010).
28. QIIME 2. https://qiime2.org/.
29. Navas-Molina, J. A. et al. Advancing our understanding of the human microbiome using QIIME. Meth. Enzymol. 531, 371–444 (2013).
30. Callahan, B. J. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13, 581–583 (2016).
31. Callahan, B. J., McMurdie, P. J. & Holmes, S. P. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal 11, 2639–2643 (2017).
32. Amir, A. et al. Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems 2, (2017).
33. Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Res. 42, D633-642 (2014).
34. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
35. Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41, D590–D596 (2013).
36. Weiss, S. et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5, 27 (2017).
37. Langille, M. G. I. et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotech 31, 814–821 (2013).
38. Lloyd-Price, J. et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature 550, 61–66 (2017).
39. Wang, J. & Jia, H. Metagenome-wide association studies: fine-mining the microbiome. Nature Reviews Microbiology 14, 508–522 (2016).
40. Jovel, J. et al. Characterization of the Gut Microbiome Using 16S or Shotgun Metagenomics. Front. Microbiol. 7, (2016).
41. Sharpton, T. J. An introduction to the analysis of shotgun metagenomic data. Front Plant Sci 5, (2014).

42. Rotmistrovsky, K. & Agarwala, R. BMTagger: Best Match Tagger for removing human reads from metagenomics datasets. (2011).
43. BBMap - Browse Files at SourceForge.net. https://sourceforge.net/projects/bbmap/files/.
44. Ni, J., Yan, Q. & Yu, Y. How much metagenomic sequencing is enough to achieve a given goal? Scientific Reports 3, 1968 (2013).
45. Medema, M. H. et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res 39, W339–W346 (2011).
46. Blin, K. et al. antiSMASH 4.0—improvements in chemistry prediction and gene cluster boundary identification. Nucleic Acids Res 45, W36–W41 (2017).
47. Fondi, M. & Liò, P. Genome-scale metabolic network reconstruction. Methods Mol. Biol. 1231, 233–256 (2015).
48. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733-745 (2016).
49. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
50. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res 42, D222–D230 (2014).
51. Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B. & Wu, C. H. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
52. Huson, D. H. et al. MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data. PLOS Computational Biology 12, e1004957 (2016).
53. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15, R46 (2014).
54. Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature Methods 12, 902–903 (2015).
55. Nguyen, N.-P., Mirarab, S., Liu, B., Pop, M. & Warnow, T. TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics 30, 3548–3555 (2014).
56. Abubucker, S. et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput. Biol. 8, e1002358 (2012).

57. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
58. Namiki, T., Hachiya, T., Tanaka, H. & Sakakibara, Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 40, e155 (2012).
59. Li, D. et al. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3–11 (2016).
60. Wu, Y.-W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 605–607 (2016).
61. Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nature Methods 11, 1144–1146 (2014).
62. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
63. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, (2014).
64. Barwell, L. J., Isaac, N. J. B. & Kunin, W. E. Measuring β-diversity with species abundance data. Journal of Animal Ecology 84, 1112–1122.
65. Bray, J. R. & Curtis, J. T. An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecological Monographs 27, 325–349 (1957).
66. Lozupone, C. A., Hamady, M., Kelley, S. T. & Knight, R. Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Appl. Environ. Microbiol. 73, 1576–1585 (2007).
67. Real, R. & Vargas, J. M. The Probabilistic Basis of Jaccard’s Index of Similarity. Systematic Biology 45, 380–385 (1996).
68. Lozupone, C. & Knight, R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol. 71, 8228–8235 (2005).
69. Golob, J. L., Margolis, E., Hoffman, N. G. & Fredricks, D. N. Evaluating the accuracy of amplicon-based microbiome computational pipelines on simulated human gut microbial communities. BMC Bioinformatics 18, 283 (2017).

70. Schloss, P. D. et al. Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
71. Dixon, P. VEGAN, a package of R functions for community ecology. Journal of Vegetation Science 14, 927–930 (2003).
72. McDonald, D. et al. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience 1, 7 (2012).
73. Li, H. Microbiome, Metagenomics, and High-Dimensional Compositional Data Analysis. Annual Review of Statistics and Its Application 2, 73–94 (2015).
74. Mandal, S. et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb Ecol Health Dis 26, (2015).
75. Morton, J. T. et al. Balance Trees Reveal Microbial Niche Differentiation. mSystems 2, e00162-16 (2017).
76. Aguiar-Pulido, V. et al. Metagenomics, Metatranscriptomics, and Metabolomics Approaches for Microbiome Analysis. Evol Bioinform Online 12, 5–16 (2016).
77. Lagier, J.-C. et al. Culturing the human microbiota and culturomics. Nature Reviews Microbiology 16, 540 (2018).
78. Abu-Ali, G. S. et al. Metatranscriptome of human faecal microbial communities in a cohort of adult men. Nature Microbiology 3, 356 (2018).
79. Schirmer, M. et al. Dynamics of metatranscription in the inflammatory bowel disease gut microbiome. Nat Microbiol 3, 337–346 (2018).
80. Wang, W.-L. et al. Application of metagenomics in the human gut microbiome. World J Gastroenterol 21, 803–814 (2015).
81. Martinez, X. et al. MetaTrans: an open-source pipeline for metatranscriptomics. Scientific Reports 6, 26447 (2016).
82. Wilmes, P., Heintz-Buschart, A. & Bond, P. L. A decade of metaproteomics: Where we stand and what the future holds. Proteomics 15, 3409–3417 (2015).
83. Erickson, A. R. et al. Integrated Metagenomics/Metaproteomics Reveals Human Host-Microbiota Signatures of Crohn’s Disease. PLoS ONE 7, e49138 (2012).
84. Heyer, R. et al. Challenges and perspectives of metaproteomic data analysis. Journal of Biotechnology 261, 24–36 (2017).

85. Rechenberger, J. et al. Challenges in Clinical Metaproteomics Highlighted by the Analysis of Acute Leukemia Patients with Gut Colonization by Multidrug-Resistant Enterobacteriaceae. Proteomes 7, 2 (2019).
86. Lai, L. A., Tong, Z., Chen, R. & Pan, S. Metaproteomics Study of the Gut Microbiome. in Functional Proteomics: Methods and Protocols (eds. Wang, X. & Kuruc, M.) 123–132 (Springer New York, 2019). doi:10.1007/978-1-4939-8814-3_8.
87. Lagier, J.-C. et al. Culture of previously uncultured members of the human gut microbiota by culturomics. Nature Microbiology 1, 16203 (2016).
88. Amrane, S., Raoult, D. & Lagier, J.-C. Metagenomics, culturomics, and the human gut microbiota. Expert Review of Anti-infective Therapy 16, 373–375 (2018).
89. Cassir, N. et al. Clostridium butyricum Strains and Dysbiosis Linked to Necrotizing Enterocolitis in Preterm Neonates. Clin. Infect. Dis. 61, 1107–1115 (2015).
90. Rowin, J., Xia, Y., Jung, B. & Sun, J. Gut inflammation and dysbiosis in human motor neuron disease. Physiol Rep 5, (2017).
91. Fang, X. et al. Evaluation of the Microbial Diversity in Amyotrophic Lateral Sclerosis Using High-Throughput Sequencing. Front Microbiol 7, (2016).
92. Brenner, D. et al. The faecal microbiome of ALS patients. Neurobiology of Aging 61, 132–137 (2018).
93. Mazzini, L. et al. Potential Role of Gut Microbiota in ALS Pathogenesis and Possible Novel Therapeutic Strategies. J. Clin. Gastroenterol. (2018) doi:10.1097/MCG.0000000000001042.
94. Sampson, T. R. et al. Gut Microbiota Regulate Motor Deficits and Neuroinflammation in a Model of Parkinson’s Disease. Cell 167, 1469-1480.e12 (2016).
95. Laukens, D., Brinkman, B. M., Raes, J., De Vos, M. & Vandenabeele, P. Heterogeneity of the gut microbiome in mice: guidelines for optimizing experimental design. FEMS Microbiol Rev 40, 117–132 (2016).
96. Sonnenburg, J. L. & Bäckhed, F. Diet–microbiota interactions as moderators of human metabolism. Nature 535, 56 (2016).
97. Klingelhoefer, L. & Reichmann, H. Pathogenesis of Parkinson disease—the gut–brain axis and environmental factors. Nature Reviews Neurology 11, 625 (2015).
98. Hopfner, F. et al. Gut microbiota in Parkinson disease in a northern German cohort. Brain Res. 1667, 41–45 (2017).

99. Zhao, Y., Jaber, V. & Lukiw, W. J. Secretory Products of the Human GI Tract Microbiome and Their Potential Impact on Alzheimer’s Disease (AD): Detection of Lipopolysaccharide (LPS) in AD Hippocampus. Front Cell Infect Microbiol 7, 318 (2017).
100. Ludolph, A. et al. A revision of the El Escorial criteria - 2015. Amyotroph Lateral Scler Frontotemporal Degener 16, 291–292 (2015).
101. Ioannides, Z. A. et al. Predictions of resting energy expenditure in amyotrophic lateral sclerosis are greatly impacted by reductions in fat free mass. Cogent Medicine 4, 1343000 (2017).
102. Siri, W. E. Body composition from fluid spaces and density: analysis of methods. 1961. Nutrition 9, 480–491; discussion 480, 492 (1993).
103. Klindworth, A. et al. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res 41, e1 (2013).
104. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
105. Anderson, M. J. Permutational Multivariate Analysis of Variance (PERMANOVA). in Wiley StatsRef: Statistics Reference Online 1–15 (American Cancer Society, 2017). doi:10.1002/9781118445112.stat07841.
106. Douglas, G. M. et al. PICRUSt2: An improved and extensible approach for metagenome inference. bioRxiv 672295 (2019) doi:10.1101/672295.
107. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl. Acids Res. 28, 27–30 (2000).
108. Yatsunenko, T. et al. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).
109. Tomik, J., Sowula, K., Ceranowicz, P., Dworak, M. & Stolcman, K. Effects of Biofeedback Training on Esophageal Peristalsis in Amyotrophic Lateral Sclerosis Patients with Dysphagia. J Clin Med 9, (2020).
110. Bradley, P. H. & Pollard, K. S. Proteobacteria explain significant functional variability in the human gut microbiome. Microbiome 5, 36 (2017).
111. Shin, N.-R., Whon, T. W. & Bae, J.-W. Proteobacteria: microbial signature of dysbiosis in gut microbiota. Trends Biotechnol. 33, 496–503 (2015).

112. Vandoorne, T., Bock, K. D. & Bosch, L. V. D. Energy metabolism in ALS: an underappreciated opportunity? Acta Neuropathol 135, 489–509 (2018).
113. Burberry, A. et al. C9orf72 suppresses systemic and neural inflammation induced by gut bacteria. Nature 1–6 (2020) doi:10.1038/s41586-020-2288-7.
114. Gordon, P. H. et al. Efficacy of minocycline in patients with amyotrophic lateral sclerosis: a phase III randomised trial. The Lancet Neurology 6, 1045–1053 (2007).
115. Paganoni, S., Deng, J., Jaffa, M., Cudkowicz, M. E. & Wills, A.-M. Body mass index, not dyslipidemia, is an independent predictor of survival in amyotrophic lateral sclerosis. Muscle Nerve 44, 20–24 (2011).
116. Ioannides, Z. A., Ngo, S. T., Henderson, R. D., McCombe, P. A. & Steyn, F. J. Altered Metabolic Homeostasis in Amyotrophic Lateral Sclerosis: Mechanisms of Energy Imbalance and Contribution to Disease Progression. Neurodegener Dis 16, 382–397 (2016).
117. Menni, C. et al. Gut microbiome diversity and high-fibre intake are related to lower long-term weight gain. International Journal of Obesity 41, 1099–1105 (2017).
118. Kocełak, P. et al. Resting energy expenditure and gut microbiota in obese and normal weight subjects. Eur Rev Med Pharmacol Sci 17, 2816–2821 (2013).
119. St, N. et al. Loss of appetite is associated with a loss of weight and fat mass in patients with amyotrophic lateral sclerosis. Amyotrophic lateral sclerosis & frontotemporal degeneration vol. 20 https://pubmed.ncbi.nlm.nih.gov/31144522/ (2019).
120. Mezoian, T. et al. Loss of appetite in amyotrophic lateral sclerosis is associated with weight loss and decreased calorie consumption independent of dysphagia. Muscle & Nerve 61, 230–234 (2020).
121. Ley, R. E., Turnbaugh, P. J., Klein, S. & Gordon, J. I. Human gut microbes associated with obesity. Nature 444, 1022–1023 (2006).
122. Turnbaugh, P. J. et al. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444, 1027–1031 (2006).
123. Blacher, E. et al. Potential roles of gut microbiome and metabolites in modulating ALS in mice. Nature 572, 474–480 (2019).
124. Tungland, B. Chapter 2 - Short-Chain Fatty Acid Production and Functional Aspects on Host Metabolism. in Human Microbiota in Health and Disease (ed. Tungland, B.) 37–106 (Academic Press, 2018). doi:10.1016/B978-0-12-814649-1.00002-8.

125. Ríos-Covián, D. et al. Intestinal Short Chain Fatty Acids and their Link with Diet and Human Health. Front Microbiol 7, (2016).
126. van Es, M. A. et al. Amyotrophic lateral sclerosis. Lancet (2017) doi:10.1016/S0140-6736(17)31287-4.

4.7 Supplementary Material

Supplementary Table 1. MND microbiome beta-diversity outliers clinical data

Sample ID          R2368        R0089           R1350         R2362
SEX                Male         Female          Male          Female
AGE                53           63              57            62
Site of onset      Spinal       Spinal          Bulbar        Spinal
Delta FRS          0.34         0.30            0.55          0.32
Kings Stage        3            1               4B            3
FVC (%)            60           84              65            95
HEIGHT             177.5        160.5           177.0         166.0
WEIGHT             90.715       56.357          68.316        75.129
BMI                28.793       21.878          21.806        27.264
% Fat              37.6         36.6            22.1          46.8
Fat mass (kg)      34.105       20.611          15.065        35.151
FFM (kg)           56.61        35.746          53.251        39.978
Metabolic Index    94.24        74.25           112.68        121.65
Hypermetabolic     NO           NO              NO            YES
Smoking            No           No              No            Yes
Alcohol            2            14              0             0
RILUZOLE           yes          yes             yes           no
COMORBIDITY        depression   restless legs   cholesterol   nil


Chapter 5: Thesis Discussion

Thesis summary, limitations, and the future of ALS genetics research

5.1 Thesis summary

5.1.1 Thesis goal and general strategy

In this thesis, my overarching goal was to better understand the aetiology of ALS. As I reviewed in Chapter 1, ALS is a complex disease that involves genetic and non-genetic risk factors. Genetic factors are among the major contributors to ALS risk, with a heritability estimate of ~40% according to the latest family studies1. Moreover, ALS genetic data are more reliable and accurate than data on non-genetic risk factors, which mostly come from self-report2. For these reasons, the main focus of my research was the genetic analysis of ALS. By reviewing the literature, I found that many non-genetic risk factors for ALS are themselves heritable, such as hypermetabolism3, educational attainment4, some psychiatric conditions5–7 and elite athleticism8, which led to the hypothesis that some of the “non-genetic” risk factors for ALS could be genetically correlated with it. Therefore, I studied the non-genetic contributions to ALS indirectly, through the genetic correlations of these heritable risk factors (Chapter 2). Despite the current limitations of ALS GWAS, I used genetically correlated heritable risk factors to improve ALS risk prediction, which, in turn, may have added some insight into the aetiology of ALS. I explored novel genetic and multi-omics methodologies to enrich the available ALS GWAS results, which might provide further insight into, and a more complete view of, ALS aetiology (Chapter 3). Finally, this thesis explored variation in the gut microbiome and its association with ALS traits (Chapter 4).

5.1.2 Thesis findings

5.1.2.1 ALS is genetically correlated with cognitive performance, educational attainment, and schizophrenia

An important research approach used in this thesis was to study ALS genetics through genetically correlated risk factors. To begin implementing this strategy, I conducted genetic correlation analyses across publicly available GWAS summary statistics using the LDHub platform9. The study found that ALS has significant negative genetic correlations with traits related to educational attainment (EA) and cognitive performance (CP). Furthermore, I used these correlations to construct multi-trait PRS predictors for ALS. I combined the genetic effects of ALS, EA, and CP, weighted according to the strength of their genetic correlations (multi-trait PRS), applied this predictor to an independent dataset, and found an increase in the variance explained by the prediction. This observation added evidence for these genetic correlations. These results also supported previous studies that identified educational and cognitive performance factors as risk factors for ALS. The negative genetic correlations of ALS with these two traits suggest that high EA or CP provides a protective effect against ALS, but this does not necessarily imply causality. Using similar analytical approaches (genetic correlation and multi-trait PRS), the previously reported positive genetic correlation between schizophrenia (SCZ) and ALS10 (estimated with partially overlapping data) was also confirmed. The predictive improvement from the ALS multi-trait predictor (with CP, EA, and SCZ) also supports the estimated genetic correlations with ALS (from independent data).

While sample size does not affect the expected estimate of a genetic correlation, it does affect the standard error, and therefore the ability to declare a correlation significantly different from zero. One concern with the genetic correlation study is that the significant correlations might be driven by high-powered and polygenic GWAS traits, which include the CP, EA, and SCZ GWAS. If this bias were present, other well-powered and highly polygenic GWAS, such as height11 and type 2 diabetes12, would also be expected to be detected as significantly correlated with ALS. In Chapter 3, the cell-type analysis of ALS, CP, EA, and SCZ found that they are all highly enriched in brain-related tissues, which might explain their significant correlations with ALS. This brain-tissue-related correlation also holds for Parkinson’s and Alzheimer’s disease, which have been reported to be genetically correlated with CP13,14, EA15–17, and SCZ18,19. Therefore, genetic correlations with CP, EA, and SCZ are not exclusive to ALS and might also help us to understand the aetiology of other neurodegenerative diseases such as Parkinson’s and Alzheimer’s disease.
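The point about standard errors can be made concrete with a minimal sketch. It assumes a simple two-sided z-test of a genetic correlation estimate against zero and a Bonferroni threshold over roughly 200 traits; the function name and all numeric values are hypothetical illustrations, not estimates from this thesis.

```python
import math

def rg_p_value(rg, se):
    """Two-sided P-value for H0: rg = 0, assuming rg / se is standard normal."""
    z = rg / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical values: the same point estimate with two different standard errors.
n_traits = 200                   # assumed number of traits tested
threshold = 0.05 / n_traits      # Bonferroni-corrected significance threshold
for se in (0.05, 0.10, 0.20):
    p = rg_p_value(-0.30, se)
    print(f"rg=-0.30 se={se:.2f} P={p:.2e} significant={p < threshold}")
```

With the point estimate held fixed, only the smallest standard error passes the corrected threshold, which is why well-powered GWAS are more likely to yield correlations that can be declared significant.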

5.1.2.2 ALS genetic architecture is not highly polygenic and involves a significant proportion of rare variants

Despite the improved PRS predictive ability for ALS from adding the PRS for EA, CP, and SCZ to the prediction, the prediction performance, as measured by variance explained (Nagelkerke R²) and area under the curve (AUC), was very small. The main reason for this low prediction accuracy was the low SNP-based heritability of ALS estimated from common SNPs (0.0176, SE = 0.004), calculated using the latest published ALS GWAS20. This low SNP-based heritability estimate, and the low variance explained by genetic predictors, limit the application of ALS genetic predictors in clinical21–23 and research settings24. Similar limitations also apply to the interpretation of genetic correlation results25. The low SNP-based heritability estimate for ALS may reflect its genetic architecture, as previous analyses suggested that low-frequency variants may be relatively more important in ALS than in other common diseases26.

A series of analyses using Bayesian approaches (SBayesR27 and SBayesS28), which have greater flexibility in modelling the underlying genetic architecture (not only the infinitesimal model), generated improved SNP weights for the PRS. The non-infinitesimal ALS genetic predictors performed significantly better (Chapters 2 and 3), and the estimated proportion of causal SNPs (the Pi parameter, π) for ALS was very low (the SBayesS and SBayesR posterior estimates are π ~ 0.007 with SD ~ 0.001, Chapter 3). These observations are best judged relative to other diseases (type 2 diabetes π = 0.01 and depression π = 0.08)28, which suggests that the genetic architecture of ALS is not as highly polygenic as that of these diseases. The SBayesS method also models the relationship between genetic effect sizes and allele frequency; for ALS, the estimated relationship is consistent with relatively strong negative selection (Chapter 3). The SBayesS posterior estimate of the S parameter was -1 (SD = 0.09), compared with S parameters for other common disorders of -0.30 for type 2 diabetes and -0.24 for depression. Since the SNP-based heritability of ALS was very small, caution is needed in interpreting this implied “negative selection”. Nonetheless, this strongly negative S value still reflects a strong relationship between low minor allele frequency and larger effect sizes, which supports a relatively more important involvement of rare variants in the genetic architecture of ALS than in that of other common disorders.
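To illustrate what the S parameter implies, the sketch below assumes the commonly used form of the effect-size prior, in which the variance of a causal SNP effect is proportional to [2p(1-p)]^S, where p is the minor allele frequency. The function names, the value of sigma2, and the allele frequencies are hypothetical; the S values are those quoted above. This is an illustration of the functional form only, not the SBayesS software.

```python
def effect_variance(maf, s, sigma2=1.0):
    """Assumed prior variance of a causal SNP effect: sigma2 * [2p(1-p)]^S."""
    return sigma2 * (2 * maf * (1 - maf)) ** s

def variance_explained(maf, s, sigma2=1.0):
    """Expected trait variance explained by one causal SNP: 2p(1-p) * Var(beta)."""
    return 2 * maf * (1 - maf) * effect_variance(maf, s, sigma2)

# Compare a rare (MAF 0.01) and a common (MAF 0.5) causal variant.
for label, s in (("ALS", -1.0), ("type 2 diabetes", -0.30), ("depression", -0.24)):
    ratio = effect_variance(0.01, s) / effect_variance(0.5, s)
    print(f"{label}: S={s:+.2f}, rare/common effect-variance ratio = {ratio:.2f}")
```

Under this assumed form, S = -1 means that the expected variance explained per causal variant does not depend on allele frequency, so rare variants carry proportionally much larger effects than they would under the S values estimated for type 2 diabetes or depression.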

5.1.2.3 ALS associated variants are likely to affect gene expression in brain and immune related tissues

I applied cell-type enrichment analysis to infer which cell types have more genetic associations with a trait than expected by chance29. This analysis uses differential gene expression across cell types to annotate gene combinations to cell types and then identify the cell-type gene sets. Using the latest ALS European GWAS (unpublished), I showed that ALS was highly enriched in central nervous system (brain) related tissues (Chapter 3). While this observation is expected for ALS, it does help confirm that the ALS GWAS associations include real association signals. Notably, similar cell types were also highly enriched in CP, EA, and SCZ. These results could imply that the genetic correlations are driven by the cell types (brain-related tissues) shared by these traits. Therefore, I took this cell-type overlap hypothesis further by looking at the genes that are possibly causal for all of these traits. I conducted SMR30 analysis using the largest GWAS summary statistics of these traits with brain (brain meta-analysis)31 and blood (eQTLgen)32 expression quantitative trait loci (eQTL) summary statistics. I found some overlapping genes between ALS and CP. Notably, most of the overlapping genes did not pass the HEIDI30 test for ALS. This could imply that the overlap between ALS and CP reflects linkage (genes in the same LD block) rather than pleiotropy or causality. An alternative explanation is that these genes have a complex pleiotropic architecture with multiple causal variants. More research is needed to understand the relationship between these traits.

Despite these inconclusive results, the cell-type analysis also found an enrichment of genetic associations for ALS in genes expressed in an immune-related cell type (dendritic cells). Although the hypothesis that the immune system is involved in ALS pathology is not new, the suggested contribution of the immune system via dendritic cells could be interesting to explore further. Dendritic cells (DC) are “antigen presenting cells” (cells that present antigens from foreign cells to T cells), which act as messengers between the innate and adaptive immune systems33 and are known for their involvement in the pathology of multiple sclerosis (MS)34, Alzheimer’s disease35 and Parkinson’s disease35. In MS pathology, the accumulation of DC in the brain is mainly caused by perturbed DC drainage from the brain to the cervical lymph nodes34. A similar mechanism might be relevant in ALS, but more studies are needed to confirm this. DC immune function also depends on pattern recognition receptors (in the form of toll-like receptors) that need to be “trained” to function properly (matured) by exposure to surrounding pathogens such as viruses and bacteria36. If DC do not mature properly, aberrant allergic reactions and autoimmune diseases of the gut, such as inflammatory bowel diseases37,38, can occur.
Hence, it would be interesting to examine the interaction of DC with the gut microbiota.

The involvement of immune system defects in ALS pathology can also be observed at the gene level. In part 1.2.2.1, I reviewed the ALS-associated genes involved in failure of neuron support and of the immune system (microglia). Two genes are suspected to be involved in this pathology, C9orf72 and TBK1. The C9orf72 repeat expansion is strongly associated with both familial and sporadic ALS; however, the exact pathology is unknown39,40. One study suggested that excessive C9orf72 repeats inactivate G proteins by exhausting guanine (which is important as a G protein exchange factor), leading to abnormal microglia activity40,41. However, the latest study, using a C9orf72 knock-out mouse model, revealed that lack of C9orf72 expression caused only a mild motor deficit but significant abnormal immune activity throughout the whole body of the mice (not only in brain microglia), resembling systemic lupus erythematosus42. While this whole-body immune abnormality does not fit with human ALS pathology, it provides support that C9orf72 is involved in immune system related pathology. It was suggested that C9orf72 might interact with other ALS-associated genes that are expressed especially in brain and nerve tissues, creating localised abnormal activity in the brain42. I listed such brain-related putative causal genes (SLC9A8 and RESP18) in the SMR analysis in Chapter 3.

The TBK1 gene is also heavily involved in the immune system, especially in the autophagy mechanism43. Autophagy acts like a cell’s internal immune system, clearing excess proteins and intracellular pathogens by degrading them44. TBK1 activates autophagy by phosphorylating the autophagy receptor optineurin in response to accumulating interferon (usually produced during intracellular infection) or aggregated protein44. TBK1 nonsense and missense mutations create dysfunctional TBK1 protein with reduced ability and responsiveness to activate autophagy, which increases the chance of accumulating aberrant proteins such as mutant SOD145. Therefore, it is possible that the immune system is involved in ALS both through excessive autoimmune reaction (C9orf72) and through failure of the cell’s intracellular immune system to clear accumulated protein (TBK1).

5.1.2.4 ALS causal genes by SMR analysis

One of the main goals of genetic studies of ALS is to find its causal genes. The latest ALS GWAS reported 14 loci associated with ALS; however, these associations do not imply causality. To test the possible causality of these loci, SMR analyses were performed using the ALS-relevant tissues identified by the cell-type analysis (Chapter 3 part 3.3.3). The largest brain and blood eQTL summary statistics available to date (Brain meta-eQTL and Prefrontal Cortex PsychENCODE for brain; eQTLGen for blood) were used. The blood eQTL data were used to capture immune-related cell types such as DC. The SMR analyses suggest 11 putative causal genes for ALS (Chapter 3 part 3.3.4). SCFD1 was the most significantly associated gene in all eQTL data sets used. The brain meta-eQTL and prefrontal cortex PsychENCODE data also provide support for RESP18 and SLC9A8 as causal genes based on expression QTLs in brain-related tissues. RESP18 has not previously been reported as an ALS-associated gene. The MOBP gene was significant in the brain meta-eQTL data only, while the G2E2 and SARM1 genes were significant using the prefrontal cortex PsychENCODE data. The blood eQTLGen data detected 4 genes (GGNBP2, DHRS11, ZNHIT3, MYO19) in the same region of chromosome 17. Many of these putative causal genes (SCFD1, SLC9A8, RESP18, GGNBP2, DHRS11) have functional annotations that support the endoplasmic reticulum stress pathology of ALS. Other putative causal genes, such as MYO19 and MOBP, are relevant to mechanisms involving cytoskeleton defects.
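For readers less familiar with SMR, the core test can be sketched as follows. This is a minimal illustration under the standard summary-data formulation: the effect of expression on the trait is estimated as the ratio of the GWAS effect to the eQTL effect at the top cis-eQTL, and the test statistic is approximately chi-squared with one degree of freedom. The function name and the input numbers are hypothetical, and the HEIDI heterogeneity test is not implemented here.

```python
import math


def smr_test(b_gwas: float, se_gwas: float, b_eqtl: float, se_eqtl: float):
    """Return (b_xy, t_smr, p_value) for one gene at its top cis-eQTL.

    b_xy  : estimated effect of gene expression on the trait (b_gwas / b_eqtl)
    t_smr : approximate chi-squared statistic with 1 degree of freedom
    """
    z_gwas = b_gwas / se_gwas
    z_eqtl = b_eqtl / se_eqtl
    b_xy = b_gwas / b_eqtl
    t_smr = (z_gwas ** 2 * z_eqtl ** 2) / (z_gwas ** 2 + z_eqtl ** 2)
    # Survival function of a chi-squared(1) variable: P(X > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p_value


# Hypothetical summary statistics for a single SNP-gene pair (not thesis data).
b_xy, t_smr, p_value = smr_test(b_gwas=0.1, se_gwas=0.02, b_eqtl=0.5, se_eqtl=0.05)
print(f"b_xy = {b_xy:.3f}, T_SMR = {t_smr:.2f}, P = {p_value:.2e}")
```

Because the statistic is bounded by the smaller of the two squared z-scores, a weak GWAS signal limits the SMR test even when the eQTL is very strong, which is one way to see why GWAS power matters for this analysis.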

5.1.2.5 ALS patients have more diverse microbiome profiles (beta diversity) that might affect their disease progression

My interest in the ALS gut microbiome was driven by previous studies that link ALS risk and progression with metabolic dysregulation46–48. As reviewed in Chapter 1, many metabolic activities are also heritable, but the genetic correlation study (Chapter 2 part 2.4.2) did not find any significant genetic correlations between them and ALS, and no digestion-related tissues were enriched in the ALS cell-type analysis (Chapter 3 part 3.3.3). These observations led us to hypothesise that metabolic dysregulation might be a consequence rather than a cause of ALS. One hypothesis is that the metabolic dysregulation in ALS might be driven by gut microbiome dysbiosis. Therefore, we conducted a 16S rRNA marker-gene microbiome analysis of faecal samples from a case-control cohort of one hundred individuals, one of the largest gut microbiome studies in ALS to date. In general, we did not find a significant difference in microbiome composition between ALS patients and healthy controls. However, I found that the variation of microbiome profiles (beta diversity) within the ALS patient group was significantly wider than within the healthy control group. Although the significance was driven by only four ALS patient outliers, I tested the difference in beta diversity using permutation tests (PERMANOVA and PERMDISP) and found that this observation was unlikely to occur by chance. Therefore, the estimated difference in beta diversity variance might reflect real variation in the beta diversity of ALS patients compared to controls. The other interesting finding was that ALS patients with lower alpha diversity (measured by the Shannon index) had better survival and slower disease progression, which challenges the current view that higher alpha diversity leads to a healthier gut and longer survival. Although these results need to be replicated in other studies, they suggest that the gut microbiome is more relevant to ALS disease progression than to disease onset.
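The two diversity notions used above can be illustrated with a small sketch. The Shannon index is computed per sample (alpha diversity); for between-sample (beta) diversity the sketch assumes Bray-Curtis dissimilarity and uses the mean pairwise dissimilarity within a group as a crude proxy for dispersion. That proxy is an assumption made for illustration only: it is not the PERMDISP procedure, and the counts are hypothetical.

```python
import math
from itertools import combinations


def shannon_index(counts):
    """Shannon alpha diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples of taxon counts."""
    denominator = sum(a) + sum(b)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


def mean_pairwise_dissimilarity(samples):
    """Average Bray-Curtis dissimilarity over all pairs of samples in one group."""
    pairs = list(combinations(samples, 2))
    return sum(bray_curtis(a, b) for a, b in pairs) / len(pairs)


# Hypothetical taxon counts (three taxa per sample) for two small groups.
cases = [[50, 30, 20], [90, 5, 5], [10, 10, 80]]
controls = [[40, 35, 25], [45, 30, 25], [38, 37, 25]]

print("Shannon index per case sample:", [round(shannon_index(s), 3) for s in cases])
print("Within-group dispersion, cases:", round(mean_pairwise_dissimilarity(cases), 3))
print("Within-group dispersion, controls:", round(mean_pairwise_dissimilarity(controls), 3))
```

In this toy example the case group is more dispersed than the control group even though no single taxon differs consistently between groups, which mirrors the kind of pattern described in the text.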

5.2 Thesis limitations

5.2.1 Low SNP-based heritability estimates of ALS GWAS

For the last 13 years, GWAS has been a relatively successful method for studying complex traits and disorders49. However, the low SNP-based heritability for ALS estimated from GWAS data could suggest that the GWAS paradigm is less relevant to ALS, although there are caveats against drawing this conclusion. The largest (unpublished) ALS GWAS to date estimated the SNP-based heritability as only ~0.03 (SE=0.002 by SBayesR and SBayesS). This is very small compared to the total heritability of ~0.41 estimated from family studies. This difference between heritability and SNP-based heritability estimates has implications for clinical application21–23. GWAS data suggest that ALS is not highly polygenic and could involve a significant proportion of genetic variation associated with rare alleles that are not captured by common SNPs in GWAS (Chapter 3). As a consequence, the application of common variant information in clinical settings is very limited, the variance explained by PRS is low, and many ALS genes might not yet have been detected. It is notable that the ALS SNP-based heritability estimate has decreased from the initial estimate (SNP-h2=0.08, SE=0.005 in van Rheenen et al. 2016 with 12,577 cases and 23,475 controls)26 to the latest estimate (SNP-h2=0.03, SE=0.002 in the latest unpublished European ALS GWAS with 27,434 cases and 112,018 controls). The larger the sample size, the smaller the standard error of the estimate; hence the estimate from the largest sample should be more accurate. However, there are some technical limitations of the ALS GWAS data that could play a role. Many cohorts contributing to the ALS GWAS comprise case-only or control-only samples, which require stringent QC when they are brought together into a case versus control analysis. It is possible that real genetic signal is lost in this process, which could bias estimates of SNP-based heritability downward and reduce power for identifying significant SNP associations. If common variants play a limited role in the genetic architecture of ALS, then the logical next steps are to conduct whole genome sequencing (WGS) studies (to incorporate rare alleles), to increase the sample size (which is needed to make rare-allele association studies possible)50, and to take more care in the experimental design by properly matching cases and controls.
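The size of the gap described in this section can be put into numbers with a short calculation using the estimates quoted above. The comparison of the two SNP-based heritability estimates treats them as independent, which is an assumption made only for illustration: the two GWAS share samples, so the z-score is a rough indication and not a formal test. Function names are hypothetical.

```python
import math


def fraction_captured(snp_h2: float, family_h2: float) -> float:
    """Share of family-based heritability that is captured by common SNPs."""
    return snp_h2 / family_h2


def z_difference(est_1: float, se_1: float, est_2: float, se_2: float) -> float:
    """z-score for the difference of two estimates, assuming they are independent."""
    return (est_1 - est_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


# Values quoted in the text: SNP-h2 of ~0.03 against family-based h2 of ~0.41.
print(f"Fraction of heritability captured: {fraction_captured(0.03, 0.41):.1%}")

# Initial (0.08, SE 0.005) versus latest (0.03, SE 0.002) SNP-h2 estimates.
print(f"z for the drop in SNP-h2: {z_difference(0.08, 0.005, 0.03, 0.002):.1f}")
```

On these figures, common SNPs capture roughly 7% of the family-based heritability, and the drop between the two estimates is many standard errors wide, which is why sampling error alone is an unlikely explanation for it.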

5.2.2 The limited availability of ALS GWAS and expression level data with sufficient power

In this thesis, I investigated the putative causality of ALS GWAS SNPs using the SMR methodology, which incorporates eQTL data from brain and blood tissues (Chapter 3). Although this analysis found some possible ALS causal genes, the results could be considered underwhelming, mostly because of the underpowered ALS GWAS and the lack of expression-level data with sufficient power.

First, consider the problem that comes from the underpowered ALS GWAS. The objective evidence that the GWAS is underpowered comes from the out-of-sample prediction results, which show better prediction when SNPs with a P-value greater than the genome-wide significance threshold (P-value > 5×10⁻⁸) are used to generate the polygenic risk scores (PRS) than when only genome-wide significant SNPs are used. This implies that there are ALS loci with more subtle effect sizes that do not reach significance at the current ALS GWAS sample size (Chapters 2 and 3). A similar predictive pattern was also found in well-powered GWAS for SCZ, where out-of-sample prediction using PRS from SNPs with P-values greater than the genome-wide threshold (P-value > 5×10⁻⁸) performed better than PRS from genome-wide significant SNPs5. This indicates that the limitation is common to PRS built only from genome-wide significant SNPs. The PRS prediction of SCZ using the initial GWAS in 2008 (479 cases and 2,937 controls) was also very similar to the current ALS PRS prediction (very small predictive power, and a larger P-value threshold of > 0.5 yielding better prediction than the genome-wide threshold)51,52. The increased sample size of the SCZ GWAS in 2014 (36,989 cases and 113,075 controls) not only improved PRS prediction but also lowered the P-value threshold (P-value > 0.05) of the best SCZ PRS predictor5. This SCZ prediction (using the lower P-value threshold) is stable in the latest GWAS release53. It is quite likely that the same trend applies to ALS: the larger the sample size, the more statistical power there is to detect subtle genetic effects, and the lower the P-value threshold required to optimise the PRS prediction.

A second problem comes from the availability of well-powered tissue-specific expression (e.g. eQTL) data. According to the cell-type analysis, ALS associations were significantly enriched in brain and immune-related cell types (Chapter 3). However, this result is likely limited by the availability of data for other cell types with sufficient statistical power to identify eQTL. Other cell types may be relevant to ALS but do not yet show significant enrichment, either because of lack of power or simply because they are not yet available in reference data sets. Brain and whole blood are the most commonly available tissue types and have the largest expression data sets available to date. I could only use the brain meta-eQTL (N=1.1K) and prefrontal cortex PsychENCODE (N=1.4K) data (the largest expression data sets available) to represent brain tissues in the SMR analysis, while the blood eQTLGen data (~31K) were used to represent general expression. In addition to the limited power of the available expression data, the lack of motor neuron expression data also limits the SMR analysis. However, according to a review by Liu et al. (2017), the correlation in gene expression between 11 tissues in GTEx ver. 6 (ref. 54) was ~0.75. This means that an analysis using a tissue not directly relevant to the trait of interest will, on average, have a correlation of ~0.75 with the gene expression analysis using the right tissue. Since ALS is a motor neuron disease for which eQTL data are not yet available, there is some loss of power, but the very large blood eQTLGen data set (~31K) is well powered to identify many eQTL shared across tissues.
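The P-value thresholding of polygenic risk scores discussed in this section can be sketched as follows. This is a minimal illustration that assumes SNPs have already been pruned for linkage disequilibrium; the function name, dosages, effect sizes and P-values are all hypothetical.

```python
def polygenic_score(dosages, effects, pvalues, p_threshold):
    """Sum of effect-weighted risk-allele dosages over SNPs with P <= p_threshold."""
    return sum(
        dosage * effect
        for dosage, effect, p in zip(dosages, effects, pvalues)
        if p <= p_threshold
    )


# One hypothetical individual genotyped at four independent SNPs.
dosages = [0, 1, 2, 1]               # copies of the effect allele
effects = [0.2, -0.1, 0.05, 0.3]     # GWAS effect sizes (e.g. log odds ratios)
pvalues = [1e-9, 0.01, 0.2, 0.6]     # GWAS association P-values

for threshold in (5e-8, 0.05, 1.0):
    score = polygenic_score(dosages, effects, pvalues, threshold)
    print(f"P <= {threshold:g}: PRS = {score:.2f}")
```

In practice the score is computed at each threshold for every individual in an independent target sample, and the threshold that gives the best out-of-sample prediction is reported. A more liberal optimal threshold, as seen for ALS, indicates that many SNPs below genome-wide significance still carry real signal.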

5.2.3 The lack of power of the gut microbiome study and the difficulty in inferring causality

Our microbiome study was the largest published to date for ALS, yet it was still underpowered to draw strong conclusions. Our microbiome analysis (Chapter 4) found that ALS patients with lower microbial diversity tend to have longer survival and that some patients might experience heavy dysbiosis. However, this study cannot test whether microbiome differences are a cause or merely a consequence of ALS. Our current faecal microbiome cohort (49 ALS cases and 51 healthy controls) and methodological approach (single sampling with 16S rRNA marker-gene sequencing) did not have enough power and precision to resolve microbial abundance at the species level or to take full advantage of our rich clinical measurements. The lack of power in our study is evident when compared with the latest published microbiome study55. Using shotgun metagenomic sequencing of 34,057 community-sample individuals from two cohorts in Israel and the US, Rothschild et al. (2020) detected strong correlations between gut microbiome composition and age, gender, BMI, and blood glucose. Notably, these results were consistent across the US and Israel cohorts55. Our current cohort of 100 individuals did not have enough statistical power to detect these correlations.

5.3 Implications from results

One of the main findings reported in this thesis is the suggested genetic architecture of ALS. Using the largest unpublished ALS GWAS to date, post-GWAS analyses with various approaches (Chapter 3) suggest that the ALS genetic architecture tends to be less polygenic than that of other common diseases (Type-2 diabetes, schizophrenia, major depression, cardiovascular disease) and involves a significant proportion of rare variants. This suggestion is useful for guiding future ALS genetic research, which should concentrate on the application of whole-genome sequencing and the detection of associated genes through burden testing (discussed in part 5.4.1).

The cell-type analysis of ALS and the causal gene list from the SMR analysis (Chapter 3) could be relevant for guiding in-vivo and in-vitro studies of ALS genetics. Despite the lack of some cell-type-specific gene reference data sets relevant to ALS (discussed in part 5.2.2), the SMR results indicate that genes in ALS-associated regions are quite likely expressed in brain-related tissues and immune-related cells. This finding adds support to the view that future ALS studies should focus on modelling ALS in brain tissues and, to some extent, in immune-related cells (such as DC). The ideal study for this differential expression is discussed in 5.4.2. The causal gene list from the SMR analysis provides guidance on which genes to prioritise in follow-up functional genomic studies. Moreover, the annotations of many of those genes (from both brain and blood eQTL) indicate involvement in protein processing in the endoplasmic reticulum (ER), which potentially contributes to the ER stress mechanism of ALS pathology. This observation also suggests further investigation of the biological processes of the ER in ALS-affected cells (discussed in 5.4.3).

The genetic correlation analysis found that ALS was significantly correlated with CP and EA (Chapter 2).
Moreover, these two traits were negatively correlated with ALS and could be leveraged to improve ALS genetic predictors. Even though this result does not imply causality or a protective effect of education (EA) or IQ (CP) on ALS, it does confirm the negative association, which was previously inconclusive in many observational studies of education, IQ and the risk of ALS. These seemingly protective effects of education are likely mediated by socio-economic status56, as highly educated populations are less likely to be exposed to ALS risk factors such as smoking, heavy metals, or head trauma. This education effect might also explain the weaker genetic correlations with exercise and smoking related traits (Supplementary table 1 in Chapter 2). Deeper investigation of these genetic correlations based on the cell-type analysis and SMR output suggested that CP has a stronger genetic overlap with ALS, which could potentially help the understanding of the ALS-FTD spectrum at several biological levels. At the tissue level, the cell-type analyses of ALS and CP were both highly enriched in the entorhinal cortex, which plays an important role in memory57 (Chapter 3). At the gene level, the P-values of many CP SMR causal genes were inflated in the ALS SMR results (with inflation parameter lambda = 3.0 in Chapter 3). The MYO19 gene was also significant for both ALS and CP, which indicates a shared causal gene. This result could be important for understanding the genetic basis of ALS-FTD beyond C9orf72.

Lastly, despite negative results, the gut microbiome study of ALS in Chapter 4 revealed a tendency for ALS patients with less diverse microbiome profiles to have slower progression. This observation suggests that the role of gut microbiota in ALS might be more relevant to disease progression. It is also worth noting that the ALS case group had significantly larger beta-diversity variation. Although this significantly larger variation was driven by 4 outliers, it might indicate true variation that would be seen in a larger sample of ALS cases. The larger variation within the ALS case group also indicates the importance of analysing microbiome composition within ALS cases rather than only between cases and controls. These observations were possible because of the rich clinical information within this cohort. To build on this study, I discuss the ideal gut microbiome study design for ALS in part 5.4.4.

5.4 Future of ALS research

5.4.1 Population scale whole genome sequencing to investigate rare variants

As discussed in section 5.2.1, whole genome sequencing of a large ALS case-control cohort is an obvious solution to some of the ALS GWAS limitations. However, this solution is hard to implement in the real world. First, ALS is a relatively rare disease with a lifetime risk of 0.33% (many common diseases studied have a lifetime risk >1%)58–61, which makes recruiting cases in massive numbers difficult. Moreover, because many people with ALS survive only 3-5 years after diagnosis, the pool of prevalent cases is smaller than for other diseases. A second difficulty is the logistics of recruiting cases. The number of hospitals or clinics with advanced neurological units able to diagnose and treat ALS accurately is limited. This limitation can lead to inaccurate diagnosis, which is amplified by the heterogeneous nature of ALS62. Third, it is difficult to find funding to conduct whole genome sequencing on very large samples. Although the price of WGS is dropping63, the operation and material cost for one sample (standard 30X depth) is still more than 1000 USD (with no sign of dropping below this price since 2016), which is significantly more expensive than a GWAS genotyping chip (<100 USD/sample)64,65. Even though shallow-depth WGS (1.5X to 4X, cost ~300 USD) is available, the shallow sequencing depth limits coverage, the detection of structural variants, and the ability to call heterozygous rare variants accurately. This is not ideal for ALS genetic analysis because structural and rare heterozygous variants could be particularly relevant. Therefore, we will most likely need to use standard-depth (30X) WGS for ALS. Although increasing the sample size with whole genome sequencing seems to be the most obvious solution to the current ALS GWAS problem, data on this scale will not be available in the immediate future.
Despite the current problems in funding and logistics, Project MinE is an ALS whole genome sequencing study that has been ongoing since 2013 with crowd-funding support66. To date, it has successfully collected ~10,600 samples with whole genome sequencing data (N case= 6,245), almost half of its targeted sample size of 22,500 individuals. Even though the complete Project MinE dataset might not be finished in the immediate future, the project has shown great promise for the analysis of ALS rare variants and for investigation of the genetic architecture of ALS. Currently, Project MinE focuses only on cohorts of European descent. This is a good place to start, since Europe already has the infrastructure to support these efforts. Therefore, I recommend supporting Project MinE and preparing to analyse these data in the future. If this effort is successful, extension to other ancestries will naturally become the next frontier of ALS genetics.

However, even if Project MinE achieves its targeted sample size (22.5K individuals), it is likely that the study will still be underpowered to detect associations with a significant number of rare variants. According to a systematic review and simulation by Lee et al. (2014), even with a study cohort of 1,000,000 individuals, the estimated proportion of variance due to rare variants (minor allele frequency < 0.001) will remain underestimated67. This is especially true if the association testing relies on the single-variant test under an additive genetic model that is widely implemented in standard GWAS67,68. To address this problem, aggregation tests that evaluate the cumulative effects of multiple genetic variants in a gene or region (burden tests)68 are needed. Many burden test methods are available, each with its own strengths and weaknesses67–69. Therefore, the burden test can be chosen according to the ALS genetic architecture, which, as suggested in Chapter 3, is less polygenic than that of many other diseases studied.
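The idea behind a burden test can be illustrated with a simple collapsing scheme: each individual is scored as a carrier if they hold at least one rare allele in the gene, and carrier frequency is then compared between cases and controls with a one-sided Fisher's exact test. This is only one of the many aggregation approaches mentioned above, chosen here for brevity; the function names, the frequency cutoff default and the toy data are hypothetical.

```python
from math import comb


def carrier_status(genotypes, mafs, maf_cutoff=0.001):
    """Return 1 per individual carrying any allele at a variant with MAF < cutoff, else 0.

    genotypes: one list per individual with alternate-allele counts per variant.
    mafs: minor allele frequency of each variant, in the same order.
    """
    rare = [i for i, maf in enumerate(mafs) if maf < maf_cutoff]
    return [1 if any(person[i] > 0 for i in rare) else 0 for person in genotypes]


def fisher_exact_greater(case_carriers, case_non, control_carriers, control_non):
    """One-sided Fisher's exact P-value for an excess of carriers among cases."""
    n_cases = case_carriers + case_non
    n_carriers = case_carriers + control_carriers
    n_total = n_cases + control_carriers + control_non
    upper = min(n_cases, n_carriers)
    # Hypergeometric tail: probability of at least `case_carriers` carriers among cases.
    tail = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / comb(n_total, n_carriers)


# Toy gene with three variants; only the first and third are rare.
mafs = [0.0005, 0.2, 0.0002]
cases = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 2, 0]]
controls = [[0, 1, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]

a = sum(carrier_status(cases, mafs))
c = sum(carrier_status(controls, mafs))
p = fisher_exact_greater(a, len(cases) - a, c, len(controls) - c)
print(f"case carriers = {a}, control carriers = {c}, one-sided P = {p:.3f}")
```

Collapsing variants in this way gains power when most rare variants in a gene act in the same direction, and loses power when risk and protective variants are mixed, which is why the choice of test should follow the assumed genetic architecture.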

5.4.2 Single-cell expression sequencing data to pin-point ALS pathology

Recent advances in sequencing technology have led to the development of single-cell RNA sequencing (scRNA-seq)70. Standard RNA-seq, which has been around for a decade, has low-resolution cell isolation and limited precision of RNA measurement, and tends to give an average expression across the whole tissue without considering variation between cells70–72. These limitations contribute to our limited understanding of cell-specific molecular profiles and their implications for the whole cellular system, and to a lack of statistical power (reflected in high standard errors of estimates), causing underestimation of disease and phenotypic heterogeneity70–72. Compared to standard RNA-seq, scRNA-seq can precisely isolate a single cell and obtain its expression profile for comparison with other cells within or between tissues70,71. In this thesis, I found that the ALS genetic signal is highly enriched in genes that are expressed in various brain tissues and in dendritic cells. This result suggests that ALS might involve multiple tissues, which could lead to heterogeneity in disease manifestation. I propose that the heterogeneous nature of ALS can be analysed using scRNA-seq with a bottom-up approach. This could begin with analysis of motor neuron tissue from a single patient. Assuming that at the beginning of ALS onset a patient has a mixture of healthy and dying motor neurons, these two states of neuron can be isolated and their expression profiles compared. By analysing expression profiles at single-cell precision, we can infer the difference in biological activity between healthy and dying neurons while controlling for within-individual genetic and environmental background. Then, we could compare the expression profiles of dendritic cells and neuronal cells over the progression of the disease. After applying this paradigm to single patients, we can extend this differential analysis to the population level by comparing expression profiles within an ALS group or a group of healthy individuals.

5.4.3 Application of Induced Pluripotent Stem Cell (IPSC)

This thesis highlights the complex nature of ALS, which involves many genes and their combination with environmental risk factors. One of the major problems in translating the findings of ALS genetic studies into clinical practice is the limited ability of animal models to simulate the effects of ALS-associated gene mutations. Most in-vivo testing of ALS genes has used yeast, worm, fly, zebrafish, and mouse models with single-gene mutations (SOD1, C9orf72, TDP43, and FUS)73. However, while this approach can work for highly penetrant single-gene mutations in familial ALS research, it is unable to simulate the complex trait nature of sporadic ALS74. Mouse models with single-gene mutations are also quite likely to underestimate the complexity of familial ALS, since the highly penetrant familial ALS genes might also have complex interactions with many other genes along their biological pathways (see Chapter 1). Therefore, I support the application of Induced Pluripotent Stem Cells (IPSC) to test ALS-associated genes. Unlike mouse models that are bred with a certain gene mutation, IPSC can be cultured from actual ALS patients, which simulates the real ALS cell biology in neurons. This approach avoids the biased assumption that only one gene is causal for ALS75. Moreover, an IPSC-based study design also allows observation of the activity of specific organelles (such as the ER or cytoskeleton)76–79, which is important for following up the SMR results (Chapter 3 and discussed in part 5.3) and for understanding ER stress and cytoskeleton defects in ALS. Despite the criticism that using IPSC to study ALS can underestimate the complex relationship of neurons with other cells (e.g. immune cells)80, the new detailed insight into ALS motor neuron dynamics offered by IPSC is still important to complement the currently available animal models.

5.4.4 Longitudinal gut microbiome data from various locations of the gastrointestinal tract

The main limitations of our ALS microbiome study were the limited precision of 16S rRNA sequencing for identifying bacterial composition at the species level and the availability of only a single cross-sectional sample per person. The difference between cases and controls lies in the rate of change of microbiome richness and composition81–84. The microbiome study reported in Chapter 4 also supports this notion by suggesting that the gut microbiome might be more relevant to ALS disease progression. Therefore, longitudinal microbiome sampling of each patient in the cohort is needed to investigate the dynamics of disease progression. Moreover, we suggest that microbiome samples should be taken from various sites along the gastrointestinal tract. Faecal samples represent only the gut microbiome composition at the end of the gastrointestinal tract (colon and rectum) and do not really reflect the microbiome composition of the small intestine and duodenum85,86, where most of the nutrient processing and immune-related reactions in the gut happen87,88. Since the ALS gut-brain axis might be modulated by the immune system, our gut microbiome study (Chapter 4) might have been looking at microbiome diversity in the wrong part of the gastrointestinal tract by using faecal samples. This suggests that follow-up studies should also examine longitudinal gut microbiota composition in different parts of the gastrointestinal tract. However, the logistics of recruiting ALS participants to undergo the colonoscopy needed to obtain microbiome samples from the gastrointestinal tract are likely to be prohibitive.

5.5 Conclusions

This thesis has four main findings that contribute to the understanding of ALS aetiology. First, we confirm the genetic architecture of ALS: relatively less polygenic and with a relatively higher contribution of rare variants compared to other common diseases. These results guide experimental design towards whole-genome sequencing association studies. Second, the list of ALS causal genes in the relevant cell types (brain tissues) not only adds insight into ALS aetiology but also suggests that future ALS research should focus on modelling in brain tissues, with emphasis on ER stress and cytoskeleton defect mechanisms. Third, the genetic correlation analysis confirms that ALS is negatively correlated with educational attainment and cognitive performance. This finding confirmed the negative direction of the association, which was previously inconclusive in many studies of education, IQ and the risk of ALS. Deeper investigation of these genetic correlations also potentially improves the understanding of the shared biological mechanisms between ALS and CP, in relation to the ALS-FTD spectrum. Last, our gut microbiome results suggested that the gut microbiota is more relevant to disease progression than to disease development. Moreover, we demonstrate the importance of rich clinical data in supporting microbiome studies.

5.6 References

1. Trabjerg, B. B. et al. ALS in Danish Registries: Heritability and links to psychiatric and cardiovascular disorders. Neurol Genet 6, e398 (2020).
2. Belbasis, L., Bellou, V. & Evangelou, E. Environmental Risk Factors and Amyotrophic Lateral Sclerosis: An Umbrella Review and Critical Assessment of Current Evidence from Systematic Reviews and Meta-Analyses of Observational Studies. NED 46, 96–105 (2016).
3. van Dongen, J., Willemsen, G., Chen, W.-M., de Geus, E. J. C. & Boomsma, D. I. Heritability of metabolic syndrome traits in a large population-based sample. J Lipid Res 54, 2914–2923 (2013).
4. Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics 50, 1112 (2018).
5. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
6. Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet 381, 1371–1379 (2013).
7. Duncan, L. et al. Significant Locus and Metabolic Genetic Correlations Revealed in Genome-Wide Association Study of Anorexia Nervosa. Am J Psychiatry appiajp201716121402 (2017) doi:10.1176/appi.ajp.2017.16121402.
8. Harwood, C. A. et al. Long-term physical activity: an exogenous risk factor for sporadic amyotrophic lateral sclerosis? Amyotroph Lateral Scler Frontotemporal Degener 17, 377–384 (2016).
9. Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272–279 (2017).
10. McLaughlin, R. L. et al. Genetic correlation between amyotrophic lateral sclerosis and schizophrenia. Nat Commun 8, (2017).
11. Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum Mol Genet 27, 3641–3649 (2018).
12. Xue, A. et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun 9, 1–14 (2018).
13. Guerreiro, R. et al. Genome-wide analysis of genetic correlation in dementia with Lewy bodies, Parkinson’s and Alzheimer’s diseases. Neurobiology of Aging 38, 214.e7–214.e10 (2016).
14. Smeland, O. B. & Andreassen, O. A. How can genetics help understand the relationship between cognitive dysfunction and schizophrenia? Scandinavian Journal of Psychology 59, 26–31 (2018).
15. Anderson, E. L. et al. Education, intelligence and Alzheimer’s disease: evidence from a multivariable two-sample Mendelian randomization study. International Journal of Epidemiology 49, 1163–1172 (2020).
16. Dumitrescu, L. et al. Genetic variants and functional pathways associated with resilience to Alzheimer’s disease. Brain 143, 2561–2575 (2020).
17. Kotagal, V. et al. Educational Attainment and Motor Burden in Parkinson disease. Mov Disord 30, 1143–1147 (2015).
18. Smeland, O. B. et al. Genome-wide Association Analysis of Parkinson’s Disease and Schizophrenia Reveals Shared Genetic Architecture and Identifies Novel Risk Loci. Biological Psychiatry 89, 227–235 (2021).
19. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nature Genetics 47, 1236–1241 (2015).
20. Nicolas, A. et al. Genome-wide Analyses Identify KIF5A as a Novel ALS Gene. Neuron 97, 1268-1283.e6 (2018).
21. Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet 45, 400–405 (2013).
22. Jostins, L. & Barrett, J. C. Genetic risk prediction in complex disease. Hum. Mol. Genet. 20, R182–R188 (2011).
23. Wray, N. R. & Goddard, M. E. Multi-locus models of genetic risk of disease. Genome Med 2, 10 (2010).
24. Rehbach, K. et al. Publicly available hiPSC lines with extreme polygenic risk scores for modeling schizophrenia. 
bioRxiv 2020.07.04.185348 (2020) doi:10.1101/2020.07.04.185348. 25. Rheenen, W. van, Peyrot, W. J., Schork, A. J., Lee, S. H. & Wray, N. R. Genetic correlations of polygenic disease traits: from theory to practice. Nat Rev Genet 20, 567– 581 (2019).

196 26. van Rheenen, W. et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nat Genet 48, 1043–1048 (2016). 27. Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. bioRxiv 522961 (2019) doi:10.1101/522961. 28. Zeng, J. et al. Bayesian analysis of GWAS summary data reveals differential signatures of natural selection across human complex traits and functional genomic categories. bioRxiv 752527 (2019) doi:10.1101/752527. 29. Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics 50, 621–629 (2018). 30. Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet 48, 481–487 (2016). 31. Qi, T. et al. Identifying gene targets for brain-related traits using transcriptomic and methylomic data from blood. Nat Commun 9, (2018). 32. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL metaanalysis. bioRxiv 447367 (2018) doi:10.1101/447367. 33. Dhodapkar, M., Mackall, C. L. & Steinman, R. M. Dendritic Cells and Adaptive Immunity. in Williams Hematology (eds. Kaushansky, K. et al.) (McGraw-Hill Education, 2015). 34. De Laere, M., Berneman, Z. N. & Cools, N. To the Brain and Back: Migratory Paths of Dendritic Cells in Multiple Sclerosis. J Neuropathol Exp Neurol 77, 178–192 (2018). 35. Bossù, P., Spalletta, G., Caltagirone, C. & Ciaramella, A. Myeloid Dendritic Cells are Potential Players in Human Neurodegenerative Diseases. Front. Immunol. 6, (2015). 36. Quaratino, S., Duddy, L. P. & Londei, M. Fully competent dendritic cells as inducers of T cell anergy in autoimmunity. PNAS 97, 10911–10916 (2000). 37. Baumgart, D. C. et al. Patients with active inflammatory bowel disease lack immature peripheral blood plasmacytoid and myeloid dendritic cells. 
Gut 54, 228–236 (2005). 38. Baumgart, D. C. & Carding, S. R. Inflammatory bowel disease: cause and immunobiology. Lancet 369, 1627–1640 (2007). 39. Farg, M. A. et al. C9ORF72, implicated in amytrophic lateral sclerosis and frontotemporal dementia, regulates endosomal trafficking. Hum. Mol. Genet. (2017) doi:10.1093/hmg/ddx309. 40. Taylor, J. P., Brown Jr, R. H. & Cleveland, D. W. Decoding ALS: from genes to mechanism. Nature 539, 197–206 (2016).

197 41. Jiang, J. et al. Gain of Toxicity from ALS/FTD-Linked Repeat Expansions in C9ORF72 Is Alleviated by Antisense Oligonucleotides Targeting GGGGCC-Containing RNAs. Neuron 90, 535–550 (2016). 42. Atanasio, A. et al. C9orf72 ablation causes immune dysregulation characterized by leukocyte expansion, autoantibody production and glomerulonephropathy in mice. Scientific Reports 6, 23204 (2016). 43. Oakes, J. A., Davies, M. C. & Collins, M. O. TBK1: a new player in ALS linking autophagy and neuroinflammation. Molecular Brain 10, 5 (2017). 44. Ahmad, L., Zhang, S.-Y., Casanova, J.-L. & Sancho-Shimizu, V. Human TBK1: A Gatekeeper of Neuroinflammation. Trends in Molecular Medicine 22, 511–527 (2016). 45. Cirulli, E. T. et al. Exome sequencing in amyotrophic lateral sclerosis identifies risk genes and pathways. Science 347, 1436–1441 (2015). 46. Ioannides, Z. A., Ngo, S. T., Henderson, R. D., McCombe, P. A. & Steyn, F. J. Altered Metabolic Homeostasis in Amyotrophic Lateral Sclerosis: Mechanisms of Energy Imbalance and Contribution to Disease Progression. Neurodegener Dis 16, 382–397 (2016). 47. Steyn, F. J. et al. Hypermetabolism in ALS is associated with greater functional decline and shorter survival. J. Neurol. Neurosurg. Psychiatry 89, 1016–1023 (2018). 48. Ngo, S. T. & Steyn, F. J. The interplay between metabolic homeostasis and neurodegeneration: insights into the neurometabolic nature of amyotrophic lateral sclerosis. Cell Regeneration 4, 5 (2015). 49. Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. The American Journal of Human Genetics 101, 5–22 (2017). 50. Young, A. I. Solving the missing heritability problem. PLOS Genetics 15, e1008222 (2019). 51. O’Donovan, M. C. et al. Identification of loci associated with schizophrenia by genome- wide association and follow-up. Nat Genet 40, 1053–1055 (2008). 52. International Schizophrenia Consortium et al. 
Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009). 53. Pardiñas, A. F. et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat Genet 50, 381–389 (2018). 54. Liu, X. et al. Functional Architectures of Local and Distal Regulation of Gene Expression in Multiple Human Tissues. Am. J. Hum. Genet. 100, 605–616 (2017).

198 55. Rothschild, D. et al. An atlas of robust microbiome associations with phenotypic traits based on large-scale cohorts from two continents. bioRxiv 2020.05.28.122325 (2020) doi:10.1101/2020.05.28.122325. 56. Avinun, R. Educational Attainment Polygenic Score is Associated with Depressive Symptoms via Socioeconomic Status: A Gene-Environment-Trait Correlation. bioRxiv 727552 (2020) doi:10.1101/727552. 57. Schultz, H., Sommer, T. & Peters, J. The Role of the Human Entorhinal Cortex in a Representational Account of Memory. Front Hum Neurosci 9, (2015). 58. Ringholz, G. M. et al. Prevalence and patterns of cognitive impairment in sporadic ALS. Neurology 65, 586–590 (2005). 59. Alonso, A., Logroscino, G., Jick, S. S. & Hernán, M. A. Incidence and lifetime risk of motor neuron disease in the United Kingdom: a population-based study. Eur J Neurol 16, 745–751 (2009). 60. van Es, M. A. et al. Amyotrophic lateral sclerosis. The Lancet 390, 2084–2098 (2017). 61. Longinetti, E. & Fang, F. Epidemiology of amyotrophic lateral sclerosis: an update of recent literature. Curr Opin Neurol 32, 771–776 (2019). 62. Chiò, A. et al. Prognostic factors in ALS: A critical review. Amyotrophic Lateral Sclerosis 10, 310–323 (2009). 63. Muir, P. et al. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology 17, 53 (2016). 64. DNA Sequencing Costs: Data. Genome.gov https://www.genome.gov/about- genomics/fact-sheets/DNA-Sequencing-Costs-Data. 65. Christensen, K. D., Phillips, K. A., Green, R. C. & Dukhovny, D. Cost Analyses of Genomic Sequencing: Lessons Learned from the MedSeq Project. Value Health 21, 1054–1061 (2018). 66. Project MinE: study design and pilot analyses of a large-scale whole-genome sequencing study in amyotrophic lateral sclerosis. European Journal of Human Genetics 1 (2018) doi:10.1038/s41431-018-0177-4. 67. Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-Variant Association Analysis: Study Designs and Statistical Tests. 
Am J Hum Genet 95, 5–23 (2014). 68. Povysil, G. et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nature Reviews Genetics 20, 747–759 (2019).

199 69. Zhang, X., Basile, A. O., Pendergrass, S. A. & Ritchie, M. D. Real world scenarios in rare variant association analysis: the impact of imbalance and sample size on the power in silico. BMC Bioinformatics 20, 46 (2019). 70. Gawel, D. R. et al. A validated single-cell-based strategy to identify diagnostic and therapeutic targets in complex diseases. Genome Medicine 11, 47 (2019). 71. Zeng, T. & Dai, H. Single-Cell RNA Sequencing-Based Computational Analysis to Describe Disease Heterogeneity. Front. Genet. 10, (2019). 72. Rodger, E. J. Single-cell RNA Sequencing to Investigate Human Disease. Journal of RNA and Genomics 14, (2018). 73. Morrice, J. R., Gregory-Evans, C. Y. & Shaw, C. A. Animal models of amyotrophic lateral sclerosis: a comparison of model validity. Neural Regen Res 13, 2050–2054 (2018). 74. Damme, P. V., Robberecht, W. & Bosch, L. V. D. Modelling amyotrophic lateral sclerosis: progress and possibilities. Disease Models & Mechanisms 10, 537–549 (2017). 75. Hawrot, J., Imhof, S. & Wainger, B. J. Modeling cell-autonomous motor neuron phenotypes in ALS using iPSCs. Neurobiology of Disease 134, 104680 (2020). 76. Simic, M. S. et al. Transient activation of the UPRER is an essential step in the acquisition of pluripotency during reprogramming. Science Advances 5, eaaw0025 (2019). 77. Ke, M. et al. Azoramide protects iPSC-derived dopaminergic neurons with PLA2G6 D331Y mutation through restoring ER function and CREB signaling. Cell Death & Disease 11, 1–14 (2020). 78. Boraas, L. C., Guidry, J. B., Pineda, E. T. & Ahsan, T. Cytoskeletal Expression and Remodeling in Pluripotent Stem Cells. PLoS One 11, (2016). 79. Griesi-Oliveira, K. et al. Actin cytoskeleton dynamics in stem cells from autistic individuals. Scientific Reports 8, 11138 (2018). 80. Bohl, D., Pochet, R., Mitrecic, D. & Nicaise, C. Modelling and treating amyotrophic lateral sclerosis through induced-pluripotent stem cells technology. Curr Stem Cell Res Ther 11, 301–312 (2016). 81. 
Fukuyama, J. et al. Multidomain analyses of a longitudinal human microbiome intestinal cleanout perturbation experiment. PLOS Computational Biology 13, e1005706 (2017).

200 82. Lugo-Martinez, J., Ruiz-Perez, D., Narasimhan, G. & Bar-Joseph, Z. Dynamic interaction network inference from longitudinal microbiome data. Microbiome 7, 54 (2019). 83. Usyk, M. et al. Cervicovaginal microbiome and natural history of HPV in a longitudinal study. PLOS Pathogens 16, e1008376 (2020). 84. Yang, X., Qian, Y., Xu, S., Song, Y. & Xiao, Q. Longitudinal Analysis of Fecal Microbiome and Pathologic Processes in a Rotenone Induced Mice Model of Parkinson’s Disease. Front. Aging Neurosci. 9, (2018). 85. W, T. et al. Characterizing the microbiota in gastrointestinal tract segments of subminiatus: Dynamic changes and functional predictions. Microbiologyopen e789–e789 (2019) doi:10.1002/mbo3.789. 86. Schmidt, T. S. et al. Extensive transmission of microbes along the gastrointestinal tract. eLife 8, e42693 (2019). 87. Dieterich, W., Schink, M. & Zopf, Y. Microbiota in the Gastrointestinal Tract. Med Sci (Basel) 6, (2018). 88. Lavelle, A. & Sokol, H. Gut microbiota: Beyond metagenomics, metatranscriptomics illuminates microbiome functionality in IBD. Nature Reviews Gastroenterology & Hepatology 15, 193–194 (2018).


--End of Thesis--
