medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
1 A global atlas of genetic associations of 220 deep phenotypes
2
3 Saori Sakaue1-5, 46, *, Masahiro Kanai1, 5-9, 46, Yosuke Tanigawa10, Juha Karjalainen5-7,9, Mitja 4 Kurki5-7,9, Seizo Koshiba11,12, Akira Narita11, Takahiro Konuma1, Kenichi Yamamoto1,13, 5 Masato Akiyama2,14, Kazuyoshi Ishigaki2-5, Akari Suzuki15, Ken Suzuki1, Wataru Obara16, 6 Ken Yamaji17, Kazuhisa Takahashi18, Satoshi Asai19,20, Yasuo Takahashi21, Takao 7 Suzuki22, Nobuaki Shinozaki22, Hiroki Yamaguchi23, Shiro Minami24, Shigeo Murayama25, 8 Kozo Yoshimori26, Satoshi Nagayama27, Daisuke Obata28, Masahiko Higashiyama29, 9 Akihide Masumoto30, Yukihiro Koretsune31, FinnGen, Kaoru Ito32, Chikashi Terao2, 10 Toshimasa Yamauchi33, Issei Komuro34, Takashi Kadowaki33, Gen Tamiya11,12,35,36, 11 Masayuki Yamamoto11,12,35, Yusuke Nakamura37,38, Michiaki Kubo39, Yoshinori Murakami40, 12 Kazuhiko Yamamoto15, Yoichiro Kamatani2,41, Aarno Palotie5,9,42, Manuel A. Rivas10, Mark J. 13 Daly5-7,9, Koichi Matsuda43, *, Yukinori Okada1,2,41,44,45, * 14 15 1. Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, 16 Japan. 17 2. Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative 18 Medical Sciences, Yokohama, Japan 19 3. Center for Data Sciences, Harvard Medical School, Boston, MA, USA 20 4. Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and 21 Women’s Hospital, Harvard Medical School, Boston, MA, USA 22 5. Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 23 Cambridge, MA, USA 24 6. Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, 25 USA 26 7. Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, 27 MA, USA 28 8. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA 29 9. Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland 30 10. Department of Biomedical Data Science, School of Medicine, Stanford University, 31 Stanford, CA, USA 32 11. Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan 33 12. The Advanced Research Center for Innovations in Next-Generation Medicine (INGEM), 34 Sendai, Japan 1
NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice. medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
35 13. Department of Pediatrics, Osaka University Graduate School of Medicine, Suita, Japan. 36 14. Department of Ophthalmology, Graduate School of Medical Sciences, Kyushu University, 37 Fukuoka, Japan 38 15. Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, 39 Yokohama, Japan 40 16. Department of Urology, Iwate Medical University, Iwate, Japan 41 17. Department of Internal Medicine and Rheumatology, Juntendo University Graduate 42 School of Medicine, Tokyo, Japan 43 18. Department of Respiratory Medicine, Juntendo University Graduate School of Medicine, 44 Tokyo, Japan 45 19. Division of Pharmacology, Department of Biomedical Science, Nihon University School 46 of Medicine, Tokyo, Japan 47 20. Division of Genomic Epidemiology and Clinical Trials, Clinical Trials Research Center, 48 Nihon University School of Medicine, Tokyo, Japan 49 21. Division of Genomic Epidemiology and Clinical Trials, Clinical Trials Research Center, 50 Nihon University School of Medicine, Tokyo, Japan 51 22. Tokushukai Group, Tokyo, Japan 52 23. Department of Hematology, Nippon Medical School, Tokyo, Japan 53 24. Department of Bioregulation, Nippon Medical School, Kawasaki, Japan 54 25. Tokyo Metropolitan Geriatric Hospital and Institute of Gerontology, Tokyo, Japan 55 26. Fukujuji Hospital, Japan Anti-Tuberculosis Association, Tokyo, Japan 56 27. The Cancer Institute Hospital of the Japanese Foundation for Cancer Research, Tokyo, 57 Japan 58 28. Center for Clinical Research and Advanced Medicine, Shiga University of Medical 59 Science, Otsu, Japan 60 29. Department of General Thoracic Surgery, Osaka International Cancer Institute, Osaka, 61 Japan 62 30. Aso Iizuka Hospital, Fukuoka, Japan 63 31. National Hospital Organization Osaka National Hospital, Osaka, Japan 64 32. Laboratory for Cardiovascular Genomics and Informatics, RIKEN Center for Integrative 65 Medical Sciences, Yokohama, Japan 66 33. Department of Diabetes and Metabolic Diseases, Graduate School of Medicine, The 67 University of Tokyo, Tokyo, Japan 68 34. Department of Cardiovascular Medicine, Graduate School of Medicine, The University of 69 Tokyo, Tokyo, Japan 2 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
70 35. Graduate School of Medicine, Tohoku University, Sendai, Japan 71 36. Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan 72 37. Human Genome Center, Institute of Medical Science, The University of Tokyo, Tokyo, 73 Japan 74 38. Cancer Precision Medicine Center, Japanese Foundation for Cancer Research, Tokyo, 75 Japan 76 39. RIKEN Center for Integrative Medical Sciences, Yokohama, Japan 77 40. Division of Molecular Pathology, Institute of Medical Science, The University of Tokyo, 78 Tokyo, Japan 79 41. Laboratory of Complex Trait Genomics, Department of Computational Biology and 80 Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, 81 Japan 82 42. Psychiatric & Neurodevelopmental Genetics Unit, Department of Psychiatry, Analytic 83 and Translational Genetics Unit, Department of Medicine, and the Department of Neurology, 84 Massachusetts General Hospital, Boston, MA, USA 85 43. Department of Computational Biology and Medical Sciences, Graduate school of 86 Frontier Sciences, the University of Tokyo, Tokyo, Japan 87 44. Laboratory of Statistical Immunology, Immunology Frontier Research Center 88 (WPI-IFReC), Osaka University, Suita, Japan 89 45. Integrated Frontier Research for Medical Science Division, Institute for Open and 90 Transdisciplinary Research Initiatives, Osaka University, Suita, Japan 91 46. These authors contributed equally: S Sakaue and M Kanai.
92 * Corresponding authors Saori Sakaue, M.D., Ph.D. Koichi Matsuda, M.D., Ph.D. Yukinori Okada, M.D., Ph.D. [email protected] [email protected] [email protected] Center for Data Sciences, Department of Computational Department of Statistical Harvard Medical School Biology and Medical Sciences, Genetics, Osaka University Graduate school of Frontier Graduate School of Medicine Sciences, The University of Tokyo 93
3 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
94 Abstract
95 The current genome-wide association studies (GWASs) do not yet capture sufficient
96 diversity in terms of populations and scope of phenotypes. To address an essential need to
97 expand an atlas of genetic associations in non-European populations, we conducted 220
98 deep-phenotype GWASs (disease endpoints, biomarkers, and medication usage) in
99 BioBank Japan (n = 179,000), by incorporating past medical history and text-mining results
100 of electronic medical records. Meta-analyses with the harmonized phenotypes in the UK
101 Biobank and FinnGen (ntotal = 628,000) identified over 4,000 novel loci, which substantially
102 deepened the resolution of the genomic map of human traits, benefited from East Asian
103 endemic diseases and East Asian specific variants. This atlas elucidated the globally shared
104 landscape of pleiotropy as represented by the MHC locus, where we conducted
105 fine-mapping by HLA imputation. Finally, to intensify the value of deep-phenotype GWASs,
106 we performed statistical decomposition of matrices of phenome-wide summary statistics,
107 and identified the latent genetic components, which pinpointed the responsible variants and
108 shared biological mechanisms underlying current disease classifications across populations.
109 The decomposed components enabled genetically informed subtyping of similar diseases
110 (e.g., allergic diseases). Our study suggests a potential avenue for hypothesis-free
111 re-investigation of human disease classifications through genetics.
112
4 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
113 Main
114 Introduction
115 Medical diagnosis has been shaped through the description of organ dysfunctions and
116 extraction of shared key symptoms, which categorizes a group of individuals into a specific
117 disease to provide an optimal treatment. The earliest physicians in ancient Egypt empirically
118 made disease diagnoses based on clinical symptoms, palpitation, and auscultation (~2600
119 BC)1. Since then, continuous efforts by physicians have sophisticated the disease
120 classifications through empirical categorization. An increased understanding of organ
121 functions and the availability of diagnostic tests including biomarkers and imaging
122 techniques have further contributed to the current disease classifications, such as ICD102
123 and phecode3.
124 In the past decades, genome-wide association studies (GWASs) have provided new
125 insights into the biological basis underlying disease diagnoses. The genetic underpinnings
126 enable us to re-interrogate the validity of historically- defined disease classifications. To this
127 end, a comprehensive catalog of disease genetics is warranted. However, current genetic
128 studies still lack the comprehensiveness in three ways; (i) population, in that the vast
129 majority of GWASs have been predominated by European populations4, (ii) scope of
130 phenotypes, which have been limited to target diseases of a sampling cohort, and (iii) a
131 systematic method to interpret a plethora of summary results for understanding disease
132 pathogenesis and epidemiology. We thus need to promote equity in genetic studies by
133 sharing the results of genetic studies of deep phenotypes from diverse populations.
134 To expand the atlas of genetic associations, here we conducted 220 deep-phenotype
135 GWASs in BioBank Japan project (BBJ), including 108 novel phenotypes in East Asian
5 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
136 populations. We then conducted GWASs for corresponding harmonized phenotypes in UK
137 Biobank (UKB) and FinnGen, and finally performed trans-ethnic meta-analyses (ntotal =
138 628,000). The association results elucidated trans-ethnically shared landscape of the
139 pleiotropy and genetic correlations across diseases. Furthermore, we applied DeGAs5 to
140 perform truncated singular-value decomposition (TSVD) on the matrix of GWAS summary
141 statistics of 159 diseases each in Japanese and European ancestries, and derived latent
142 components shared across the diseases. We interpreted the derived components by (i)
143 functional annotation of the genetic variants explaining the component, (ii) identification of
144 important cell types in which the genes contributing to the component are specifically
145 regulated, and (iii) projection of GWASs of biomarkers or metabolomes into the component
146 space. The latent components recapitulated the hierarchy of current disease classifications,
147 while different diseases sometimes converged on the same component which implicated the
148 shared biological pathway and relevant tissues. We also classified a group of similar
149 diseases (e.g., allergic diseases) into subgroups based on these components. Analogous to
150 the conventional hierarchical classification of diseases based on the shared symptoms, an
151 atlas of genetic studies resolved the shared latent structure behind human diseases, which
152 elucidated the genetic variants, genes, organs, and biological functions underlying human
153 diseases.
154
6 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
155 Results
156 GWAS of 220 traits in BBJ and trans-ethnic meta-analysis
157 Overview of this study is presented in Extended Data Figure 1. BBJ is a nationwide
158 biobank in Japan, and recruited participants based on the diagnosis of at least one of 47
159 target diseases (Supplementary Note)6. Along with the target disease status, deep
160 phenotype data, such as past medical history (PMH), drug prescription records (~ 7 million),
161 text data retrieved from electronic medical records (EMR), and biomarkers, have been
162 collected. Beyond the collection of case samples based on the pre-determined target
163 diseases, the PMH and EMR have provided broader insights into disease genetics, as
164 shown in recently launched biobanks such as UKB7 and BioVU8. In this context, we curated
165 the PMH, performed text-mining of the EMR, and merged them with 47 target disease
166 status.9 We created individual-level phenotype on 159 disease endpoints (38 target
167 diseases with median 1.25 times increase in case samples and 121 novel disease
168 endpoints) and 23 categories of medication usage. We then systematically mapped the
169 disease endpoints into phecode and ICD10, to enable harmonized GWASs in UKB and
170 FinnGen. We also analyzed a quantitative phenotype of 38 biomarkers in BBJ, of which
171 individual phenotype data are available in UKB10. Using genotypes imputed with the 1000
172 Genome Project phase 3 data (n = 2,504) and population-specific whole-genome
173 sequencing data (n = 1,037) as a reference panel11, we conducted the GWASs of 159 binary
174 disease endpoints, 38 biomarkers, and 23 medication usages in ~179,000 individuals in BBJ
175 (Figure 1a–c, Supplementary Table 1 and 2 for phenotype summary). To maximize the
176 statistical power, we used a linear mixed model implemented in SAIGE12 (for binary traits)
177 and BOLT13 (for quantitative traits). By using linkage disequilibrium (LD)-score regression14,
7 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
178 we confirmed that the confounding biases were controlled in the GWASs (Supplementary
179 Table 3). In this expanded scope of GWASs in the Japanese population, we identified 396
180 genome-wide significant loci across 159 disease endpoints, 1891 across 38 biomarkers, of
181 which 92 and 156 loci were novel, respectively (P < 1.0 × 10-8; see Methods,
182 Supplementary Table 4). We conducted the initial medication-usage GWASs in East Asian
183 populations, and detected 171 genome-wide significant loci across 23 traits (see Methods).
184 These signals underscore the value of (i) conducting GWASs in non-Europeans and (ii)
185 expanding scope of phenotypes by incorporating biobank resources such as PMH and EMR.
186 For example, we detected an East Asian-specific variant, rs140780894, at the MHC locus in
187 pulmonary tuberculosis (PTB; Odds Ratio [OR] = 1.2, P = 2.9 × 10-23, Minor Allele
188 Frequency[MAF]EAS = 0.24; Extended Data Figure 2), which was not present in European
15 189 population (Minor Allele Count [MAC]EUR = 0) . PTB is a serious global health burden and
190 relatively endemic in Japan16 (annual incidence per 100,000 was 14 in Japan whereas 8 in
191 the United Kingdom and 3 in the United States in 2018 [World Health Organization, Global
192 Tuberculosis Report]). Because PTB, an infectious disease, can be treatable and remittable,
193 we substantially increased the number of cases by combining the participants with PMH of
194 PTB to the patients with active PTB at the time of recruitment (from 5499 to 7,800 case
195 individuals). Similar to this example, we identified a novel signal at rs190894416 at 7p14.2
196 (OR = 16, P = 6.9×10-9; Extended Data Figure 3) in dysentery, which is bloody diarrhea
197 caused by infection with Shigella bacillus that was once endemic in Japan when poor
198 hygiene had been common17. We also identified novel signals in common diseases that
199 have not been target diseases but were included in the PMH record, such as rs715 at 3’UTR
8 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
200 of CPS1 in cholelithiasis (Extended Data Figure 4; OR = 0.87, P = 9.6×10-13) and
201 rs2976397 at the PSCA locus in gastric ulcer, gastric cancer , and gastric polyp (Extended
202 Data Figure 5; OR = 0.86, P = 6.1×10-24). We detected pleiotropic functionally impactful
203 variants, such as a deleterious missense variant, rs28362459 (p.Leu20Arg), in FUT3
204 associated with gallbladder polyp (OR = 1.46, P = 5.1×10-11) and cholelithiasis (OR = 1.11, P
205 = 7.3×10-9; Extended Data Figure 6), and a splice donor variant, rs56043070 (c.89+1G>A),
206 causing loss of function of GCSAML associated with urticaria (OR = 1.24, P = 6.9×10-12;
207 Extended Data Figure 7), which was previously reported to be associated with platelet and
208 reticulocyte counts18. Medication-usage GWASs also provided interesting signals as an
209 alternative perspective for understanding disease genetics19. For example, individuals taking
210 HMG CoA reductase inhibitors (C10AA in Anatomical Therapeutic Chemical Classification
211 [ATC]) were likely to harbor variations at HMGCR (lead variant at rs4704210, OR = 1.11, P =
212 2.0×10-27). Prescription of salicylic acids and derivatives (N02BA in ATC) were significantly
213 associated with a rare East Asian missense variant in PCSK9, rs151193009 (p.Arg93Cys;
-11 214 OR = 0.75, P = 7.1×10 , MAFEAS=0.0089, MAFEUR=0.000; Extended Data Figure 8), which
215 might indicate a strong protective effect against the thromboembolic diseases in general.
216 To confirm that the signals identified in BBJ were replicable, we conducted GWASs of
217 corresponding phenotypes (i.e., disease endpoints and biomarkers) in UKB and FinnGen,
218 and collected summary statistics of medication usage GWAS recently conducted in UKB19
219 (Supplementary Table 5). We then compared the effect sizes of the genome-wide
220 significant variants in BBJ with those in a European dataset across binary and quantitative
9 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
221 traits (see Methods). The loci identified in our GWASs were successfully replicated in the
222 same effect direction (1,830 out of 1,929 [94.9%], P < 10-325 in sign test) and with high
223 effect-size correlation (Extended Data Figure 9).
224 Motivated by the high replicability, we performed trans-ethnic meta-analyses of these
225 220 harmonized phenotypes across three biobanks (see Methods). We identified 1,362
226 disease-associated, 10,572 biomarker-associated, and 841 medication-associated loci in
227 total, of which 356, 3,576, and 236 were novel, respectively (Figure 1d, Supplementary
228 Table 6). All these summary results of GWASs are openly shared without any restrictions.
229 Together, we successfully expanded the genomic map of human complex traits in terms of
230 populations and scope of phenotypes through conducting deep-phenotype GWASs across
231 trans-ethnic nationwide biobanks.
232
10 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
233 234 Figure 1. Overview of the identified loci in the trans-ethnic meta-analyses of 220 deep 235 phenotype GWASs. 236 (a-c) The pie charts describe the phenotypes analyzed in this study. The disease endpoints
237 (a; ntrait = 159) were categorized based on the ICD10 classifications (A to Z; Supplementary
238 Table 1a), the biomarkers (b; ntrait = 38; Supplementary Table 1b) were classified into nine 239 categories, and medication usage was categorized based on the ATC system (A to S; 240 Supplementary Table 1c). (d) The genome-wide significant loci identified in the
11 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
241 trans-ethnic meta-analyses and pleiotropic loci (P < 1.0×10-8). The traits (rows) are sorted as
242 shown in the pie chart, and each dot represents significant loci in each trait. Pleiotropic loci 243 are annotated by lines with a locus symbol. 244
12 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
245 The regional landscape of pleiotropy.
246 Because human traits are highly polygenic and the observed variations within the human
247 genome are finite in number, pleiotropy, where a single variant affects multiple traits, is
248 pervasive20. While pleiotropy has been intensively studied in European populations by
249 compiling previous GWASs20,21, the landscape of pleiotropy in non-European populations
250 has remained elusive. By leveraging this opportunity for comparing the genetics of deep
251 phenotypes across populations, we sought to investigate the landscape of regional
252 pleiotropy in both Japanese and European populations. We defined the degree of pleiotropy
253 as the number of significant associations per variant (P < 1.0×10-8)21. In the Japanese,
254 rs11066015 harbored the largest number of genome-wide significant associations (45 traits;
255 Figure 2a), which was in tight LD with a missense variant at the ALDH2 locus, rs671.
256 Following this, rs117326768 at the MHC locus (23 traits) and rs1260326 at the GCKR locus
257 (18 traits) were most pleiotropic. In Europeans, rs3132941 at the MHC locus harbored the
258 largest number of genome-wide significant associations (46 traits; Figure 2b), followed by
259 rs4766578 at the ATXN2/SH2B3 locus (38 traits) and rs4665972 at the GCKR locus (28
260 traits). Notably, the ALDH2 locus (pleiotropic in Japanese) and the MHC locus (pleiotropic in
261 Japanese and Europeans) are known to be under recent positive selection22,23. To
262 systematically assess whether pleiotropic regions in the genome were likely to be under
263 selection pressure in each of the populations, we investigated the enrichment of the
264 signatures of recent positive selection quantified by the metric singleton density score
265 (SDS)22 values within the pleiotropic loci (see Methods). Intriguingly, when compared with
266 those under the null hypothesis, we observed significantly higher values of SDS χ2 values
267 within the pleiotropic loci, and this fold change increased as the number of associations
13 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
268 increased (i.e., more pleiotropic) in both Japanese and Europeans (Figure 2c and 2d). To
269 summarize, the trans-ethnic atlas of genetic associations elucidated the broadly shared
270 landscape of pleiotropy, which implied a potential connection to natural selection signatures
271 affecting human populations.
272
14 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
273
274 Figure 2. Number of significant associations per variant.
275 (a, b) The Manhattan-like plots show the number of significant associations (P < 1×10-8) at
276 each tested genetic variant for all traits (ntrait = 220) in Japanese (a) and in Europeanan
277 GWASs (b). Loci with a large number of associations were annotated based on the closestst
278 genes of each variant. (c, d) The plots indicate the fold change of the sum of SDS χ2 withinin
279 variants with a larger number of significant associations than a given number on the x-axisxis
280 compared with those under the null hypothesis in Japanese (c) and in Europeans (d). We
281 also illustrated a regression line based on local polynomial regression fitting.
282 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
283 Pleiotropic associations in HLA and ABO locus.
284 Given the strikingly high number of associations in both populations, we next sought to
285 fine-map the pleiotropic signals within the MHC locus. To this end, we imputed the classical
286 HLA alleles in BBJ and UKB, and performed association tests for 159 disease endpoints and
287 38 biomarkers (Figure 3a and 3b). After the fine-mapping and conditional analyses (see
288 Methods), we identified 94 and 153 independent association signals in BBJ and UKB,
289 respectively (the regional threshold of significance was set to P < 1.0×10-6; Supplementary
290 Table 7). Overall, HLA-B in class I and HLA-DRB1 in class II harbored the largest number of
291 associations in both BBJ and UKB. For example, we successfully fine-mapped the strong
292 signal associated with PTB to HLA-DRβ1 Ser57 (OR = 1.20, P = 7.1×10-19) in BBJ. This is
293 the third line of evidence showing the robust association of HLA with tuberculosis identified
294 to date24,25, and we initially fine-mapped the signal to HLA-DRB1. Interestingly, HLA-DRβ1
295 at position 57 also showed pleiotropic associations with other autoimmune and
296 thyroid-related diseases, such as Grave’s disease (GD), hyperthyroidism, Hashimoto’s
297 disease, hypothyroidism, Sjogren’s disease, chronic hepatitis B, and atopic dermatitis in BBJ.
298 Of note, the effect direction of the association of HLA-DRβ1 Ser57 was the same between
299 hyperthyroid status (OR = 1.29, P = 2.6×10-14 in GD and OR = 1.37, P = 1.4×10-8 in
300 hyperthyroidism) and hypothyroid status (OR = 1.50, P = 9.0×10-8 in Hashimoto’s disease
301 and OR = 1.31, P = 1.5×10-7 in hypothyroidism), despite the opposite direction of thyroid
302 hormone abnormality. This association of HLA-DRβ1 was also observed in Sjogren’s
303 syndrome (OR = 2.04, P = 7.9 × 10-12), which might underlie the epidemiological
16 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
304 comorbidities of these diseases26. Other novel associations in BBJ included HLA-DRβ1
305 Asn197 with sarcoidosis (OR = 2.07, P = 3.7×10-8), and four independent signals with
306 chronic sinusitis (i.e., HLA-DRA, HLA-B, HLA-A, and HLA-DQA1).
307 Another representative pleiotropic locus in the human genome is the ABO locus. We
308 performed ABO blood-type PheWAS in BBJ and UKB (Figure 3c and 3d). We estimated the
309 ABO blood type from three variants (rs8176747, rs8176746, and rs8176719 at 9q34.2)27,
310 and associated them with the risk of diseases and quantitative traits for each blood group. A
311 variety of phenotypes, including common diseases such as myocardial infarction as well as
312 biomarkers such as blood cell traits and lipids, were strongly associated with the blood types
313 in both biobanks (Supplementary Table 8). We also replicated an increased risk of gastric
314 cancer in blood-type A as well as an increased risk of gastric ulcer in blood-type O in BBJ28.
315
17 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
316 317 Figure 3. HLA and ABO association PheWAS.
318 (a,b) Significantly associated HLA genes identified by HLA PheWAS in BBJ (a) or in UKB (b)
319 are plotted. In addition to the top association signals of the phenotypes, independent
320 associations identified by conditional analysis are also plotted, and the primary association
321 signal is indicated by the plots with a gray border. The color of each plot indicates two-tailed
322 P values calculated with logistic regression (for binary traits) or linear regression (for
323 quantitative traits) as designated in the color bar at the bottom. The bars in green at the top
324 indicate the number of significant associations per gene in each of the populations. The
325 detailed allelic or amino acid position as well as statistics in the association are provided in
326 Supplementary Table 7.
18 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
327 (c,d) Significant associations identified by ABO blood-type PheWAS in BBJ (c) or in UKB (d)
328 are shown as boxes and colored based on the odds ratio. The size of each box indicates
329 two-tailed P values calculated with logistic regression (for binary traits) or linear regression
330 (for quantitative traits).
331
19 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
332 Genetic correlation elucidates the shared phenotypic domains across populations.
333 The interplay between polygenicity and pleiotropy suggests widespread genetic correlations
334 among complex human traits29. Genetic relationships among human diseases have
335 contributed to the refinement of disease classifications30 and elucidation of the biology
336 underlying the epidemiological comorbidity29. To obtain deeper insights into the
337 interconnections among human traits and compare them across populations, we computed
338 pairwise genetic correlations (rg) across 106 traits (in Japanese) and 148 traits (in
2 339 Europeans) with Z-score for h SNP > 2, using bivariate LD score regression (see Methods).
340 We then defined the correlated trait domains by greedily searching for the phenotype blocks
341 with pairwise rg > 0.7 within 70% of rg values in the block on the hierarchically clustered
342 matrix of pairwise rg values (Extended Data Figure 10). We detected domains of tightly
343 correlated phenotypes, such as (i) cardiovascular- acting medications, (ii) coronary artery
344 disease, (iii) type 2 diabetes- related phenotypes, (iv) allergy- related phenotypes, and (v)
345 blood-cell phenotypes in BBJ (Extended Data Figure 10a). These domains implicated the
346 shared genetic backgrounds on the similar diseases and their treatments (e.g., (ii) diseases
347 of the circulatory system in ICD10 and for coronary artery disease and their treatments) and
348 diagnostic biomarkers (e.g., (iii) glucose and HbA1c in type 2 diabetes). Intriguingly, the
349 corresponding trait domains were mostly identified in UKB as well (Extended Data Figure
350 10b). Thus, we confirmed that the current clinical boundaries for a spectrum of human
351 diseases broadly reflect the shared genetic etiology across populations, despite differences
352 in ethnicity and despite potential differences in diagnostic and prescription practices.
353
20 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
354 Deconvolution of a matrix of summary statistics of 159 diseases provides novel
355 insights into disease pathogenesis.
356 A major challenge in genetic correlation is that the rg is a scholar value between two traits,
357 which summarizes the averaged correlation over the whole genome into just one metric31.
358 This approach is not straightforward in specifying a set of genetic variants driving the
359 observed correlation, which should pinpoint biological pathways and dysfunctional organs
360 explaining the shared pathogenesis. To address this, gathering of the genetic association
361 statistics of hundreds of different phenotypes can dissect genotype-phenotype association
362 patterns without a prior hypothesis, and identify latent structures underlying a spectrum of
363 complex human traits. In particular, matrix decomposition on the summary statistics is a
364 promising approach5,32,33, which derives orthogonal components that explain association
365 variance across multiple traits while accounting for linear genetic architectures in general.
366 This decomposition can address two challenges in current genetic correlation studies. First,
367 it informs us of genetic variants that explain the shared structure across multiple diseases,
368 thereby enabling functional interpretation of the component. Second, it can highlight
369 sub-significant associations and less powered studies, which are important in understanding
370 the contribution of common variants in rare disease genetics with a small number of case
371 samples32 or in genetic studies in underrepresented populations where smaller statistical
372 power is inevitable.
373 Therefore, we applied DeGAs5 on a matrix of our disease GWAS summary statistics in
374 Japanese and the meta-analyzed statistics in Europeans (ndisease = 159; Figure 4a and 4b).
375 To interpret the derived latent components, we annotated the genetic variants explaining
376 each component (i) through GREAT genomic region ontology enrichment analysis34, (ii)
21 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
377 through identification of relevant cell types implicated from tissue specific regulatory DNA
378 (ENCODE335) and expression (GTEx36) profiles, and (iii) by projecting biomarker GWASs
379 and metabolome GWASs into the component space (nbiomarker=38, nmetabolite_EAS=206,
380 nmetabolite_EUR=248; Figure 4a). We applied TSVD on the sparse Z score matrix of 22,980
381 variants, 159 phenotypes each in 2 populations (Japanese and Europeans), and derived 40
382 components that together explained 36.7% of the variance in the input summary statistics
383 matrix (Extended Data Figure 11, 12).
384 Globally, hierarchically similar diseases as defined by the conventional ICD10
385 classification were explained by the same components, based on DeGAs trait squared
386 cosine scores that quantifies component loadings5 (Figure 4c, d). This would be considered
387 as a hypothesis-free support of the historically defined disease classification. For example,
388 component 1 explained the genetic association patterns of diabetes (E10 and E11 in ICD10)
389 and component 2 explained those of cardiac and vascular diseases (I00-I83), in both
390 populations. Functional annotation enrichment of the genetic variants explaining these
391 components by GREAT showed that component 1 (diabetes component) was associated
-19 392 with abnormal pancreas size (binomial Penrichment =7.7×10 ) as a human phenotype,
393 whereas component 2 (cardiovascular disease component) was associated with
-10 394 xanthelasma (i.e., cholesterol accumulation on the eyelids; binomial Penrichment =3.0×10 ).
395 Further, the genes comprising component 1 were enriched in genes specifically expressed
-4 396 in the pancreas (Penrichment =5.5×10 ), and those comprising component 2 were enriched in
-3 397 genes specifically expressed in the aorta (Penrichment =1.9×10 ; Extended Data Figure 13).
398 By projecting the biomarker and metabolite GWASs into this component space, we
22 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
399 observed that component 1 represented the genetics of glucose and HbA1c, and component
400 2 represented the genetics of blood pressure and lipids, all of which underscored the
401 biological relevance. Thus, this deconvolution-projection analysis elucidated the latent
402 genetic structure behind human diseases, which highlighted the underlying biological
403 functions, relevant tissues, and associated human phenotypes.
404 The latent components shared across diseases explained the common biology behind
405 etiologically similar diseases. For example, we identified that component 10 explained the
406 genetics of cholelithiasis (gall stone), cholecystitis (inflammation of gallbladder), and gall
407 bladder polyp (Figure 4e). The projection of European metabolite GWASs into the
408 component space identified that component 10 represented the metabolite GWAS in the
409 bilirubin metabolism pathway. Component 10 was composed of variants involved in
-10 410 intestinal cholesterol absorption in the mouse phenotype (binomial Penrichment =3.8×10 ).
411 This is biologically relevant, since increased absorption of intestinal cholesterol is a major
412 cause of cholelithiasis, which also causes cholecystitis37. This projection analysis was also
413 applicable to the Japanese metabolites GWASs, which showed the connection between the
414 component 1 (diabetes component) and arginine and glucose levels, and between the
415 component 10 (gallbladder disease component) and glycine, which conjugates with bile
416 acids38.
417 Some components could be further utilized to boost understanding of the underpowered
418 GWASs with the use of well-powered GWAS, and for identifying the contributor of shared
419 genetics between different diseases. For example, we complemented underpowered
420 varicose GWAS in BBJ (ncase = 474, genome-wide significant loci = 0) with higher-powered
421 GWAS in Europeans (ncase = 22,037, genome-wide significant loci = 54), since both GWASs
23 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
422 were mostly represented by component 11, which was explained by variants related to
-7 423 abnormal vascular development (binomial Penrichment = 4.2×10 ; Figure 4f). Another example
424 is component 27, which was shared with rheumatoid arthritis and systemic lupus
425 erythematosus, two distinct but representative autoimmune diseases. Component 27 was
426 explained by the variants associated with interleukin secretion and plasma cell number
-10 -10 427 (binomial Penrichment = 6.1×10 and 9.3×10 , respectively), and significantly enriched in the
-4 428 DNase I hypersensitive site (DHS) signature of lymphoid tissue (Penrichment =1.3×10 ; Figure
429 4g). This might suggest the convergent etiology of the two autoimmune diseases, which
430 could not be elucidated by the genetic correlation alone.
431 Finally, we aimed at hypothesis-free categorization of diseases based on these
432 components. Historically, hypersensitivity reactions have been classified into four types (e.g.,
433 types I to IV)39, but the clear sub-categorization of allergic diseases based on this
434 pathogenesis and whether the categorization can be achieved solely by genetics were
435 unknown. In our TSVD results, the allergic diseases (mostly J and L in ICD10) were
436 represented by the four components 3, 16, 26, and 34. By combining these components as
437 axis-1 (e.g., components 3 and 16) and axis-2 (e.g., components 26 and 34), and comparing
438 the cumulative variance explained by these axes, we defined axis-1 dominant allergic
439 diseases (e.g., asthma and allergic rhinitis) and axis-2 dominant allergic diseases (metal
440 allergy, contact dermatitis, and atopic dermatitis; Figure 4h). Intriguingly, the axis-1
441 dominant diseases corresponded etiologically well to type I allergy (i.e., immediate
442 hypersensitivity). The variants explaining axis-1 were biologically related to IgE secretion
-46 -44 443 and Th2 cells (binomial Penrichment = 9.9×10 and 2.9×10 , respectively). Furthermore,
24 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
444 GWAS of eosinophil count was projected onto axis-1, which recapitulated the biology of type
445 I allergy40. In contrast, the axis-2 dominant diseases corresponded to type IV allergy (i.e.,
446 cell-mediated delayed hypersensitivity). The variants explaining axis-2 were associated with
-10 -9 447 IL-13 and interferon secretion (binomial Penrichment = 1.6×10 and 5.2×10 , respectively),
448 and GWAS of C-reactive protein was projected onto axis-2, which was distinct from axis-141.
449 To summarize, our deconvolution approach (i) recapitulated the existing disease
450 classifications, (ii) clarified the underlying biological mechanisms and relevant tissues
451 shared among a spectrum of related diseases, and (iii) showed potential application for
452 genetics-driven categorization of human diseases.
453
454
25 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
455 456 Figure 4. The deconvolution analysis of a matrix of summary statistics of 159
457 diseases across populations.
458 (a) An illustrative overview of deconvolution-projection analysis. Using DeGAs framework, a
459 matrix of summary statistics from two populations (EUR: European and BBJ: Biobank
460 Japan) was decomposed into latent components, which were interpreted by annotation of a
26 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
461 set of genetic variants driving each component and in the context of other GWASs through
462 projection. (b) A schematic representation of TSVD applied to decompose a summary
463 statistic matrix W to derive latent components. U, S, and V represent resulting matrices of
464 singular values (S) and singular vectors (U and V). (c) A heatmap representation of DeGAs
465 squared cosine scores of diseases (columns) to components (rows). The components are
466 shown from 1 (top) to 40 (bottom), and diseases are sorted based on the contribution of
467 each component to the disease measured by the squared cosine score (from component 1
468 to 40). Full results with disease and component labels are in Extended Data Figure 14.
469 (d) Results of TSVD of disease genetics matrix and the projection of biomarker genetics.
470 Diseases (left) and biomarkers (right) are colored based on the ICD10 classification and
471 functional categorization, respectively. The derived components (middle; from 1 to 40) are
472 colored alternately in blue or red. The squared cosine score of each disease to each
473 component and each biomarker to each component is shown as red and blue lines. The
474 width of the lines indicates the degree of contribution. The diseases with squared cosine
475 score > 0.3 in at least one component are displayed. Anth; anthropometry, BP; blood
476 pressure, Metab; metabolic, Prot; protein, Kidn; kidney-related, Ele; Electrolytes, Liver;
477 liver-related, Infl; Inflammatory, BC; blood cell. (e-h) Examples of disease-component
478 correspondence and the biological interpretation of the components by projection and
479 enrichment analysis using GREAT. A representative component explaining a group of
480 diseases based on the contribution score, along with responsible genes, functional
481 enrichment results GREAT, relevant tissues, and relevant biobarkers/metabolites is shown.
482 GB; gallbladder. RA; rheumatoid arthritis. SLE; systemic lupus erythematosus.
483
27 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
484 Discussion
485 Here, we performed 220 GWASs of human complex traits by incorporating the PMH and
486 EMR data in BBJ, substantially expanding the atlas of genotype-phenotype associations in
487 non-Europeans. We then systematically compared their genetic basis with GWASs of
488 corresponding phenotypes in Europeans. We confirmed the global replication of loci
489 identified in BBJ, and discovered 4,170 novel loci through trans-ethnic meta-analyses,
490 highlighting the value of conducting GWASs in diverse populations. The results are openly
491 shared through web resources, which will be a platform to accelerate further research such
492 as functional follow-up studies and drug discovery42. Of note, leveraging these well-powered
493 GWASs, we observed that the genes associated with endocrine/metabolic, circulatory, and
494 respiratory diseases (E, I, and J by ICD10) were systematically enriched in targets of
495 approved medications treating those diseases (Extended Data Figure 15). This should
496 motivate us to use this expanded resource for genetics-driven novel drug discovery and
497 drug repositioning.
498 The landscape of regional pleiotropy was globally shared across populations, and
499 pleiotropic regions tended to have been under recent positive selection. Further elucidation
500 of pleiotropy in other populations is warranted to replicate our results. To highlight the utility
501 of deep phenotype GWASs, we finally decomposed the multi-ethnic genotype–phenotype
502 association patterns by TSVD. The latent components derived from TSVD pinpointed the
503 convergent biological mechanisms and relevant cell types across diseases, which can be
504 utilized for re-evaluation of existing disease classifications. The incorporation of biomarker
505 and metabolome GWAS summary statistics enabled further interpretation of the latent
506 components. Our approach suggested a potential avenue for restructuring of the medical
28 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
507 diagnoses through dissecting the shared genetic basis across a spectrum of diseases, as
508 analogous to the current disease diagnostics historically shaped through empirical
509 categorization of shared key symptoms across a spectrum of organ dysfunctions.
510 In conclusion, our study substantially expanded the atlas of genetic associations,
511 supported the historically-defined categories of human diseases, and should accelerate the
512 discovery of the biological basis contributing to complex human diseases.
513
514
515 Acknowledgments
516 We sincerely thank all the participants of BioBank Japan, UK Biobank, and FinnGen. This
517 research was supported by the Tailor-Made Medical Treatment program (the BioBank Japan
518 Project) of the Ministry of Education, Culture, Sports, Science, and Technology (MEXT), the
519 Japan Agency for Medical Research and Development (AMED). The FinnGen project is
520 funded by two grants from Business Finland (HUS 4685/31/2016 and UH 4386/31/2016) and
521 nine industry partners (AbbVie, AstraZeneca, Biogen, Celgene, Genentech, GSK, MSD,
522 Pfizer and Sanofi). Following biobanks are acknowledged for collecting the FinnGen project
523 samples: Auria Biobank (https://www.auria.fi/biopankki/), THL Biobank
524 (https://thl.fi/fi/web/thl-biopank), Helsinki Biobank
525 (https://www.terveyskyla.fi/helsinginbiopankki/), Northern Finland Biobank Borealis
526 (https://www.ppshp.fi/Tutkimus-ja-opetus/Biopankki), Finnish Clinical Biobank Tampere
527 (https://www.tays.fi/biopankki), Biobank of Eastern Finland (https://ita-suomenbiopankki.fi),
528 Central Finland Biobank (https://www.ksshp.fi/fi-FI/Potilaalle/Biopankki), Finnish Red Cross
529 Blood Service Biobank (https://www.bloodservice.fi/Research%20Projects/biobanking),
29 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
530 Terveystalo Biobank Finland
531 (https://www.terveystalo.com/fi/Yritystietoa/Terveystalo-Biopankki/Biopankki/). S.S. was in
532 part supported by The Mochida Memorial Foundation for Medical and Pharmaceutical
533 Research.M.Kanai was supported by a Nakajima Foundation Fellowship and the Masason
534 Foundation. Y.Tanigawa is in part supported by a Funai Overseas Scholarship from the
535 Funai Foundation for Information Technology and the Stanford University School of
536 Medicine. M.A.R. is in part supported by National Human Genome Research Institute
537 (NHGRI) of the National Institutes of Health (NIH) under award R01HG010140 (M.A.R.), and
538 a National Institute of Health center for Multi- and Trans-ethnic Mapping of Mendelian and
539 Complex Diseases grant (5U01 HG009080). The content is solely the responsibility of the
540 authors and does not necessarily represent the official views of the National Institutes of
541 Health. Y.O. was supported by the Japan Society for the Promotion of Science (JSPS)
542 KAKENHI (19H01021, 20K21834), and AMED (JP20km0405211, JP20ek0109413,
543 JP20ek0410075, JP20gm4010006, and JP20km0405217), Takeda Science Foundation,
544 and Bioinformatics Initiative of Osaka University Graduate School of Medicine, Osaka
545 University.
546
547 Author Contributions
548 S.S., M. Kanai, and Y.O. conceived the study. S.S., M. Kanai, Y. Tanigawa., M.A.R., and
549 Y.O. wrote the manuscript. S.S., M. Kanai, J.K., M. Kurki, T.Konuma, Kenichi Yamamoto,
550 M.A., K.Ishigaki, Kazuhiko Yamamoto, Y. Kamatani, A.P., M.J.D., and Y.O. conducted
551 GWAS data studies. S.S., Y. Tanigawa., and M.A.R. conducted statistical decomposition
552 analysis. S.S., S.T., A.N., G.T., and Y.O. conducted metabolome analysis. A.S., K.S., W.O.,
30 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
553 Ken Yamaji, K.T., S.A., Y.Takahashi, T.S., N.S., H.Y., S.Minami, S.Murayama, Kozo
554 Yoshimori, S.N., D.O., M.H., A.M., Y.Koretsune, K.Ito, C.T., T.Y., I.K., T.Kadowaki, M.Y.,
555 Y.N., M.Kubo, Y.M., Kazuhiko Yamamoto, and K.M. collected and managed samples and
556 data. A.P. and M.J.D. coordinated collaboration with FinnGen.
557
558 Competing Financial Interests
559 M.A.R. is on the SAB of 54Gene and Computational Advisory Board for Goldfinch Bio and
560 has advised BioMarin, Third Rock Ventures, MazeTx and Related Sciences. The funders
561 had no role in study design, data collection and analysis, decision to publish, or preparation
562 of the manuscript.
563
564 Data availability
565 The genotype data of BBJ used in this study are available from the Japanese
566 Genotype-phenotype Archive (JGA; http://trace.ddbj.nig.ac.jp/jga/index_e.html) with
567 accession code JGAD00000000123 and JGAS00000000114. The UKB analysis was
568 conducted via application number 47821. This study used the FinnGen release 3 data.
569 Summary statistics of BBJ GWAS and trans-ethnic meta-analysis will be publicly available
570 without any restrictions.
571
572 Code availability
573 We used publicly available software for the analyses. The used software is listed and
574 described in the Method section of our manuscript.
575
31 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
576 Methods
577 Genome-wide association study of 220 traits in BBJ
578 We conducted 220 deep phenotype GWASs in BBJ. BBJ is a prospective biobank that
579 collaboratively collected DNA and serum samples from 12 medical institutions in Japan and
580 recruited approximately 200,000 participants, mainly of Japanese ancestry (Supplementary
581 Note). All study participants had been diagnosed with one or more of 47 target diseases by
582 physicians at the cooperating hospitals. We previously conducted GWASs of 42 out of the
583 47 target diseases9. In this study, we newly curated the PMH records included in the clinical
584 data, and performed text-mining to retrieve disease records from the free-format EMR as
585 well. For disease phenotyping, we merged this information with the target disease status,
586 and defined the case status for 159 diseases with a case count > 50 (Supplementary Table
587 2). As controls, we used samples in the cohort without a given diagnosis or related
588 diagnoses, which was systematically defined by using the phecode framework3
589 (Supplementary Table 1). For medication-usage phenotyping, we again retrieved
590 information by text-mining of 7,018,972 medication records. Then, we categorized each
591 medication trade name by using the ATC, World Health Organization. For biomarker
592 phenotyping, we used the same processing and quality control method as previously
593 described (Supplementary Table 2 for phenotype summary)10,43. In brief, we excluded
594 measurements outside three times of interquartile range (IQR) of upper/lower quartile. For
595 individuals taking anti-hypertensive medications, we added 15 mmHg to systolic blood
596 pressure (SBP) and 10 mmHg to diastolic blood pressure (DBP). For individuals taking a
597 statin, we applied the following correction to the lipid measurements: i) Total cholesterol was
598 divided by 0.8; ii) measured LDL-cholesterol (LDLC) was adjusted as LDLC / 0.7; iii) derived
32 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
599 LDLC from the Friedewald was re-derived as (Total cholesterol / 0.8) - HDLC - (Triglyceride/
600 5).
601 We genotyped participants with the Illumina HumanOmniExpressExome BeadChip or a
602 combination of the Illumina HumanOmniExpress and HumanExome BeadChips. Quality
603 control of participants and genotypes was performed as described elsewhere11. In this
604 project, we analyzed 178,726 participants of Japanese ancestry as determined by the
605 principal component analysis (PCA)-based sample selection criteria. The genotype data
606 were further imputed with 1000 Genomes Project Phase 3 version 5 genotype (n = 2,504)
607 and Japanese whole-genome sequencing data (n = 1,037) using Minimac3 software. After
608 this imputation, we excluded variants with an imputation quality of Rsq < 0.7.
609 We conducted GWASs for binary traits (i.e., disease endpoints and medication usage)
610 by using a generalized linear mixed model implemented in SAIGE (version 0.37), which had
611 substantial advantages in terms of (i) maximizing the sample size by including genetically
612 related participants, and (ii) controlling for case–control imbalance12 , which was the case in
613 many of the disease endpoints in this study. We included adjustments for age, age2, sex,
614 age×sex, age2×sex, and top 20 principal components for as covariates used in step 1. For
615 sex-specific diseases, we alternatively adjusted for age, age2, and the top 20 principal
616 components as covariates used in step 1, and we used only controls of the sex to which the
617 disease is specific. For the X chromosome, we conducted GWASs separately for males and
618 females, and merged their results by inverse-variance fixed-effects meta-analysis44. We
619 conducted GWASs for quantitative traits (i.e., biomarkers) by using a linear mixed model
620 implemented in BOLT-LMM (version 2.3.4). We included the same covariates as used in the
621 binary traits above.
33 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
622 All the participants provided written informed consent approved from ethics committees
623 of the Institute of Medical Sciences, the University of Tokyo and RIKEN Center for
624 Integrative Medical Sciences.
625
626 Harmonized genome-wide association study of 220 traits in UKB and FinnGen
627 We conducted the GWASs harmonized with BBJ in UKB and in FinnGen. The UK Biobank
628 project is a population-based prospective cohort that recruited approximately 500,000
629 people across the United Kingdom (Supplementary Note). We defined case and control
630 status of 159 disease endpoints, which were originally retrieved from the clinical information
631 in UKB and mapped to BBJ phenotypes via phecode (Supplementary Table 1). We also
632 analyzed 38 biomarker values provided by the UKB. The genotyping was performed using
633 either the Applied Biosystems UK BiLEVE Axiom Array or the Applied Biosystems UK
634 Biobank Axiom Array. The genotypes were further imputed using a combination of the
635 Haplotype Reference Consortium, UK10K, and 1000 Genomes Phase 3 reference panels by
636 IMPUTE4 software7. In this study, we analyzed 361,194 individuals of white British genetic
637 ancestry as determined by the PCA-based sample selection criteria (see URLs). We
638 excluded the variants with (i) INFO score ≤ 0.8, (ii) MAF ≤ 0.0001 (except for missense and
639 protein-truncating variants annotated by VEP45, which were excluded if MAF ≤ 1 × 10-6), and
-10 640 (iii) PHWE ≤ 1 × 10 . We conducted GWASs for 159 disease endpoints by using SAIGE with
641 the same covariates used in the BBJ GWAS. For biomarker GWASs, we used publicly
642 available summary statistics of UKB biomarker GWAS when available (see URLs), and
643 otherwise performed linear regression using PLINK software with the same covariates,
644 excluding the genetically related individuals (the 1st, 2nd, or 3rd degree)7. For medication
34 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
645 usage GWASs, we used publicly available summary statistics of medication usage in UKB19,
646 which was organized by the ATC and thus could be harmonized with BBJ GWASs.
647 FinnGen is a public–private partnership project combining genotype data from Finnish
648 biobanks and digital health record data from Finnish health registries (Supplementary
649 Notes). For GWASs, we used the summary statistics of FinnGen release 3 data (see URLs).
650 The disease endpoints were mapped to BBJ phenotypes by using ICD10 code, and we
651 defined 129 out of 159 endpoints in BBJ. We did not conduct biomarker and
652 medication-related GWASs because the availability of these phenotypes was limited.
653
654 Meta-analysis, definition of significant loci, and annotation of the lead variants with
655 genome-wide significance
656 First, we performed intra-European meta-analysis when summary statistics of both UKB and
657 FinnGen were available, and then performed trans-ethnic meta-analysis across three or two
658 cohorts in 159 disease endpoints, 38 biomarker values, and 23 medication usage GWASs.
659 We conducted these meta-analyses by using the inverse-variance method and estimated
660 heterogeneity with Cochran’s Q test with metal software44. The summary statistics of
661 primary GWASs in BBJ and trans-ethnic meta-analysis GWASs are openly shared without
662 any restrictions.
663 We adopted the genome-wide significance threshold of < 1.0×10-8, as previously used
664 in similar projects in BBJ and UKB9,21. We defined independent genome-wide significant loci
665 on the basis of genomic positions within ±500 kb from the lead variant. We considered a
666 trait-associated locus as novel when the locus within ±1 Mb from the lead variant did not
667 include any variants that were previously reported to be significantly associated with the
35 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
668 same disease. We basically searched for previous reports of known loci in the GWAS
669 catalog18, but also referred to PubMed or preprints when the corresponding trait was not
670 included in GWAS catalog or when the large-scale GWASs were released in the preprint
671 server as of July 2020 (Supplementary Table 9).
672 We annotated the lead variants using ANNOVAR software, such as rsIDs in dbSNP
673 database (see URLs), the genomic region and closest genes, and functional consequences.
674 We also supplemented this with the gnomAD database15, and also looked for the allele
675 frequencies in global populations as an independent resource.
676
677 Replication of significant associations in BBJ
678 For 2,287 lead variants in the genome-wide significant loci of 159 disease endpoints and 38
679 biomarkers in BBJ, we compared the effect sizes and directions with European-only
680 meta-analysis when available and with UKB-based summary statistics otherwise. Of them,
681 1,929 variants could be compared with the corresponding European GWASs. Thus, we
682 performed the Pearson’s correlation test for these variants’ beta in the association test in
683 BBJ and in European GWAS. We also performed the correlation tests with variants with
-8 684 PEUR < 0.05 and to those with PEUR < 1.0×10 .
685
686 Evaluation of regional pleiotropy
687 We assessed the regional pleiotropy based on each tested genetic variant separately for
688 BBJ GWASs and for European GWASs (i.e., intra- European meta-analysis when FinnGen
689 GWAS was available and UKB summary statistics otherwise). We quantified the degree of
690 pleiotropy per genetic variant by aggregating and counting the number of genome-wide
36 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
691 significant associations across 220 traits. We then annotated loci from the largest number of
692 associations (nassociations > 9 in BBJ and > 18 in Europeans) in Figure 2a, b.
693 Next, we assessed the recent natural selection signature within the pleiotropic loci
694 separately for Japanese and for Europeans. To do this, we first defined the pleiotropic loci
695 by identifying genetic variants that harbored a larger number of significant associations than
696 a given threshold. We varied this threshold from 1 to 40. Then, at each threshold, we
697 calculated the sum of SDS χ2 values within the pleiotropic loci, and compared this with the χ2
698 distribution under the null hypothesis with a degree of freedom equal to the number of
699 variants in the loci. We thus estimated the SDS enrichment within the pleiotropic loci defined
700 by a given threshold as fold change and P value. The SDS values were obtained from the
701 web resource indicated in the original article on Europeans (see URLs) and provided by the
702 authors on Japanese23. The raw SDS values were normalized according to the derived allele
703 frequency as described previously.
704
705 Fine-mapping of HLA and ABO loci
706 We performed the fine-mapping of MHC associations in BBJ and UKB by HLA imputation46.
707 In BBJ, we imputed classical HLA alleles and corresponding amino acid sequences using
708 the reference panel recently constructed from 1,120 individuals of Japanese ancestry by the
709 combination of SNP2HLA software, Eagle, and minimac3 , as described previously47. We
710 applied post-imputation quality control to keep the imputed variants with minor allele
711 frequency (MAF) ≥ 0.5% and Rsq > 0.7. For each marker dosage that indicated the
712 presence or absence of an investigated HLA allele or an amino acid sequence, we
713 performed an association test with the disease endpoints and biomarkers. We assumed
37 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
714 additive effects of the allele dosages on phenotypes in the regression models. We included
715 the same covariates as in the GWAS. In UKB, we imputed classical HLA alleles and
716 corresponding amino acid sequences using the T1DGC reference panel of European
717 ancestry (n = 5,225)48. We applied the same post-imputation quality control and performed
718 the association tests as in BBJ.
719
720 Heritability and genetic correlation estimation
721 We performed LD score regression (see URLs) for GWASs of BBJ and Europeans to
722 estimate SNP-based heritability, potential bias, and pairwise genetic correlations. Variants in
723 the MHC region (chromosome 6:25–34 Mb) were excluded. We also excluded variants with
724 χ2 > 80, as recommended previously49. For heritability estimation, we used the baselineLD
725 model (version 2.2), which included 97 annotations that correct for bias in heritability
726 estimates50. We note that we did not report liability-scale heritability, since population
727 prevalence of 159 diseases in each country was not always available, and the main
728 objective of this analysis was an assessment of bias in GWAS, rather than the accurate
729 estimation of heritability. We calculated the heritability Z-score to assess the reliability of
2 730 heritability estimation, and reported the LDSC results with Z-score for h SNP is > 2
731 (Supplementary Table 3). For calculating pairwise genetic correlation, we again restricted
2 49 732 the target GWASs to those whose Z-score for h SNP is > 2, as recommended previously . In
733 total, we calculated genetic correlation for 106 GWASs in BBJ and 148 in European GWASs,
734 which resulted in 5,565 and 10,878 trait pairs, respectively.
735 To illustrate trait-by-trait genetic correlation, we hierarchically clustered the rg values
736 with hclust and colored them as a heatmap (Extended Data Figure 10). To adopt reliable
38 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
737 genetic correlations, we restricted the rg values that had Pcor < 0.05. Otherwise, the rg values
738 were replaced with 0. We then defined the tightly clustered trait domains by greedily
739 searching for the phenotype blocks with pairwise rg > 0.7 within 70% of rg values in the block
740 from the top left of the clustered correlation matrix. We manually annotated each trait
741 domain by extracting the characteristics of traits constituting the domain (Extended Data
742 Figure 10).
743
744 Deconvolution of a matrix of summary statistics by TSVD
745 We performed the TSVD on the matrix of genotype-phenotype association Z scores as
746 described previously as DeGAs framework5. In this study, we first focused on 159 disease
747 endpoint GWASs in BBJ and European GWAS (i.e., 318 in total) to derive latent
748 components through TSVD. On constructing a Z-score matrix, we conducted variant-level
749 QC. We removed variants located in the MHC region (chromosome 6: 25–34 Mb), and
750 replaced unreliable Z-score estimates with zero when one of the following conditions were
751 satisfied:
752 - P value of marginal association ≥ 0.001
753 - Standard error of beta value ≥ 0.2
754 Considering that rows and columns with all zeros do not contribute to matrix decomposition,
755 we excluded variants that had all zero Z-scores across 159 traits in either in BBJ or
756 Europeans. We then performed LD pruning using PLINK software51 (“--indep-pairwise 50 5
757 0.1”) with an LD reference of 5,000 randomly selected individuals of white British UKB
758 participants to select LD-independent variant sets, which resulted in a total of 22,980
759 variants. Thus, we made a Z-score matrix (= W) with a size of 318 (N: 159 diseases × 2
39 medRxiv preprint doi: https://doi.org/10.1101/2020.10.23.20213652; this version posted October 27, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
760 populations) × 22,980 (M: variants). With a predetermined number of K, TSVD
T T 761 decomposed W into a product of three matrices: U, S, and V : W = USV . U = (ui,k)i,k is
762 an orthonormal matrix of size N × K whose columns are phenotype singular vectors, S is a
763 diagonal matrix of size K × K whose elements are singular values, and V = (vj,k)j,k is an
764 orthonormal matrix of size M × K whose columns are variant singular vectors. Here we set
765 K as 40, which together explained 36.7% of the total variance of the original matrix. This
766 value was determined by experimenting with different values from 20 to 100 and selecting
767 the informative and sufficient threshold. We used the TruncatedSVD module in the
768 sklearn.decomposition library of python for performing TSVD.
769 To interpret and visualize the results of TSVD, we calculated the squared cosine