The Genomic Landscape of Clonal Hematopoiesis in Japan
Total Page:16
File Type:pdf, Size:1020Kb
Supplementary Materials for : The genomic landscape of clonal hematopoiesis in Japan Authors: Chikashi Terao1-3*, Akari Suzuki4, Yukihide Momozawa5, Masato Akiyama1,6, Kazuyoshi Ishigaki1, Kazuhiko Yamamoto4, Koichi Matsuda7,8, Yoshinori Murakami9, Steven A McCarroll10-12, Michiaki Kubo5, Po-Ru Loh10,13§, Yoichiro Kamatani1,14*§ 1.Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Kanagawa, Japan 2.Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan 3.The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan 4.Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan. 5.Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan. 6.Department of Ophthalmology, Graduate School of Medical Sciences, Kyushu University, Fukuoka, Japan. 7.Laboratory of Genome Technology, Human Genome Center, Institute of Medical Science, The University of Tokyo, Tokyo, Japan. 8.Laboratory of Clinical Genome Sequencing, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan. 9.Division of Molecular Pathology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan 10.Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. 11. Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. 12.Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA. 13.Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA. 14.Kyoto-McGill International Collaborative School in Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan §these authors equally supervised this work 1 Correspondence should be addressed to: Chikashi TERAO at Laboratory for Statistical and Translational Genetics , RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan, E-mail address: [email protected] or Yoichiro Kamatani at Laboratory for Statistical and Translational Genetics , RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan, E-mail address: [email protected] Supplementary Method page 3-10 Supplementary Note page 11-15 Fig. S1-S55 page 16-73 Tables S1-S28 page 74-108 Supplementary Reference page 109 2 Supplementary Methods Calculation of BAF and LRR from genotype intensity 1. Computation of cluster median of X and Y signal intensities. We calculated median of X and Y intensities in each genotype. If a cluster contained calls less than 10, we set its median to missing. 2. Affine-normalization and correction by GC and CpG contents. We took a similar approach to Jacobs et al1, and our method is the same as Loh et al10 regarding this calculation. A pair of multiple variate linear regression was carried out to correct effects of GC and CpG contents. Median of centers (X, Y) were set as two dependent variables to be expected (X, Y) values and modeled with the use of X and Y values in each subject of the genotype, GC and CpG contents in 9 windows of 50, 0.1k, 0.5k, 1k, 10k, 50k, 100k, 250k and 1Mbp at the center of SNPm as covariates as follows. 9 2 퐺퐶 푝 퐺퐶 퐶푝퐺 푝 퐶푝퐺 X푚,푖,푒푥푝 = e푥,푖 + Xm,iαx,i + Ym,iαy,i + ∑ ∑[(푓푚,푘 ) ∙ 훼푖,푘,푝 + (푓푚,푘 ) ∙ 훼푖,푘,푝 ] 푘=1 푝=1 9 2 퐺퐶 푝 퐺퐶 퐶푝퐺 푝 퐶푝퐺 Y푚,푖,푒푥푝 = e푦,푖 + Xm,iβx,i + Ym,iβy,i + ∑ ∑[(푓푚,푘 ) ∙ 훽푖,푘,푝 + (푓푚,푘 ) ∙ 훽푖,푘,푝 ] 푘=1 푝=1 where 푋푚,푖,푒푥푝 and 푌푚,푖,푒푥푝 are median of X and Y intensity in a cluster of a genotype of i-th individual in m-th variant, respectively, 푋푚,푖 and 푌푚,푖 are X and Y values in i-th individual for m-th variant, respectively, 훼푥,푖 , 훽푦,푖 are effect sizes of calculated X and Y for i-th 퐺퐶 퐶푝퐺 individual, respectively, 푓푚,푘 or and 푓푚,푘 are fraction of GC and CpG in k-th window, 퐶푝퐺 퐺퐶 훼푖,푘,푝 푎푛푑 훽푖,푘,푝, are effect size of GC and CpG in i-th individual for k-th window, respectively and e푥,푖 and e푦,푖are the error terms of X and Y for i-th individual, respectively. The multi-variate linear regression analyses were conducted per individual (~179k sets of models), assuming fixed effects of X, Y intensities, GC and CpG waves across variants. The GC content was computed with the use of bedtools on the hg19 reference. The CpG content was calculated with the use of EpiGRAPH CpG annotation45. Corrected X, Y values were calculated by expected X and Y values minus residuals of X and Y. 3.Computation of means of corrected X and Y values. 3 Means of corrected X and Y calculated above in the three cluster centers for each genotype were computed. 4.Transformation of corrected X and Y intensities to θ and R values. We transformed corrected X and Y to θ and R as follows; 2 푌 θ = ∙ arctan ( ) 휋 푋 푙표푔2푅 = 푋 + 푌 This follows the method by Staaf et al43 and differs from the method by Loh et al10. 5. Transformation of θ and R to LRR and BAF values. We computed cluster centers of each genotype to obtain θ and R in three cluster centers. We conducted linear interpolation between the cluster centers to estimate expected log2R based on θ in each individual. If cluster centers of homozygotes were missing, we used reflection of the cluster center in the opposite homozygote genotype across the vertical line on the center of heterozygote genotype. 6. Mean shift of LRR. We noticed that LRR in our dataset had slightly downward bias. To avoid noise in mosaic calls due to this bias, we shifted LRR values in each genotype for each variant to have mean 0 (mean shift). Calling mosaic events with the use of BAF and LRR 1.Filtering constitutional duplications The 25 states corresponding to phased BAF deviation (from -0.24 to 0.24 with interval of 0.02) were used in HMM. We assume events in the state revealed mean BAF deviation equal to state value (-0.24 – 0.24 with interval of 0.02) with empirical standard deviation computed across genotyping results. We capped z-score of 4. Transition probability was determined 0.003 from zero to non-zero state, 0.001 from a state to its negative value (phase switch error). Regions to 4 mask were selected by computing maximum likelihood (Viterbi path) and examining contiguous non-zero states. We masked regions of likely constitutional duplications with <2Mb with |ΔBAF|>0.1 and LRR>0.1 and their 2Mb nearby regions. 2.Parameterized hidden Markov model for event detection We used a family of 3 states HMMs parametrized by deviation of BAF, θ, namely, {-θ,0, θ} taking phase switch errors into account with transition matrix slightly different from 1st step (from±θ to 0, 0.0003, from 0 to ±θ, 0.004 * 0.0003 and switch error of 0.001). For acrocentric chromosomes (no p-arm genotypes), starting probability of non-zero state was decreased by a factor of 0.2. Like the 1st step, we assume events in the state revealed mean BAF deviation equal to state value with empirical standard deviation computed across genotyping results. We capped z-score of 2. 3.Calling existence of an event: likelihood ratio test statistics Based on a sequence of phased BAF deviation (ΔBAF) on each chromosome, we can analyze whether the sequence of observed ΔBAF can be explained by presence of mosaic events with the states defined above. We compute the likelihood ratio statistics with the use of a family of HMM paths parameterized by θ and modeled a total probability observing the sequence L(0 | ∆BAF) ofΔBAF. Likelihood ratio for θ can be modeled by f(∆BAF) = 푠푢푝휃{퐿(휃 |∆퐵퐴퐹)} 0 of θ indicates no mosaic events. We discretized θ to run from 0.001 to 0.25 in 50 multiplicative steps in practice. 4.Calling event boundaries Boundaries of called events were determined based on the consensus of mosaic calls in five samples taken from the posterior of the HMM using the likelihood-maximizing value for θ. 5. Calling copy number Since we noticed that the steepness of slopes in BAF-LRR space to distinguish possible gains and losses from CNN-LOH in the current data set were quite different from the previous study (this is not limited to raw or transformed BAF and LRR values in the current study), we did not use the previously-defined threshold of slopes. We defined 1.3 and -1.05 to distinguish gains 5 and losses from CNN-LOH, respectively. We called copy number only if the most likely call was at least 10 times more likely than the next-most likely call (~90% confidence). 6. Filtering possible constitutional duplications after calling mosaic events. We excluded possible constitutional duplications with length >10 Mb with LRR > 0.35 or LRR >0.2 and |ΔBAF|>0.16. We also filtered events with length <10Mb with LRR >0.2 or LRR>0.1 and |ΔBAF|>0.1. These threshold was defined by the previous study10. We further excluded events classified as gain or unknown and satisfying any of the following criteria: (1) less than 5Mbp length and contained in segmental duplications reported in 1000 genome projects with extension of 0.1Mbp in both directions (2) length <5Mb with LRR>0.13, or (3) length<2Mbp and LRR>0.02. We excluded events with LRR>-0.1 and observed heterozygosity within events less than one third of expected heterozygosity.