Downloaded from genome.cshlp.org on October 9, 2021 - Published by Cold Spring Harbor Laboratory Press

Assembly and analysis of 100 full MHC haplotypes from the Danish population

Jacob M. Jensen1*, Palle Villesen1,2, Rune M. Friborg1, The Danish Pan Genome Consortium, Thomas Mailund1, Søren Besenbacher1,3, Mikkel H. Schierup1,4*

1. Bioinformatics Research Centre, Aarhus University, CF Møllers Alle 8, 8000 Aarhus C. Denmark
2. Department of Clinical Medicine, Aarhus University, Palle Juul-Jensens Boulevard 82, 8200 Aarhus N. Denmark
3. Department of Molecular Medicine, Aarhus University Hospital, Skejby, Aarhus N 8200, Denmark
4. Department of Bioscience, Aarhus University, Ny Munkegade 116, 8000 Aarhus C. Denmark

* Correspondence to [email protected], [email protected]

Abstract

Genes in the major histocompatibility complex (MHC, also known as HLA) play a critical role in the immune response, and variation within the extended 4 Mb region is associated with major risks of many diseases. Yet, deciphering the underlying causes of these associations is difficult because the MHC is the most polymorphic region of the genome with a complex linkage disequilibrium structure. Here we reconstruct full MHC haplotypes from de novo assembled trios without relying on a reference genome and perform evolutionary analyses. We report 100 full MHC haplotypes and call a large set of structural variants in the regions for future use in imputation with GWAS data. We also present the first complete analysis of the recombination landscape in the entire region and show how balancing selection at classical genes has linked effects on the frequency of variants throughout the region.

Introduction

The major histocompatibility complex covers 4 Mb on Chromosome 6 and is the most polymorphic part of the human genome. Most of the ~200 genes in the region are directly involved with the immune system. The high diversity is thought to be driven by balancing selection acting on several individual genes combined with an overall low recombination rate in the MHC (DeGiorgio et al. 2014). Genome-wide association studies have revealed the MHC to be the most important region in the human genome for disease associations, in particular for autoimmune diseases (Trowsdale and Knight 2013; Zhou et al. 2016).

The very high diversity and wide-ranging linkage disequilibrium (LD) make it difficult to disentangle selective forces and to accurately pinpoint the variants responsible for disease associations. Many regions are too variable for reliable identification of variants from mapping of short reads to the human reference genome. LD causes multiple nearby variants to provide the same statistical evidence of association, hampering the identification of causal variants. In addition to the human genome reference MHC haplotype, seven other haplotypes have been sequenced (Horton et al. 2008), though six of these are incomplete, and exploiting these improves mapping performance significantly (Dilthey et al. 2015; Dilthey et al. 2016). There is a strong need for obtaining a larger number of full MHC haplotypes, which requires de novo assembly of the haplotypes without the use of a reference genome. Long-read technology and refined capture methods are potentially very powerful (Chaisson et al. 2015; English et al. 2015; Selvaraj et al.
2015), but these approaches are still prohibitively expensive at a large scale.

The Danish Pangenome project (L Maretty, JM Jensen, B Petersen, JA Sibbesen, S Liu, P Villesen, L Skov, K Belling, CT Have, JMG Izarzugaza et al., in press) was designed to perform individual de novo assembly of 50 parent-child trios sequenced to high depth with multiple insert size libraries. We use data from 25 of these trios to reconstruct and analyze the four parental MHC haplotypes in each trio (100 haplotypes in total). Our approach combines the de novo assemblies with transmission information, read-backed phasing and joint analysis of each trio. Our final set of 100 full MHC haplotypes has less than 2% missing data and >92% of all variants phased. We recently reported that we found a total of 701 kb of novel sequence in these haplotypes and that some of these segments are large (3-6 kb) and common in our haplotypes (present in 22-26% of parental haplotypes) (L Maretty, JM Jensen, B Petersen, JA Sibbesen, S Liu, P Villesen, L Skov, K Belling, CT Have, JMG Izarzugaza et al., in press). Here we describe our method of assembly and phasing in detail and perform an evolutionary analysis of the resulting haplotypes.

Results

Assembly of 100 full MHC haplotypes

Our assembly approach was designed to circumvent the challenges in mapping short reads to a reference sequence. Through several steps we leverage transmission information and read-backed phasing to create candidate haplotypes to which we can map reads. As the candidate haplotypes were created from the reads themselves, subsequent mapping is more successful than mapping to the reference genome, and phasing is improved. The procedure of mapping and phasing is iterated, as each inferred phased haplotype improves mapping and in turn phasing.
Figure 1 shows a schematic of our pipeline. The starting point is a set of scaffolds for each individual, de novo assembled using ALLPATHS-LG (Gnerre et al. 2011) on genomes sequenced to 78X by multiple insert size libraries (L Maretty, JM Jensen, B Petersen, JA Sibbesen, S Liu, P Villesen, L Skov, K Belling, CT Have, JMG Izarzugaza et al., in press).

We extracted scaffolds mapping with at least 50 kb to the MHC region (the number of scaffolds ranges from 1-8 across individuals, see Supplemental Fig. S1a) and concatenated these to create diploid consensus scaffolds including bubbles in the assembly graph (step 2). For each trio, >77% of the bubbles in the alignment graphs were phased without Mendelian violation using the sequence immediately upstream and downstream of each bubble to find exact matches within the trio (step 3). After phasing, we created a sequence for each non-transmitted parental haplotype and created a consensus sequence between transmitted parental haplotypes and inherited child haplotypes by multiple global alignments of segments between phased bubbles (steps 4 and 5). We then mapped reads to the transmitted (consensus) haplotypes and genotyped and phased them using transmission information and read-backed phasing (step 6). We then closed gaps to obtain full-length haplotypes with <2% gaps (step 7, Supplemental Fig. S1b). A second iteration of mapping, genotyping and phasing resulted in phasing of >92% of the variants in the transmitted haplotypes, of which >80% were mapped back to the non-transmitted haplotype using exact matching (step 8) (Supplemental Fig. S1c).
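To make the use of transmission information concrete, the following is a minimal sketch of trio-based phasing at a single site. It is a generic illustration of Mendelian transmission logic and is not taken from our pipeline; the function name `phase_site`, the allele encoding as integers and the three status labels are assumptions introduced here for illustration only.

```python
from typing import Optional, Tuple

Genotype = Tuple[int, int]  # unordered pair of allele codes, e.g. (0, 1)


def phase_site(
    mother: Genotype, father: Genotype, child: Genotype
) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Phase the child's genotype at one site using parental genotypes.

    Returns (status, phase), where phase is (maternal_allele, paternal_allele)
    when status is "phased", and None otherwise.
    Hypothetical helper for illustration; not the authors' implementation.
    """
    a, b = child
    # Try both assignments of the child's two alleles to the two parents
    # and keep those that are consistent with the parental genotypes.
    consistent = set()
    for maternal, paternal in ((a, b), (b, a)):
        if maternal in mother and paternal in father:
            consistent.add((maternal, paternal))

    if not consistent:
        # No assignment is compatible with Mendelian inheritance.
        return ("violation", None)
    if len(consistent) > 1:
        # Both assignments are compatible (e.g. all three are heterozygous);
        # additional evidence such as read-backed phasing would be needed.
        return ("ambiguous", None)
    return ("phased", consistent.pop())
```

Under these assumptions, a site where the mother is homozygous and the child heterozygous is resolved directly, whereas a site where all three individuals are heterozygous is left ambiguous, which is where read-backed phasing can contribute.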
We evaluated the accuracy of variant calling and phasing by cloning and Sanger sequencing of 5 clones from 75 random fragments from highly polymorphic regions containing between 2 and 10 variants (204 variants in total). We found a validation rate of 86% for the phase of the variants (see Supplemental Table S1 for details).

We used simulations to further evaluate the power and accuracy of our approach by simulating reads in an artificial trio with known MHC haplotypes, reconstructing the haplotypes using our pipeline and comparing these to the original haplotypes. We simulated reads from a trio with four of the different reference haplotypes: pgf and mcf in the mother, cox and qbl in the father, and pgf and cox in the child. Reads were simulated to reflect the same coverage, insert size distribution and error profile as our own sequencing. De novo assembly and inference of phased haplotypes were then done in exactly the same way as for the real data using our pipeline outlined in Figure 1, and we then investigated whether we could separately recover the cox and the pgf haplotypes in the child. Supplemental Fig. S2a-b shows that while the initial assembly in the child is a mixture of the two haplotypes, the final haplotypes generally align over the whole region with pgf and cox, respectively, showing that the pipeline has phased them. We found that 91.6% of the haplotypes aligned to the correct haplotypes (Supplemental Fig. S2b) and that the lengths of incorrectly phased segments were generally very short compared to the correctly phased segments (Supplemental Fig. S2c). As collapse of paralogous or repetitive sequence might be a likely error mode in the assembled haplotypes (Alkan et al.
2011), we calculated the content of Alu and LINE-1 repetitive elements as a measure of the amount of collapsed repetitive sequence in the eight reference haplotypes, our simulated haplotypes and our 100 new haplotypes.
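As an illustration of how such a repeat content measure could be computed, the sketch below takes annotated repeat intervals on a haplotype and returns the fraction of the haplotype they cover. The text above does not specify how the annotation was produced or summarized, so the interval representation (half-open coordinates), the function names `merge_intervals` and `repeat_fraction`, and the choice to merge overlapping annotations are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in haplotype coordinates


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_fraction(intervals: Iterable[Interval], haplotype_length: int) -> float:
    """Fraction of a haplotype covered by annotated repeats (e.g. Alu, LINE-1).

    Overlapping annotations are counted once, and intervals are clipped to
    the haplotype boundaries. Hypothetical helper for illustration only.
    """
    if haplotype_length <= 0:
        raise ValueError("haplotype_length must be positive")
    covered = 0
    for start, end in merge_intervals(intervals):
        clipped_start = max(start, 0)
        clipped_end = min(end, haplotype_length)
        if clipped_end > clipped_start:
            covered += clipped_end - clipped_start
    return covered / haplotype_length
```

Comparing this fraction between assembled and reference haplotypes gives a simple indication of whether repetitive sequence has been lost: a markedly lower value in an assembled haplotype would be consistent with collapsed repeats.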