Downloaded from genome.cshlp.org on September 23, 2021 - Published by Cold Spring Harbor Laboratory Press

Resource

Single haplotype assembly of the human genome from a hydatidiform mole

Karyn Meltz Steinberg,1 Valerie A. Schneider,2 Tina A. Graves-Lindsay,1 Robert S. Fulton,1 Richa Agarwala,2 John Huddleston,3,4 Sergey A. Shiryev,2 Aleksandr Morgulis,2 Urvashi Surti,5 Wesley C. Warren,1 Deanna M. Church,6 Evan E. Eichler,3,4 and Richard K. Wilson1

1 The Genome Institute at Washington University, St. Louis, Missouri 63108, USA
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
3 Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
4 Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
5 Department of Pathology and Human Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA
6 Personalis, Inc., Menlo Park, California 94025, USA

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. To overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA, including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

[Supplemental material is available for this article.]

The production of a reference sequence assembly for the human genome was a milestone in biology and clearly has impacted many areas of biomedical research (McPherson et al. 2001; International Human Genome Sequencing 2004). The availability of this resource allows us to investigate genomic structure and variation at a depth previously unavailable (Kidd et al. 2008; The 1000 Genomes Project Consortium 2012). These studies have helped make clear the shortcomings of our initial assembly models and the difficulty of comprehensive genome analysis. While the current human reference assembly is of extremely high quality and is still the benchmark by which all other human assemblies must be compared, it is far from perfect. Technical and biological complexity lead to both missing sequences as well as misassembled sequence in the current reference, GRCh38 (Robledo et al. 2002; Eichler et al. 2004; International Human Genome Sequencing 2004; Church et al. 2011; Genovese et al. 2013).

The two most vexing biological problems affecting assembly are (1) complex genomic architecture seen in large regions with highly homologous duplicated sequences and (2) excess allelic diversity (Bailey et al. 2001; Mills et al. 2006; Korbel et al. 2007; Kidd et al. 2008; Zody et al. 2008). Assembling these regions is further complicated due to the fact that regions of segmental duplication (SD) are often correlated with copy-number variants (CNVs) (Sharp et al. 2005). Regions harboring large CNV SDs have been misrepresented in the reference assembly because assembly algorithms aim to produce a haploid consensus. Highly identical paralogous and structurally polymorphic regions frequently lead to nonallelic sequences being collapsed into a single contig or allelic sequences being improperly represented as duplicates. Because of this complexity, a single, haploid reference is insufficient to fully represent human diversity (Church et al. 2011).

The availability of at least one accurate allelic representation at loci with complex genomic architecture facilitates the understanding of the genomic architecture and diversity in these regions (Watson et al. 2013). To enable the assembly of these regions, we have developed a suite of resources from CHM1, a DNA source containing a single human haplotype (Taillon-Miller et al. 1997; Fan et al. 2002). A complete hydatidiform mole (CHM) is an abnormal product of conception in which there is a very early fetal demise

© 2014 Steinberg et al. This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

Corresponding author: [email protected]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.180893.114.

Genome Research 24:2066–2076. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/14; www.genome.org

and overgrowth of the placental tissue. Most CHMs are androgenetic and contain only paternally derived autosomes and sex chromosomes resulting either from dispermy or duplication of a single sperm genome. The phenotype is thought to be a result of abnormal parental contribution leading to aberrant genomic imprinting (Hoffner and Surti 2012). The absence of allelic variation in monospermic CHM makes it an ideal candidate for producing a single haplotype representation of the human genome. There are a number of existing resources associated with the “CHM1” sample, including a BAC library with end sequences generated with Sanger sequencing using ABI 3730 technology (https://bacpac.chori.org/), an optical map (Teague et al. 2010), and a BioNano genomic map (see Data access), some of which have previously been used to improve regions of the reference human genome assembly.

BAC clones have historically been used to resolve difficult genomic regions and identify structural variants (Barbouti et al. 2004; Carvalho and Lupski 2008). A BAC library constructed from CHM1 DNA (CHORI-17, CH17) has also been utilized to resolve several very difficult genomic regions, including human-specific duplications at the SRGAP2 gene family on Chromosome 1 (Dennis et al. 2012). Additionally, the CHM1 BAC clones were used to generate single haplotype assemblies of regions that were previously misrepresented because of haplotype mixing (Watson et al. 2013). Both of these efforts contributed to the improvement of the GRCh38 reference human genome assembly, adding hundreds of kilobases of sequence missing in GRCh37, in addition to providing an accurate single haplotype representation of complex genome regions.

Because of the previously established utility of sequence data derived from the CHM1 resource, we wished to develop a complete assembly of a single human haplotype. To this end, we produced a short read-based (Illumina) reference-guided assembly of CHM1 with integrated high-quality finished fully sequenced BAC clones to further improve the assembly. This assembly has been annotated using the NCBI annotation process and has been aligned to other human assemblies in GenBank, including both GRCh37 and GRCh38. Here we present evidence that the CHM1 genome assembly is a high-quality draft with respect to gene and repetitive element content as well as by comparison to other individual genome assemblies. We will also discuss current plans for developing a fully finished genome assembly based on this resource.

Results

We generated an assembly of the complete hydatidiform mole, CHM1, a genome comprised of 23 chromosomes (1–22 and X and MT) with a total sequence length of 3.04 Gb. Contig N50 length is ~144 kbp and scaffold N50 length is 50 Mbp (Table 1). These N50 statistics were based upon the reference-guided assembly with BAC tiling paths incorporated and are defined as the length of the contig or scaffold where 50% of the assembly is assembled into contigs/scaffolds that are of that length or greater. Compared to other WGS human assemblies, HuRef (J. Craig Venter assembly; GenBank GCA_000002125.2), ALLPATHS (GenBank AEKP00000000.1), and YH_2.0 (GenBank GCA_000004845.2), the CHM1_1.1 assembly has a lower contig number and a higher contig N50, demonstrating that CHM1_1.1 is more contiguous than previously generated individual genome assemblies (Fig. 1). We incorporated high-quality sequence from 382 BAC clones (Supplemental Table S1) to improve the assembly in complex regions where the GRCh37 reference was incorrect (Fig. 2). BAC end sequences themselves were not incorporated into the assembly but were used to analyze the quality of the assembly.

Table 1. Assembly statistics

Total sequence length          3,037,866,619
Total assembly gap length      210,229,812
Gaps between scaffolds         225
Number of scaffolds            163
Scaffold N50                   50,362,920
Number of contigs              40,828
Contig N50                     143,936
Total number of chromosomes    23

We assessed the integrity and fidelity of CHM1 with respect to the GRCh37 reference by analyzing CHORI-17 BAC end sequence mapping to GRCh37. Approximately 95.5% of clone ends mapped uniquely concordantly, 4% mapped uniquely discordantly, and the remaining 0.5% mapped to multiple locations. These statistics indicate that the genomic DNA derived from the CHM1 cell line that was used to create the BAC library and Illumina libraries is not grossly rearranged and represents a suitable template for a platinum reference. In addition, analyses from an optical map generated using CHM1 genomic DNA do not show an excess of structural variants that would suggest somatic rearrangement (Teague et al. 2010). SNP genotyping also confirms the haploid content of the cell line, and karyotyping was performed at several stages during passaging to ensure the integrity (Fan et al. 2002).

Assessment of assembly quality

Repetitive element content

The assembly was masked with both WindowMasker (Morgulis et al. 2006) and RepeatMasker (Smit et al. 1996–2010), and 34.29% and 47.21% of the assembly was masked, respectively. This is comparable to the repetitive content of GRCh37 (34.24% and 47.15%, WindowMasker and RepeatMasker, respectively). When the repetitive elements are parsed out by type, the numbers of each element are comparable between GRCh37 and CHM1_1.1 (Table 2).

Gene content

This analysis is based on NCBI Homo sapiens annotation run 105 that includes GRCh37.p13, CHM1_1.1, HuRef, and the single chromosome assembly CRA_TCAGchr7v2. For our comparison, we only used the annotations on the original GRCh37 Primary Assembly sequences, as many of the fix patches in patch release 13 are based on CHM1. Using this annotation run provides a better comparison than the original GRCh37 annotation, because the same software and evidence set was used.

Using gene annotation as a proxy for assembly quality, the results indicate that the CHM1_1.1 assembly (39,009 total genes, 19,892 protein-coding genes) (Table 3) is of higher quality than the HuRef assembly (38,070 total genes, 19,668 protein-coding genes), though not quite as good as the GRCh37 assembly (39,947 total genes, 20,072 protein-coding genes). The alignment evidence used to support each gene model supports this conclusion. CHM1_1.1 has 21 genes annotated with a “transcription discrepancy” compared to 15 in GRCh37. Interestingly, some genes are problematic in both assemblies, such as OVGP1 and MUC19, suggesting that even in a single haplotype background, complex gene family regions can be difficult to assemble (Supplemental Data: Gene Annotation).
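The N50 statistic reported in Table 1 and defined in the Results can be illustrated with a short Python sketch. This is a minimal illustration of the general definition only, not the software used for the assembly; the function name `n50` and the toy contig lengths are hypothetical and do not come from the CHM1 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that contigs/scaffolds of length >= L
    together contain at least 50% of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths (not CHM1 data): total 300, half 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70, because 80 + 70 reaches half of the total
```

Sorting from longest to shortest and accumulating lengths until half of the total is reached keeps the sketch close to the wording of the definition.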


Figure 1. Comparison of contig count and contig N50 between CHM1_1.1 (GenBank GCA_000306695.2) and HuRef (J. Craig Venter assembly; GenBank GCA_000002125.2), ALLPATHS (GenBank AEKP00000000.1) and YH_2.0 (GenBank GCA_000004845.2) WGS assemblies. CHM1_1.1 has only 10%–20% of the number of total contigs as the other assemblies and has a contig N50 1.5 to six times larger.

While GRCh37 may have better global gene-annotation metrics, there are regions in which CHM1_1.1 performs better. For example, we identified 549 genes unique to the CHM1 assembly (i.e., absent from the GRCh37.p13 primary assembly; Supplemental Table S2). MUC3B, a membrane-bound mucin that maps to Chromosome 7q22 (NC_018918.2: 100477710-100541651), is annotated only on CHM1_1.1 as predicted from Gnomon gene models. The protein produced by MUC3B functions as a major glycoprotein component of mucus gel at the intestinal surface that provides a barrier against foreign particles and microbial organisms. It is part of a tandem duplication involving MUC3A, and MUC3B is expressed exclusively in the small intestine and colon (Pratt et al. 2000; Kyo et al. 2001). Variants of MUC3A have been associated with inflammatory bowel disease, and up-regulation of MUC3 inhibited adherence of pathogenic E. coli in human intestinal cells (Pan et al. 2013). The CHM1 version of MUC3B contains four copies of the tandem repeat.

Other clinically relevant CHM1 genes not present in the GRCh37.p13 primary assembly include KCNJ18 and DUX4L1. KCNJ18 is a member of a large gene family of potassium inwardly rectifying channels located on 17q11.2 (NC_018928.2: 21605469-21617558). It is expressed mostly in skeletal muscle and regulated by thyroid hormone. Mutations in this gene have been associated with thyrotoxic hypokalemic periodic paralysis (MIM 613239) (Ryan et al. 2010). DUX4L1 encodes a transcription factor comprised of two homeobox domains located within a macrosatellite repeat in the subtelomeric region of 4q (NC_018915.2: 190981943-190983264)

Figure 2. (A) WGS assembly from the first pass (CHM1_1.0; GCF_000306695.1, bronze line) on Chromosome 1p12 (NC_018912.1: 121,050,000- 121,400,000) demonstrated a gap (gray box) in the assembly (assembly name: AMYH010000980.1, green lines). Using MEGABLAST, two CH17 clones (AC247039.2 and AC253572.3, red lines) aligned to the region and appeared to span the gap. (B) By incorporating these BAC sequences into the assembly, the gap was subsequently resolved in CHM1_1.1 (NC_018912.2: 121,050,000–121,650,000). The tiling path components, FP325311.11, AC241952.2, AC247039.2, AC253572.3, and AC241377.3 indicate the clone names used to resolve the gap. The clones from A are indicated in red while the other clones are in purple. The final assembly–assembly alignment is indicated in purple, showing the gap resolution.


(Hewitt et al. 1994). Repeat copy-number variation is associated with facioscapulohumeral muscular dystrophy (MIM 158900) (Bosnakovski et al. 2008). Both of these genes are now annotated in GRCh38 with information from the CHM1 data.

Table 2. Comparison of repetitive elements in GRCh37 and CHM1_1.1

Element           CHM1         GRCh37
DNA               482,783      461,751
LINE              1,511,690    1,498,690
LTR               716,720      717,656
Low complexity    376,835      371,543
RNA               778          729
SINE              1,757,213    1,793,723
Satellite         3,099        9,566
Simple repeat     398,210      417,913
rRNA              1,749        1,769
scRNA             1,301        1,340
snRNA             4,306        4,386
srpRNA            1,665        1,481
tRNA              1,852        2,002
Other             19,338       15,581

Clinical allele analysis

Using data from the NHGRI GWAS catalog and ClinVar (Landrum et al. 2014), we assessed the number of risk alleles present in the CHM1 genome. Most loci could be successfully remapped from GRCh37 to CHM1 (7962/7991 NHGRI GWAS loci and 43,614/48,516 ClinVar loci) using the NCBI Remap tool (http://www.ncbi.nlm.nih.gov/genome/tools/remap). The CHM1 genotype matched the “risk” allele at 3284 loci from NHGRI GWAS and 291 loci from ClinVar. CHM1 carries an associated allele for 366 unique phenotypes out of a total of 1089 unique phenotypes in NHGRI GWAS. Of the 291 matching ClinVar alleles, 22 are categorized as pathogenic. Two of the 22 pathogenic alleles are nonsense mutations that cause autosomal recessive disorders: Hemochromatosis type 2A and Marinesco-Sjogren syndrome. The remaining 18 pathogenic alleles have global minor allele frequencies at least 1% (Supplemental Table S3). Overall, the CHM1 genome does not appear to harbor an excessive number of risk alleles or extremely rare alleles associated with diseases (Supplemental data: Clinical Allele Analysis). A similar analysis performed on the GRCh37 assembly identified 3556 disease susceptibility variants and 15 risk alleles with MAF < 1% (Chen and Butte 2011).

Representation of segmental duplication

Analysis of SDs suggests CHM1_1.1 has good representation of large, duplicated sequences. Using whole-genome assembly comparison (WGAC), we discovered 54,580 pairwise alignments corresponding to 130.9 Mbp of nonredundant duplications or 4.6% of the genome (Supplemental Table S4). Intrachromosomal events comprise a majority of the SDs with 99.7 Mbp in contrast to 57.3 Mbp of interchromosomal duplications. Additionally, intrachromosomal alignments are generally longer and more similar than interchromosomal alignments (Supplemental Fig. S1). Both of these patterns are consistent with our previous WGAC analysis of GRCh37 (Sudmant et al. 2013). Using an alternative approach to detect SD based on read depth analysis (WSSD) (Bailey et al. 2001), we identified 124.6 Mbp of duplicated sequences (4.4% of the genome). These WSSD duplications supported 89.5 of 96.1 Mbp (93%) WGAC duplications that were also ≥ 10 kbp and > 94% identity (Supplemental Fig. S2). Correspondingly, 119.6 Mbp of WSSD duplications (96%) overlapped or occurred within 5 kbp of a WGAC duplication. To determine how CHM1.1 WGAC duplications compared to duplications from GRCh37, we remapped the

Table 3. Gene predictions and models for GRCh37.p13, GRCh37 primary assembly, CHM1_1.1, and HuRef assemblies

Feature                                  GRCh37.p13   GRCh37.p13 Primary Assembly   CHM1_1.1   HuRef

Genes and pseudogenes                    40,158       39,947                        39,009     38,070
  Protein-coding                         20,176       20,072                        19,892     19,668
  Noncoding                              7,667        7,627                         7,529      7,151
  Pseudogenes                            12,315       12,248                        11,588     11,251
  Genes with variants                    15,068       14,994                        8,718      8,620
  Placed on multiple assembly units      2,665        na                            na         na
mRNAs                                    67,517       64,734                        35,142     34,843
  Fully supported                        67,267       64,514                        34,941     34,630
  With > 5% ab initio                    203          181                           162        180
  Partial                                116          138                           602        4,357
  Placed on multiple assembly units      2,171        na                            na         na
  Known RefSeq (NM_)                     34,632       34,606                        34,367     34,212
  Model RefSeq (XM_)                     32,885       30,128                        775        631
  Model RefSeq (XM_) with correction     17           15                            21         15
Other RNAs                               15,063       14,151                        11,408     10,854
  Fully supported                        13,599       13,123                        10,384     9,960
  With > 5% ab initio                    0            0                             0          0
  Partial                                2,369        2,370                         2,401      2,557
  Placed on multiple assembly units      279          na                            na         na
  Known RefSeq (NR_)                     6,623        6,618                         6,458      6,328
  Model RefSeq (XR_)                     7,011        6,538                         3,946      3,650
CDSs                                     68,035       65,099                        35,522     35,173
  Fully supported                        67,267       64,514                        34,941     34,630
  With > 5% ab initio                    220          197                           175        192
  Partial                                96           118                           313        2,966
  Known RefSeq (NP_)                     34,632       34,605                        34,365     34,185
  Model RefSeq (XP_)                     32,885       30,128                        775        631
  Model RefSeq (XP_) with correction     17           15                            21         15


WGAC alignments from CHM1.1 to GRCh37.p13 with NCBI's remap API. After remapping and omitting coordinates from patches, there were 138 Mbp of CHM1.1 duplications. The two assemblies shared 123 Mbp of duplications corresponding to 89% of CHM1.1 duplications and 90% of GRCh37 (Fig. 3).

Figure 3. (A) Comparison of segmental duplications (SDs) in GRCh37 and CHM1.1 assemblies predicted by WGAC analysis by chromosome. The duplication content is comparable between GRCh37 and CHM1_1.1 assemblies indicating good assembly quality. (B) Venn diagram of SDs in GRCh37 and CHM1_1.1 assemblies shows that most duplications are shared between the assemblies.

Identification of misassemblies

The goal for this project is a completely closed reference assembly containing no gaps. Therefore, it is critical for us to identify the extent of misassembly as well as the specific regions involved for targeted correction. We have already begun the process of loading the assembly and curation regions into the GRC curation database and framework. We performed three separate analyses to assess the integrity of the assembly and identify potential misassemblies.

Identification of heterozygous SNVs

CHM1 is an essentially homozygous resource. Thus, there should be no heterozygous SNVs identified upon aligning the CHM1 reads to GRCh37, and there should be no SNVs identified when these reads are aligned to the CHM1_1.1 assembly. We were therefore interested in using SNV detection to identify potentially misassembled regions in both GRCh37 and CHM1_1.1. First, we aligned the Illumina reads from CHM1 libraries to the GRCh37.p13 primary assembly and identified 99,572 heterozygous sites and 2,445,270 homozygous sites. We stratified heterozygous SNVs based on whether they overlapped repetitive or low complexity sequence (Supplemental Table S5). A recent study demonstrated that up to 60% of heterozygous SNVs called from CHM1 Illumina reads aligned to the reference are within low complexity regions (including SDs) of the human genome (Li 2014). We focused on 25,529 heterozygous variants that did not fall within a repetitive sequence (heterozygous nonrepetitive: HNR variants), as these may be sites of cryptic duplication in the reference sequence or structural variation in CHM1. The HNR variants were overlapped with the RefSeq annotation (Supplemental Table S6) and the functional consequences of each variant were predicted (Fig. 4). The genes with the most HNR variants were then compared to genes missing copies in the reference and genes with significantly population stratified copy number (high Vst) from Sudmant et al. (2010). Genes with known missing copies in the reference assembly, such as GPRIN2 and DUSP22, have 20 and 56 HNR variants, respectively, while high Vst genes such as PDE4DIP have 267 HNR variants. The gene with the most HNR variants (N = 618) is PRIM2, which is part of interchromosomal duplications of Chromosomes 6 and 3 and represents cryptic SDs in the GRCh37 reference genome (Genovese et al. 2013). Additionally, two regions that were incorrectly represented in GRCh37 and subsequently resolved in GRCh38 using the CHM1 derived BAC library, SRGAP2 (Dennis et al. 2012) and IGH (Watson et al. 2013), both had high counts of HNR variants (39 and 54, respectively), providing additional support for the hypothesis that heterozygous calls are indicative of reference assembly errors. The majority of the heterozygous calls are errors that arise during variant detection due to paralogous sequences mapping to low complexity regions.

Figure 4. Functional consequences of CHM1 heterozygous variants not in repetitive sequence (HNR variants). Approximately 97% of HNR variants are intergenic or intronic. Of the remaining 3% of other variants, ~48% are in the 3′ or 5′ UTR, 17% are silent, and 35% are coding (missense, nonsense, essential splice site).

We then aligned the Illumina reads from the CHM1 libraries to the CHM1_1.1 assembly. A total of 86,544 SNVs were called, and 79% of these variants overlap repetitive sequence (RepeatMasker and WGAC) (Supplemental Table S7). There is a significant enrichment of variants in repetitive sequence compared to sequence not annotated as repetitive (1000 permutations, simulation based P-value < 0.001). Thirty-four regions totaling 49 Mb have SNV density per kilobase two standard deviations higher than the mean SNV density per kilobase of 0.03 (Supplemental Table S8). Sixty-four percent of the bases in SNV-rich regions are annotated as repetitive. There are 294 unique RefSeq and 198 unique Gnomon genes in SNV-rich regions, including the beta-defensin gene cluster on Chromosome 8 and NBPF1 on Chromosome 1 (Supplemental Table S9). These regions are highly duplicated and the variant calls likely represent paralogous sequence variants.

CH17 BAC ends mapped to CHM1_1.1 assembly

We aligned the BAC end sequences derived from the CHORI-17 BAC library to the CHM1_1.1 assembly. As this is the same DNA source as the assembly, there should be no structural variation. The majority of placements were concordant (96.22%), suggestive of a high-quality assembly; however, regions with multiple discordant alignments may represent assembly errors. A query set of 306,838 BAC end sequences representing 158,396 unique clones from the CH17 BAC library was aligned to the CHM1_1.1 assembly (Supplemental Table S10). We identified 1192 regions with 3927 unique clones that likely contained assembly errors based on an unexpected size distribution of the aligned BACs. Among unique discordant clones, 2840 suggested a deletion in the assembly and 443 suggested additional sequence in the CHM1_1.1 assembly not represented in the BAC resource. The regions demonstrating insertion may be due to instability in BAC clones. On average, there are significantly more bases in SDs (based on WGAC) in the single discordant and multiple mapped clone ends compared to the single concordant clone ends (mean: 0.24, 0.96, 0.04; standard deviations: 0.18, 0.14, 0.02, respectively, for single discordant, multiple, and single concordant; Student's t-test, two tailed P = 0 for each comparison). The remaining unique discordant placements were comprised of incorrectly oriented ends, indicating that the assembly and clone sequences are inverted relative to one another.

CH17 clones with discordant placements on the CHM1_1.1 assembly may be used to identify regions misassembled due to errors in the reference or genomic variation. For example, in the SMA duplication region at 5q13.3 (NC_018916.2) (Supplemental Fig. S3), the GRCh37 reference chromosome represents a single resolved SMA haplotype (Schmutz et al. 2004). However, many CH17 clones aligning to the corresponding region of the CHM1_1.1 assembly have discordant placements that are characterized by inversions and size discrepancies, suggesting that the CHM1_1.1 assembly does not faithfully represent the CHM1 genome at this locus. This observation is consistent with the known variability of this genomic region in the human population, which is associated with its complex SD structure (Ogino et al. 2004). It should be noted, however, that the clone placements located within the local BAC assemblies in this region are largely concordant, whereas those associated with WGS contigs are discordant. This result demonstrates how the use of sequence from large insert clones can resolve regions too complex for even the reference-guided assembly of WGS contigs. Assembly with additional BAC clones will likely be required to close the existing gaps and fully resolve the CHM1 genomic sequence in these complex regions.

Alignment of a long read data set

To identify errors in the CHM1_1.1 genome assembly (GCF_000306695.2) introduced as a consequence of errors in the GRCh37 primary assembly unit that was used to guide its assembly, we aligned CHM1 PacBio reads, which became available after the CHM1_1.1 assembly was complete and annotated, to the CHM1_1.1 assembly. We hypothesized that these alignments in such regions of CHM1_1.1 would exhibit one or more of the following characteristics: (1) low coverage with respect to coverage in surrounding regions; (2) sharp boundaries at which alignment coverage drops off; or (3) inversions. Low coverage is often associated with highly fragmented assembly regions, which are themselves hallmarks of assembly problems (though they may not necessarily reflect errors introduced by GRCh37). Sharp boundaries could occur at component boundaries (indicative of GRCh37 tiling path errors) or within assembly components (indicative of component assembly errors in GRCh37). Although other assembly features (i.e., repetitive elements or structural variation) can also result in read alignments having similar characteristics, such regions should be enriched for assembly errors.

To identify CHM1_1.1 assembly errors corresponding to unrecognized GRCh37 errors, we focused on CHM1_1.1 assembly sites where alignment coverage dropped off sharply. To this end, we produced a list of regions where there were PacBio aligned reads that met the above criteria, and we refer to these reads as “cliffs.” We focused on bins where the cliff count is greater than or equal to 10 and the depth is less than 2× the coverage (< 108) to eliminate artifacts from repetitive elements. There are 274 loci where cliffs are within 1 kbp of the component boundary and 2109 loci where cliffs are > 1 kbp from the component boundary (Supplemental Data: PacBio). Using this approach we are able to clearly visualize regions with assembly errors such as the one on Chromosome 11, where two tiling path components are inverted in the CHM1_1.1 assembly and require correction (Fig. 5).

Discussion

There has been a dramatic decrease in sequence cost with a concomitant increase in throughput leading to the availability of thousands of sequenced genomes and exomes. However, analysis of individual genomes depends upon the availability of a high-quality reference assembly. Despite the high quality of the human


Figure 5. Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio ‘‘cliff’’ reads where the alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are marked in yellow, green, and dark blue. These alignments overlap by < 10% on each of the reads. The split alignments for these three reads suggest that the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the top of the image. The other PacBio reads in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The bottom light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should be placed.
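The selection of alignment bins enriched for PacBio “cliff” reads, as visualized in Figure 5 and described in the long-read analysis (cliff count of at least 10, depth below twice the coverage), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Bin` record, the `candidate_bins` function, the toy bins, and the mean coverage of 54 (inferred here from the “< 108” depth cap quoted in the text) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    """Hypothetical per-bin summary of PacBio alignments."""
    chrom: str
    start: int
    cliff_count: int  # reads whose alignment ends sharply in this bin
    depth: float      # alignment depth over the bin


def candidate_bins(bins, mean_coverage, min_cliffs=10, max_depth_factor=2.0):
    """Keep bins with many cliff reads but without inflated depth.

    Bins at or above max_depth_factor * mean_coverage are dropped, since
    excess depth is assumed to reflect repetitive-element artifacts.
    """
    depth_cap = max_depth_factor * mean_coverage
    return [
        b for b in bins
        if b.cliff_count >= min_cliffs and b.depth < depth_cap
    ]


# Toy bins (not CHM1 data); assumed mean coverage of 54 gives a cap of 108.
toy_bins = [
    Bin("chr11", 0, 12, 60.0),      # kept: enough cliffs, normal depth
    Bin("chr11", 1000, 25, 300.0),  # dropped: depth suggests a repeat
    Bin("chr11", 2000, 3, 40.0),    # dropped: too few cliff reads
]
selected = candidate_bins(toy_bins, mean_coverage=54.0)
print([b.start for b in selected])
```

Applying the depth cap alongside the cliff-count threshold mirrors the stated intent of excluding artifacts from repetitive elements before loci are inspected for component-boundary errors.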

reference assembly, many groups have described shortcomings of this resource, including remaining gaps, single-nucleotide errors, or gross misassembly due to complex haplotypic variation (Eichler et al. 2004; Doggett et al. 2006; Kidd et al. 2010; Chen and Butte 2011; The 1000 Genomes Project Consortium 2012). Both gaps and misassembled regions often arise because the DNA sequence used for the assembly was from multiple diploid sources containing complex structural variation. Because such loci often contain medically relevant gene families, it is important to resolve variation at these sites, as the structural and single-nucleotide diversity is likely associated with clinical phenotypes (Eichler et al. 2004). Thus, to resolve structurally complex regions and provide a more effective reference resource for such loci, we combined WGS data and BAC sequences from a haploid DNA source to create a single haplotype assembly of the human genome.

Haplotype information is critical to interpreting clinical and personal genomic information as well as genetic diversity and ancestry data, and most previously sequenced individual human genomes are not haploresolved. The current reference human genome sequence represents a mosaic that further complicates haplotyping; within a BAC clone there is a single haplotype representation, but haplotypes can switch at BAC clone junctions. By utilizing an essentially haploid DNA source, we resolved a single haplotype across complex regions of the genome where the reference genome contained a mixture of haplotypes from various sources and/or contained unresolved gaps. For example, a gap on Chromosome 4p14 in GRCh37 (Chr 4: 40296397–40297096) was completely resolved using CHM1 WGS data. The gap was flanked by repetitive elements that were not traversed by a clone. This region has subsequently been updated with a complete tiling path in GRCh38.


The addition of high-quality BAC sequence to our assembly was vital to resolving gaps. For example, in GRCh37 at Chromosome 15q25.2 there was a 79-kbp gap due to over-expansion of a hypervariable region. This region contains many GOLGA6L core duplicon genes (Jiang et al. 2007) and highly identical SDs. RP11 BAC clones on one side and RP13 BAC clones on the other side flanked the gap. Using the BAC-based sequence-resolved CH17 haplotype, the gap was filled in GRCh38 (Supplemental Fig. S4). A preliminary analysis of PacBio data shows this region remains unresolved even using long read sequencing. This underscores the importance of curation and employing multiple sequencing strategies to obtain an accurate genome representation.

Despite the high quality of the CHM1_1.1 assembly, we did identify regions that require further improvement. Some of these problems are due to the repetitive nature of the loci, while others are due to using GRCh37 to guide the CHM1_1.1 assembly. The availability of diverse, assembly-independent resources, including the recently released long read data set from PacBio, provides a pathway for problem identification and correction. Unfortunately, the CHM1_1.1 assembly was completed before the PacBio data were released; however, we anticipate that the next version of the assembly will incorporate this valuable data set. Additionally, a preliminary analysis of data from BioNano Genomics also demonstrates that a second, independent source of data can be useful for improving the assembly (Supplemental Data: BioNano Analysis). The GRC has established the infrastructure to support assembly curation and the development of highly refined reference assemblies, as evidenced by the release of two successive human genome assemblies (GRCh37 and GRCh38) and a mouse genome assembly (GRCm38). We have already begun using these resources to improve the CHM1_1.1 assembly.

We chose a reference-guided assembly method rather than performing a de novo assembly of the short WGS reads. An analysis of a de novo assembly from short reads using the SOAP algorithm (Li et al. 2008) found significant contamination and missing sequences (Alkan et al. 2011). In general, the de novo assemblies were ~16% shorter than the reference genome, and >99% of the previously validated duplications were missing, translating to more than 2300 missing coding exons. Another human assembly from massively parallel sequences using ALLPATHS-LG (Gnerre et al. 2011) showed improvements over the SOAP assembly but still only covered ~40% of SDs. As described above, the gene and repetitive element coverage of the CHM1_1.1 assembly is comparable to GRCh37. We did not do a formal comparison to GRCh38 because many of the CHM1 BAC tiling paths are used in both assemblies, meaning they are no longer completely independent. Approximately 29 Mb of clone sequence and 134 kbp of WGS sequence from the CHM1_1.1 assembly have been incorporated into the GRCh38 primary assembly, while >13 Mb of clone sequence has been utilized for alternative sequence representations. The somewhat fragmented nature of the CHM1_1.1 assembly means it is not ready to become the Primary assembly in the GRCh series of reference assemblies; however, our goal is to improve this assembly so that it could serve this role.

A single haplotype reference assembly will not be sufficient for alignment and variant detection in large-scale human genomic studies. The amount of individual-specific sequence between a random pair of human individuals ranges between 1.8 and 4 Mb (Li et al. 2010). The GRC formalized the concept of multiallelic representation of complex genome regions with the release of GRCh37. The newest reference genome, GRCh38, contains 261 alternative sequence representations at 178 regions, many of which were resolved using the CHM1 data. A recent study provides the basis for a more formal graph representation (Paten et al. 2014), but a great deal of tool development needs to occur before we can formally move to such an assembly representation. While this development occurs, the current multiallelic reference provides data that allow us to explore complex genomic regions. This is important for analyzing genes that are only represented on the alternate loci, as they need to somehow be represented in the assembly. In the current GRCh38 there are 153 genes present only on the alternate loci, some of which are medically relevant, such as HLA-DRB3 and LILRA3. A linear genome representation would not be able to correctly handle these cases. The use of the single haplotype CHM1 resource has proven quite valuable in resolving several complex regions of the human genome. In many of these cases, the GRCh37 representation was a mixture of several haplotypes and not likely found in any individual. We plan on continuing to develop this resource in an effort to ensure that we have at least one correct representation of all loci in the human genome.

Methods

Cell line

CHM1 cells were grown in culture from one such conception at Magee-Women's Hospital (Pittsburgh, PA) after parental consent and IRB approval. Cryogenically frozen cells from this culture were grown and transformed using human telomerase reverse transcriptase (TERT) to develop a cell line. This cell line retains a 46,XX karyotype and complete homozygosity. It was subsequently used for genomic research by multiple investigators and was also used to prepare a BAC library (CHORI 17; https://bacpac.chori.org/) for further research.

Illumina sequencing

We performed whole-genome shotgun sequencing on the CHM1 DNA. KAPA qPCR (KK4854, Kapa Biosystems) was performed on the LightCycler 480 System (Roche Life Sciences) to quantify libraries and determine the appropriate concentration to produce optimal recommended cluster density on a HiSeq 2000 V2 or V3 2 × 100-bp sequencing run. HiSeq 2000 V2 and V3 runs were completed according to the manufacturer's recommendations (Illumina). We generated >617 Gb of sequence used for the assembly. The average insert size was 315 bp for three libraries, 3 kbp for three libraries, and 8 kbp for two libraries.

Assembly

Assembly of the CHM1 genome used deep coverage WGS sequence reads generated using the Illumina HiSeq platform. These data are publicly available in NCBI's Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra) under project SRP017546. The project has ten experiments, of which one was a pilot experiment using 25-bp unpaired reads while the remaining nine were all paired-end reads. Eight experiments had a total of 31 runs and were used in producing the assembly (Supplemental Table S11). Reads were aligned to the GRCh37 primary assembly using the SRPRISM v2.3 aligner.

A reference-guided assembly was produced using ARGO v1.0. This assembly has 2,818,728,129 bp in 47,737 contigs with an N50 of 139,647 bp. Both SRPRISM and ARGO were developed at NCBI, but are not yet published. SRPRISM v2.5 is available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/srprism/. Briefly, SRPRISM creates an index on the reference genome and uses the index to find locations on the genome to do extensions. It has resource


requirements and performance characteristics comparable to the fastest available aligners, yet provides explicit criteria for search sensitivity and reports all results that have the same quality.

ARGO uses conservative heuristics that take into account the insert size and read orientation to produce a most likely sequence for the assembly.

In addition to the Illumina data, we selected BAC clones that overlapped SDs and corresponded to regions in SDs, known structural variation, and loci around gaps. Two hundred ninety-six clones in 45 tiling paths and 104 singleton clones were provided. By mapping the clone information to the reference-guided assembly using BLAST and manually reviewing the alignments to decide the best location in the assembly to incorporate the clone sequence, some of the worst regions of the assembly were significantly improved. Four singleton clones could not be used as they are significantly diverged from the assembly. Fourteen additional clones were redundant with other clones. Clone AC243629.2 has an internal expansion that was discovered after the assembly release and has subsequently been removed. After incorporating clone information, the assembly had 2,846,046,639 bp in 41,406 contigs with an N50 of 143,718 bp. Prior to submission of the assembly to GenBank, the contigs were subsequently filtered to remove some WGS that was redundant to one of the clone paths, to remove small WGS contigs at chromosome termini, to trim terminal Ns from WGS contigs, and to accommodate a newly finished clone component, and then scaffolded according to alignment with the GRCh37 primary assembly.

The reference assembly process includes (1) taking the aligned reads and building an insert size histogram implied by the alignments, (2) determining valid insert size range and orientation per run, (3) filtering alignments to retain alignments that are consistent with the insert size range and orientation, (4) subdividing the reference into regions of matches, mismatches, and gaps depending on whether all reads aligning to the reference position match the base, at least some reads do not match the base, or the base is not aligned to by any read, respectively, and (5) producing sequence for each region when possible. Sequence for matching regions is reproduced as it is in the reference. Sequence for regions with mismatches is produced by doing a multiple alignment of the portions of reads that are aligning to that region, with the sequence with most copies in the multiple alignment retained as the sequence for the region. Sequence for gaps between two consecutive contigs on the reference is produced by using reads whose one mate aligns to either contig and the other mate has a 30-mer matching the end of one of the contigs such that the mate results in extension of the contig and preserves insert size and orientation constraints. This process may or may not result in filling the gap completely.

Gene annotation

The CHM1_1.1 assembly was masked using RepeatMasker (Smit et al. 1996–2010) and annotated using the NCBI Eukaryotic Genome Annotation Pipeline. Briefly, the assembly is masked using RepeatMasker and then aligned to a set of same-species RefSeq transcripts and genomic sequences to directly annotate the gene, RNA, and CDS features. The assembly is also aligned to Gnomon gene prediction models. Gnomon is a two-step gene prediction program that assembles overlapping alignments into "chains" followed by a prediction step that extends the chains into complete models and creates full ab initio models, using a hidden Markov model (HMM). If the RefSeq and Gnomon models are predicted to have the same splice pattern, the RefSeq transcripts are given precedence. Gnomon predictions are included in the final set of annotations if they do not share all splice sites with a RefSeq transcript and if they meet certain quality thresholds. Novel genes identified were remapped to GRCh37 using NCBI's remap tool (http://www.ncbi.nlm.nih.gov/genome/tools/remap); these mappings are listed in Supplemental Table S12.

Segmental duplication annotation

We applied whole-genome assembly comparison (WGAC) and read depth CNV (WSSD) methods to discover SDs in the CHM1_1.1 reference assembly. For WGAC analysis, we eliminated all repetitive sequences from the assembly as annotated by RepeatMasker, identified alignments >1 kbp and with higher than 90% identity, and refined alignments into pairwise duplication calls as previously described (Bailey et al. 2001). Duplication and RepeatMasker files are in Supplemental Data: Duplication Analysis.

For read depth CNV analysis, we aligned Illumina whole-genome shotgun (WGS) reads from 11 lanes (SRR642629, SRR642634, SRR642635, SRR642638, SRR642639, SRR642640, SRR642641, SRR642642, SRR642643, SRR642683, SRR642746) with mrsFAST (v. 2.5.0.4) (Hach et al. 2010) and called raw copy number across 1-kbp windows as previously described (Sudmant et al. 2010). From these raw copy number calls, we identified duplications as regions with copy number ≥ 3 and ≥ 10 kbp of non-repeat, non-gap sequence.

Assembly–assembly alignment

We aligned the CHM1_1.1 assembly to the GRCh37 and HuRef assemblies using the two-phase NCBI pipeline. Aligning the two assemblies using BLAST generates the first phase alignments, and any locus on the query assembly must have 0 or 1 alignment to the target assembly. Additionally, we use in-database masking through precomputed WindowMasker masked regions. BLAST alignments are then trimmed and post-processed to remove low quality and spurious alignments. Chromosome-to-chromosome alignments are favored over chromosome-to-scaffold or scaffold-to-scaffold alignments. Alignments based on common components are then merged into the longest, consistent stretches possible, resulting in a set of alignments called the "Common component set." We then eliminate the remaining BLAST alignments that are redundant with the common component alignments. The remaining alignments are then merged independently of the common component alignments and redundant alignments are removed. The two alignment sets are then combined into a single set of alignments and then sorted to select the "First Pass set," which are ranked to favor, in order: (1) common component alignments, (2) chromosome-to-chromosome alignments, (3) alternate-to-alternate alignments, (4) chromosome-to-alternate alignments, and (5) count of identities. Finally, only alignments with nonconflicting query/subject ranges are kept for the First Pass set. Conflicting alignments are reserved for evaluation in the "Second Pass." In order to capture duplicated sequences, we do a "Second Pass" to capture large regions (>1 kb) within an assembly that have no alignment, or a conflicting alignment, in the First Pass. In the "Second Pass" alignments, a given region in the query assembly can align to more than one region in the target assembly.

Variant analysis

For variant analyses, Illumina reads from CHM1 genomic DNA were mapped to the GRCh37 primary assembly reference using BWA version 0.5.9 (Li and Durbin 2009). Single-nucleotide variants (SNVs) were called using both SAMtools (Li et al. 2009) and VarScan v2.2.9 (Koboldt et al. 2012). Variants were filtered to remove false positives due to alignment and sequencing errors using the values in Supplemental Table S13. The Illumina reads were then aligned to the CHM1_1.1 assembly and variants were


called using the same parameters as above. We overlapped the variants with RefSeq and Gnomon gene annotations as well as SDs (based on WGAC) and RepeatMasker. SNV density per kilobase and transition:transversion ratio (Ts:Tv) were calculated in 1-Mb nonoverlapping windows using vcftools version 0.1.11 (http://vcftools.sourceforge.net/).

BAC end sequence mapping

BAC end sequences from the CH17 BAC library generated from the CHM1 cell line were aligned to the CHM1_1.1, GRCh37, and GRCh38 assemblies and clone placements generated as described in Schneider et al. (2013). BAC end mappings are provided in Supplemental Data: BAC end mapping. On the CHM1_1.1 assembly, the average insert length was 208,638 bp and the standard deviation was 20,197 bp. On GRCh37, the average insert length was 208,637 bp and the standard deviation was 20,149 bp. BAC ends from single concordant, single discordant, and multiply mapped clones were evaluated for SD content and overlapped with gene annotation from RefSeq and Gnomon from the CHM1_1.1 assembly.

Clinical allele analysis

We obtained data from the NHGRI GWAS catalog using the UCSC Genome Browser track intersected with the dbSNP137 track. If the risk allele was reported on the negative strand, we were able to use the dbSNP137 information to correctly assign risk alleles to the positive strand. Additionally, we downloaded the vcf file containing the ClinVar data from NCBI. We took the unique union of risk alleles from both sources and remapped them to CHM1_1.1 coordinates using NCBI default remap parameters. If there were two or more locations we chose the preferred mapping or discarded both. We then compared the risk allele at each locus with the CHM1 genotype.

PacBio alignment

We obtained CHM1 reads from the Pacific Biosciences website and aligned them to the CHM1_1.1 assembly using BLASR with the following parameters (-nproc 4 -sam -clipping soft -bestn 2 -minMatch 12 -affineAlign -sortRefinedAlignments). Read statistics are outlined in Supplemental Table S14, and a graph of the read length distribution is in Supplemental Figure S5. To call cliff regions, we required that a PacBio read must have two and only two alignments on CHM1_1.1, both alignments must be on the same CHM1_1.1 sequence, and one of the two alignments must meet the criteria of "Score ≤ −2.0*ReadLength." We also required query coverage of the smaller of two segments be ≥ 10% and that the smaller alignment must still involve at least 10% of the PacBio read. Two alignments could not overlap each other by >10%, and unique coverage had to be ≥ 50%. Coverage drop-offs that occurred within 1 kbp of a CHM1_1.1 boundary were flagged.

The PacBio reads used for this analysis aligned to the CHM1_1.1 assembly at an average coverage depth of 54×. As expected, coverage at regions containing repetitive sequence was notably higher. To improve our likelihood of detecting examples of misassemblies, we restricted our review of this list to sites where surrounding coverage did not indicate the presence of repetitive sequence and the drop-off in coverage was roughly equivalent to surrounding coverage.

Data access

All Illumina sequence, assembly, and clone data have been submitted to the NCBI GenBank (http://www.ncbi.nlm.nih.gov/genbank) and Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra). The GenBank Assembly ID for the CHM1_1.1 assembly is GCA_000306695.2. Accessions for clones can be found in Supplemental Table S1. Illumina sequencing data can be found under study accession SRP017546, and individual library accessions are listed in Supplemental Table S11. Gene annotation is available from the NCBI genome annotation FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/). PacBio data are available from http://datasets.pacb.com/2014/Human54x/fast.html and have also been submitted to SRA under accession number SRX533609. The BioNano Genomics map contigs are available in the Supplemental Material and at http://genome.wustl.edu/pub/supplemental/KMS_genome_research_2014/. Additional analyses and data sets are available in the Supplemental Material and at figshare (http://dx.doi.org/10.6084/m9.figshare.1091429).

Competing interest statement

E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and was an SAB member of Pacific Biosciences, Inc. (2009–2013) and SynapDx Corp. (2011–2013).

Acknowledgments

We thank Pieter de Jong for the creation of the CHORI-17 BAC library used extensively in this project. We acknowledge Nathan Bouk for his expertise in sequence alignment and insightful discussions of alignment data. We acknowledge the efforts of the production and finishing groups at The Genome Institute, particularly Aye Wollom, Susie Rock, Milinn Kremitzki, and Derek Albrecht. We are greatly appreciative of BioNano Genomics and Dr. Pui Kwok for generously sharing the genomic map and alignments to the CHM1_1.1 assembly. E.E.E. is an investigator of the Howard Hughes Medical Institute. This work was supported by NIH Grants 2R01HG002385 and 5P01HG004120 to E.E.E.

References

The 1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
Alkan C, Sajjadian S, Eichler EE. 2011. Limitations of next-generation genome sequence assembly. Nat Methods 8: 61–65.
Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. 2001. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11: 1005–1017.
Barbouti A, Stankiewicz P, Nusbaum C, Cuomo C, Cook A, Hoglund M, Johansson B, Hagemeijer A, Park SS, Mitelman F, et al. 2004. The breakpoint region of the most common isochromosome, i(17q), in human neoplasia is characterized by a complex genomic architecture with large, palindromic, low-copy repeats. Am J Hum Genet 74: 1–10.
Bosnakovski D, Xu Z, Gang EJ, Galindo CL, Liu M, Simsek T, Garner HR, Agha-Mohammadi S, Tassin A, Coppee F, et al. 2008. An isogenetic myoblast expression screen identifies DUX4-mediated FSHD-associated molecular pathologies. EMBO J 27: 2766–2779.
Carvalho CM, Lupski JR. 2008. Copy number variation at the breakpoint region of isochromosome 17q. Genome Res 18: 1724–1732.
Chen R, Butte AJ. 2011. The reference human genome demonstrates high risk of type 1 diabetes and other disorders. Pac Symp Biocomput 2011: 231–242.
Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, Chen HC, Agarwala R, McLaren WM, Ritchie GR, et al. 2011. Modernizing reference genome assemblies. PLoS Biol 9: e1001091.
Dennis MY, Nuttle X, Sudmant PH, Antonacci F, Graves TA, Nefedov M, Rosenfeld JA, Sajjadian S, Malig M, Kotkiewicz H, et al. 2012. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell 149: 912–922.
Doggett NA, Xie G, Meincke LJ, Sutherland RD, Mundt MO, Berbari NS, Davy BE, Robinson ML, Rudd MK, Weber JL, et al. 2006. A 360-kb interchromosomal duplication of the human HYDIN locus. Genomics 88: 762–771.


Eichler EE, Clark RA, She X. 2004. An assessment of the sequence gaps: unfinished business in a finished human genome. Nat Rev Genet 5: 345–354.
Fan JB, Surti U, Taillon-Miller P, Hsie L, Kennedy GC, Hoffner L, Ryder T, Mutch DG, Kwok PY. 2002. Paternal origins of complete hydatidiform moles proven by whole genome single-nucleotide polymorphism haplotyping. Genomics 79: 58–62.
Genovese G, Handsaker RE, Li H, Altemose N, Lindgren AM, Chambert K, Pasaniuc B, Price AL, Reich D, Morton CC, et al. 2013. Using population admixture to help complete maps of the human genome. Nat Genet 45: 406–414.
Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, et al. 2011. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci 108: 1513–1518.
Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, Sahinalp SC. 2010. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat Methods 7: 576–577.
Hewitt JE, Lyle R, Clark LN, Valleley EM, Wright TJ, Wijmenga C, van Deutekom JC, Francis F, Sharpe PT, Hofker M, et al. 1994. Analysis of the tandem repeat locus D4Z4 associated with facioscapulohumeral muscular dystrophy. Hum Mol Genet 3: 1287–1295.
Hoffner L, Surti U. 2012. The genetics of gestational trophoblastic disease: a rare complication of pregnancy. Cancer Genet 205: 63–77.
International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.
Jiang Z, Tang H, Ventura M, Cardone MF, Marques-Bonet T, She X, Pevzner PA, Eichler EE. 2007. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat Genet 39: 1361–1368.
Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, et al. 2008. Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56–64.
Kidd JM, Sampas N, Antonacci F, Graves T, Fulton R, Hayden HS, Alkan C, Malig M, Ventura M, Giannuzzi G, et al. 2010. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods 7: 365–371.
Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. 2012. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 22: 568–576.
Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, et al. 2007. Paired-end mapping reveals extensive structural variation in the human genome. Science 318: 420–426.
Kyo K, Muto T, Nagawa H, Lathrop GM, Nakamura Y. 2001. Associations of distinct variants of the intestinal mucin gene MUC3A with ulcerative colitis and Crohn's disease. J Hum Genet 46: 5–20.
Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. 2014. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42: D980–D985.
Li H. 2014. Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30: 2843–2851.
Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760.
Li R, Li Y, Kristiansen K, Wang J. 2008. SOAP: short oligonucleotide alignment program. Bioinformatics 24: 713–714.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.
Li R, Li Y, Zheng H, Luo R, Zhu H, Li Q, Qian W, Ren Y, Tian G, Li J, et al. 2010. Building the sequence map of the human pan-genome. Nat Biotechnol 28: 57–63.
McPherson JD, Marra M, Hillier L, Waterston RH, Chinwalla A, Wallis J, Sekhon M, Wylie K, Mardis ER, Wilson RK, et al. 2001. A physical map of the human genome. Nature 409: 934–941.
Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, Devine SE. 2006. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16: 1182–1190.
Morgulis A, Gertz EM, Schaffer AA, Agarwala R. 2006. WindowMasker: window-based masker for sequenced genomes. Bioinformatics 22: 134–141.
Ogino S, Wilson RB, Gold B. 2004. New insights on the evolution of the SMN1 and SMN2 region: simulation and meta-analysis for allele and haplotype frequency calculations. Eur J Hum Genet 12: 1015–1023.
Pan Q, Tian Y, Li X, Ye J, Liu Y, Song L, Yang Y, Zhu R, He Y, Chen L, et al. 2013. Enhanced membrane-tethered mucin 3 (MUC3) expression by a tetrameric branched peptide with a conserved TFLK motif inhibits bacteria adherence. J Biol Chem 288: 5407–5416.
Paten B, Novak A, Haussler D. 2014. Mapping to a reference genome structure. arXiv:1404.5010.
Pratt WS, Crawley S, Hicks J, Ho J, Nash M, Kim YS, Gum JR, Swallow DM. 2000. Multiple transcripts of MUC3: evidence for two genes, MUC3A and MUC3B. Biochem Biophys Res Commun 275: 916–923.
Robledo R, Orru S, Sidoti A, Muresu R, Esposito D, Grimaldi MC, Carcassi C, Rinaldi A, Bernini L, Contu L, et al. 2002. A 9.1-kb gap in the genome reference map is shown to be a stable deletion/insertion polymorphism of ancestral origin. Genomics 80: 585–592.
Ryan DP, da Silva MR, Soong TW, Fontaine B, Donaldson MR, Kung AW, Jongjaroenprasert W, Liang MC, Khoo DH, Cheah JS, et al. 2010. Mutations in potassium channel Kir2.6 cause susceptibility to thyrotoxic hypokalemic periodic paralysis. Cell 140: 88–98.
Schmutz J, Martin J, Terry A, Couronne O, Grimwood J, Lowry S, Gordon LA, Scott D, Xie G, Huang W, et al. 2004. The DNA sequence and comparative analysis of human chromosome 5. Nature 431: 268–274.
Schneider VA, Chen HC, Clausen C, Meric PA, Zhou Z, Bouk N, Husain N, Maglott DR, Church DM. 2013. Clone DB: an integrated NCBI resource for clone-associated data. Nucleic Acids Res 41: D1070–D1078.
Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, et al. 2005. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 77: 78–88.
Smit A, Hubley R, Green P. 1996–2010. RepeatMasker Open-2.0.
Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, 1000 Genomes Project, et al. 2010. Diversity of human copy number variation and multicopy genes. Science 330: 641–646.
Sudmant PH, Huddleston J, Catacchio CR, Malig M, Hillier LW, Baker C, Mohajeri K, Kondova I, Bontrop RE, Persengiev S, et al. 2013. Evolution and diversity of copy number variation in the great ape lineage. Genome Res 23: 1373–1382.
Taillon-Miller P, Bauer-Sardina I, Zakeri H, Hillier L, Mutch DG, Kwok PY. 1997. The homozygous complete hydatidiform mole: a unique resource for genome studies. Genomics 46: 307–310.
Teague B, Waterman MS, Goldstein S, Potamousis K, Zhou S, Reslewic S, Sarkar D, Valouev A, Churas C, Kidd JM, et al. 2010. High-resolution human genome structure by single-molecule analysis. Proc Natl Acad Sci 107: 10848–10853.
Watson CT, Steinberg KM, Huddleston J, Warren RL, Malig M, Schein J, Willsey AJ, Joy JB, Scott JK, Graves TA, et al. 2013. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. Am J Hum Genet 92: 530–546.
Zody MC, Jiang Z, Fung HC, Antonacci F, Hillier LW, Cardone MF, Graves TA, Kidd JM, Cheng Z, Abouelleil A, et al. 2008. Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat Genet 40: 1076–1083.

Received July 2, 2014; accepted in revised form October 7, 2014.


Single haplotype assembly of the human genome from a hydatidiform mole

Karyn Meltz Steinberg, Valerie A. Schneider, Tina A. Graves-Lindsay, et al.

Genome Res. 2014 24: 2066-2076; originally published online November 4, 2014. Access the most recent version at doi:10.1101/gr.180893.114

Supplemental Material: http://genome.cshlp.org/content/suppl/2014/10/17/gr.180893.114.DC1

References: This article cites 47 articles, 11 of which can be accessed free at: http://genome.cshlp.org/content/24/12/2066.full.html#ref-list-1

Creative Commons License: This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.



© 2014 Steinberg et al.; Published by Cold Spring Harbor Laboratory Press