Extending Reference Assembly Models Deanna M
Total Page:16
File Type:pdf, Size:1020Kb
Washington University School of Medicine Digital Commons@Becker Open Access Publications 2015 Extending reference assembly models Deanna M. Church Personalis Inc. Valerie A. Schneider National Institutes of Health Karyn Meltz Steinberg Washington University School of Medicine in St. Louis Michael C. Schatz Cold Spring Harbor Aaron R. Quinlan Department of Public Health Sciences and Center for Public Health Genomics See next page for additional authors Follow this and additional works at: https://digitalcommons.wustl.edu/open_access_pubs Recommended Citation Church, Deanna M.; Schneider, Valerie A.; Steinberg, Karyn Meltz; Schatz, Michael C.; Quinlan, Aaron R.; Chin, Chen-Shan; Kitts, Paul A.; Aken, Bronwen; Marth, Gabor T.; Hoffman, Michael M.; Herrero, Javier; Zepeda Mendoza, M. Lisandra; Durbin, Richard; and Flicek, Paul, ,"Extending reference assembly models." Genome Biology.16,. 13. (2015). https://digitalcommons.wustl.edu/open_access_pubs/3703 This Open Access Publication is brought to you for free and open access by Digital Commons@Becker. It has been accepted for inclusion in Open Access Publications by an authorized administrator of Digital Commons@Becker. For more information, please contact [email protected]. Authors Deanna M. Church, Valerie A. Schneider, Karyn Meltz Steinberg, Michael C. Schatz, Aaron R. Quinlan, Chen-Shan Chin, Paul A. Kitts, rB onwen Aken, Gabor T. Marth, Michael M. Hoffman, Javier Herrero, M. Lisandra Zepeda Mendoza, Richard Durbin, and Paul Flicek This open access publication is available at Digital Commons@Becker: https://digitalcommons.wustl.edu/open_access_pubs/3703 Church et al. Genome Biology (2015) 16:13 DOI 10.1186/s13059-015-0587-3 COMMENT Extending reference assembly models Deanna M Church1*, Valerie A Schneider2*, Karyn Meltz Steinberg3, Michael C Schatz4, Aaron R Quinlan5, Chen-Shan Chin6, Paul A Kitts2, Bronwen Aken7, Gabor T Marth8, Michael M Hoffman9,10,11, Javier Herrero12, M Lisandra Zepeda Mendoza13, Richard Durbin14* and Paul Flicek7* introduced errors within regions of complex structural Abstract variation, when sequences unique to one haplotype pre- The human genome reference assembly is crucial for vented construction of clone overlaps. The assembly aligning and analyzing sequence data, and for therefore inadvertently included multiple haplotypes in genome annotation, among other roles. However, the series in some regions [7-9]. models and analysis assumptions that underlie the The Genome Reference Consortium (GRC) began current assembly need revising to fully represent stewardship of the reference assembly in 2007. The GRC human sequence diversity. Improved analysis tools proposed a new assembly model that formalized the in- and updated data reporting formats are also required. clusion of ‘alternative sequence paths’ in regions with complex structural variation, and then released GRCh37 using this new model [10]. The release of GRCh37 also Background marked the deposition of the human reference assembly One of the flagship products of the Human Genome to an International Nucleotide Sequence Database Project (HGP) was a high-quality human reference assem- Collaboration (INSDC) database, providing stable, bly [1]. This assembly, coupled with advances in low-cost, trackable sequence identifiers, in the form of accession high-throughput sequencing, has allowed us to address pre- and version numbers, for all sequences in the assembly. viously inaccessible questions about population diversity, The GRC developed an assembly model that was incorpo- genome structure, gene expression and regulation [2-5]. It rated into the National Centre for Biotechnology Information has become clear, however, that the original models used (NCBI) and European Nucleotide Archive (ENA) assem- to represent the reference assembly inadequately represent bly database that provides a stable identifier for the collec- our current understanding of genome architecture. tion of sequences and the relationship between these The first assembly models were designed for simple sequences that comprise an assembly [11]. Subsequent ‘ ’ linear genome sequences, with little sequence variation minor assembly releases added a number of ‘fix patches’ and even less structural diversity. The design fit the that could be used to resolve mistakes in the reference se- understanding of human variation at the time the HGP quence, as well as ‘novel patches’ that are new alternative began [6]. The HGP constructed the reference assembly sequence representations [10]. by collapsing sequences from over 50 individuals into a The new assembly model presents significant advances single consensus haplotype representation of each chromo- to the genomics community, but, to realize those ad- some. Employing a clone-based approach, the sequence of vances, we must address many technical challenges. The each clone represented a single haplotype from a given new assembly model is neither haploid nor diploid - in- donor. At clone boundaries, however, haplotypes could stead, it includes additional scaffold sequences, aligned switch abruptly, creating a mosaic structure. This design to the chromosome assembly, that provide alternative sequence representations for regions of excess diversity. *Correspondence:[email protected]; [email protected]; Widely used alignment programs, variant discovery and [email protected]; [email protected] 1Personalis Inc., Menlo Park, CA 94025, USA analysis tools, as well as most reporting formats, expect 2National Center for Biotechnology Information, National Library of Medicine, reads and features to have a single location in the refer- NIH, Bethesda, MD 20894, USA ence assembly as they were developed using a haploid 14Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK assembly model. Many alignment and analysis tools 7European Molecular Biology Laboratory, European Bioinformatics Institute, penalize reads that align to more than one location Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK under the assumption that the location of these reads Full list of author information is available at the end of the article © 2015 The article is a work of the United States Government; Title 17 U.S.C 105 provides that copyright protection is not available for any work of the United States government in the United States; licensee BioMed Central Ltd. Church et al. Genome Biology (2015) 16:13 Page 2 of 5 cannot be resolved owing to paralogous sequences in While GRCh37 had only three regions containing nine the genome. These tools do not distinguish allelic dupli- alternative locus sequences, GRCh38 has 178 regions cation, added by the alternative loci, from paralogous containing 261 alternative locus sequences, collectively duplication found in the genome, thus confounding representing 3.6 Mbp of novel sequence and over 150 repeat and mappability calculations, paired-end place- genes not represented in the primary assembly (Table 1). ments and downstream interpretation of alignments in The increased level of alternative sequence representa- regions with alternative loci. tion intensifies the urgency to develop new analysis To determine the efforts needed to facilitate use of the methods to support inclusion of these sequences. Inclu- full assembly, the GRC organized a workshop in con- sion of all sequences in the reference assembly allows us junction with the 2014 Genome Informatics meeting in to better analyze these regions with potentially modest Cambridge, UK (http://www.slideshare.net/GenomeRef). updates to currently used tools and reporting structures. Participants identified challenges presented by the new Although the addition of the alternative loci to current assembly model and discussed ways forward that we analysis pipelines might lead to only modest gains in describe here. analysis power on a genome-wide scale, some loci will see considerable improvement owing to the addition Towards the graph of human variation of significant amounts of sequence that cannot be A graph structure is a natural way to represent a represented accurately in the chromosome assembly population-based genome assembly, with branches in (Figure 1). the graph representing all variation found within the Omission of the novel sequence contained in the alter- source sequences. Most assembly programs internally native loci can lead to off-target sequence alignments, use a graph representation to build the assembly, but and thus incorrect variant calls or other errors, when a ultimately produce a flattened structure for use by sample containing the alternative allele is sequenced and downstream tools [12-14]. Recently, formal proposals for representing a population-based reference graph Table 1 Examples of regions with alternative loci, sequences have been described [15-17]. The newly formed Global within these regions and genes unique to them Alliance for Genomics and Health (GA4GH) is leading Region name Sequence Chromosome Unique genes an effort to formalize data structures for graph-based APOBEC GL383583.2 22 APOBEC3A_B reference assemblies, but it will likely take years to develop MHC GL000250.2 6 HLA-DRB2 the infrastructure and analysis tools needed to support these GL000251.2 HLA-DRB3 new structures and see their widespread adoption across HLA-DRB4 the biological and clinical research communities [18]. GL000252.2 The introduction of alternative loci into the assembly GL000253.2 HLA-DRB7 model provides a stepping-stone towards a full graph- GL000254.2 based representation