UNIVERSITY OF COPENHAGEN

PhD thesis

Hongzhi Cao

MHC region and its related disease study

Method for decipher the polymorphism and fine-mapping disease causal mutations of MHC region

Academic advisor:

Anders Krogh

University of Copenhagen, Denmark

Jun Wang

University of Copenhagen, Denmark

BGI -Shenzhen, China

The PhD School of Science, Faculty of Science, University of Copenhagen

Submitted: 23/03/2015 Dissertation for the degree of philosophiae doctor (PhD)

Department of Biology, University of Copenhagen

Copenhagen, Denmark

And

BGI-Shenzhen

Shenzhen, China

March 2015

Author: Hongzhi Cao

Title: MHC region and its related disease study

Subtitle: Method for decipher the polymorphism and fine-mapping disease causal mutations of MHC region

Academic advisors:

Anders Krogh

University of Copenhagen, Denmark

Jun Wang

University of Copenhagen, Denmark

BGI-Shenzhen, China

The PhD School of Science, Faculty of Science, University of Copenhagen

Submitted: March 2015

2

PREFACE

This PhD project started in 2011 as collaboration between the Department of Biology, University of Copenhagen and BGI-Shenzhen. The work presented here has been performed at both institutions by supervision of Prof. Anders Krogh and Prof. Jun Wang.

3

ACKNOWLEGEMENTS

I would like to express my sincere gratitude to everyone who has helped me during my time as a PhD student. First of all, I would like to thank my supervisors Prof. Anders Krogh and Prof Jun Wang for giving me this opportunity and introducing me to the research field within human genomics and genetics, exploring the mystery of the most complex genomic region – Major histocompatibility complex (MHC), and making the time of my PhD exciting and interesting, in addition to all help, guidance and motivation.

Thanks to my colleagues, friends at BGI and University of Copenhagen for all the help, guidance, collaborations and making me feel very easy and pleasure to work and study here.

I wish to thank all my co-authors, without your helps, comments and guidance with data analysis, interpretation of results and writing, I won’t make it so smooth.

Last, but not least, thanks to the endless support from my family, my gentle and considerate wife Yijie, the kindness and great parents of my wife’s and mine. You are my source of power and happiness. Mostly, to my little princess Little September/小九. Your appearance is the happiest thing of my life and makes my life much more wonderful.

4

ABSTRACT

The major histocompatibility complex (MHC) is one of the most dense regions in the and many disorders, including primary immune deficiencies, autoimmune conditions, infections, cancers and mental disorder have been found to be associated with this region. However, due to a high degree of polymorphisms within the MHC, the detection of variants in this region, and diagnostic HLA typing, still lacks a coherent, standardized, cost effective and high coverage protocol of clinical quality and reliability. And owing to marked linkage disequilibrium of MHC region, /mutations involved in the above diseases have not, with very few exceptions, been identified. Currently popular next-generation sequencing (NGS) technology, comprising massively parallel single-molecule sequencing, can generate billions of bases from amplified single DNA molecules within several days, holding great promise for achieving cost-effective and high- throughput and high accurate genotyping. Based on the NGS platform, by using delicately designed bioinformatics methods, we systematically developed new tools/methods, including high throughput HLA genotyping method – RCHSBT, targeted sequencing based SNV detection as well as HLA gene typing and large structural variation detection using optical mapping technic, to provide comprehensive and accurate information of the MHC region and apply them into disease causal mutation’s fine-mapping.

5

SAMMENDRAG

Major histocompatibility complex (MHC) er et af de mest gen-tætte områder i det menneskelige genom. Mange lidelser - herunder primære immundefekter, autoimmune sygdomme, infektioner, cancer. og anden sindslidelse - er blevet fundet forbundet med denne region. Men på grund af en høj grad af polymorfi i MHC, påvisning af varianter i denne region, og diagnostiske HLA typebestemmelse, mangler der stadig en sammenhængende, standardiseret, omkostningseffektiv og høj dækning protokol med god klinisk kvalitet og pålidelighed. Og på grund af markant bindingsuligevægt af MHC- regionen, er gener / mutationer involveret i disse sygdomme endnu ikke, med meget få undtagelser, blevet identificeret. De nu populære næste-generation sekventering (NGS) teknologier omfatter massivt parallel enkelt molekyle sekventering, og kan generere milliarder af baser udfra amplificerede enkelt DNA-molekyler inden for få dage, og er lovende for at opnå omkostningseffektive og høj-hastighed og høj-nøjagtighed genotypebestemmelse. Baseret på NGS platformen, ved at bruge fint designede bioinformatik metoder, vi systematisk udviklet nye værktøjer / metoder, herunder high throughput HLA genotypebestemmelse metode - RCHSBT, målrettet sekventering baseret SNV detektion samt HLA-gen maskinskrivning og stort strukturelt variation detektion bruger optisk kortlægning teknik , at yde en omfattende og præcis information om MHC- regionen og anvende dem i sygdom kausal mutation bøde-mapping.

6

Table of Contents 1 Introduction ...... 10 1.1 Major histocompatibility complex ...... 10 1.1.1 Function of MHC region gene in human immune system ...... 11 1.1.2 MHC related diseases ...... 11 1.2 High throughput sequencing technology ...... 18 1.2.1 Roche 454 ...... 19 1.2.2 Illunina Solexa/Hiseq/Miseq ...... 19 1.2.3 Life Technologies Ion Torrent PGM/Proton ...... 20 1.2.3 Complete Genomics ...... 20 1.2.4 PacBio ...... 20 1.2.6 Oxford Nanopore ...... 20 1.2.7 Others ...... 21 2 Aims ...... 22 2.1 Current problem for MHC region related study ...... 22 2.2 Specific aims of this project ...... 22 3 Results ...... 23 3.1 Novel method for MHC region study ...... 23 3.1.1 Paper I: High throughput HLA typing Method ...... 23 3.1.2 Paper II: Integrated tools to study whole MHC region ...... 25 3.2 Construction Chinese MHC database ...... 27 3.3 Fine-mapping disease related mutation ...... 30 3.4 Paper III: Novel method and technology in genome study ...... 30 4 Conclusions ...... 32 5 Future Perspective ...... 33 6 References ...... 34 7 Appendix ...... 42

7

ABBREVIATIONS

MHC Major Histocompatibility Complex HLA Human Leukocyte Antigen BAC Bacterial Artificial APC Antigen Presenting Cell IST Immunosuppression UNOS United Network for Organ Sharing DGF Delayed Graft Function PID Primary Immune Deficiency IgAD IgA Deficiency RA Rheumatoid Arthritis SLE Systemic Lupus Erythematous GD Grave’s Disease AIDS Acquired Immune Deficiency Syndrome HIV Human Immunodeficiency Virus HBV Hepatitis B Virus HCV Hepatitis C Virus HGP Human Genome Project NGS Next Generation Sequencing SBS Sequence By Synthesis CG Complete Genomics DNB DNA Nano Ball SBH Sequencing By Hybridization SMRT Single Molecule Real Time sequencing ZMW Zero-Mode Waveguide SBT Sequencing Based Typing SNP Single Nuclear Polymorphism LD Linkage Disequilibrium SV Structural Variation GWAS Genome Wide Association Study HSCT Hematopoietic Stem Cell Transplantation PE Paired End CNV Copy Number Variation KIR Killer cell Immunoglobulin-like Receptor TRA T cell Receptor Alpha

8

TRB T cell Receptor Beta IGH Immunoglobulin Heavy IGL Immunoglobulin Light MAM Maximum Allowable Mismatch

9

1 Introduction

1.1 Major histocompatibility complex The major histocompatibility complex (MHC) is a group of genes that found on the cell surface, and they control a major part of the immune system in all vertebrates and help the immune system recognize foreign materials. In humans, the MHC is also called as human leukocyte antigen (HLA). The genomic region encoding MHC locates on the short arm of on human and its essential for the human immune system [1]. It is also the densest gene region of the human genome, with over 200 identified loci and more than 120 expressed genes. Most of the genes in this region play key roles in human immune system. The typical MHC region is a 3.6M region, which further divided into three major regions according to the genes’ function ( Fig. 1). The class I and class II genes play essential role in antigen presentation and main related with adaptive immunity, including the well-known HLA-A, -B, -C in classADVANCES I region IN and IMMUNOLOGY DRB and DQB gene in class II region. Gene in class III region mainly related with innate immunity [2].

Chromosome 6 q Centromere p

Telomere Telomere

Regions 6p21.31

Class II Class III Class I

HLA class II! region loci TAPBP PSMB9 (LMP2) TAP1 PSMB8 (LMP1) TAP2 DPB2 DPB1 DPA1 DOA DMA DMB DOB DQB1 DQA1 DRB1 DRB2 DRB3 DRA

HLA class III! region loci C21B , a- C4B C4A BF C2 HSPA1B HSPA1A HSPA1L LTB TNF- LTA P450

HLA class I ! region loci MICB MICA HLA-B HLA-C HLA-E HLA-A HLA-G HLA-F HFE

Figure 1. Location and Organization of the HLA Complex on Chromosome 6. FigureThe complex 1. Location is conventionally and organization divided into of threegenes regions: for MHC I, II, [2]and. III. Each region contains numerous loci (genes), only some of which are shown. Of the class I and II genes, only the expressed genes are depicted. Class III genes are not related to class I and class II genes structurally or functionally. BF denotes complement factor B; C2 complement component 2; C21B cytochrome P-450, subfamily XXI; C4A and C4B complement components 4A and 4B, respectively; HFE hemochromatosis; HSP heat-shock ; DueLMP to large its multifunctionalimportance, protease; MHC wasLTA and the LTB first lymphotoxins multi-megabase A and B, respectively; region of MICA the andhuman MICB genmajor-histocompatibility-ome to be completelycomplex class sequenced I chain genes [1]A and. In B, 2004,respectively; the P450sequences cytochrome of P-450;the two PSMB8 common and 9 proteasome and disease b 8 and associated 9, respectively; MHC TAP1 and TAP2 transporter associated with antigen processing 1 and 2, respectively; TAPBP TAP-binding protein (tapasin); TNF-a tumor haplotypenecrosis factors, the a; andPGF HSPA1A, (HLA HSPA1B,-A*03:01 and HSPA1L-B*07:02 heat-shock-C*07:02 protein- DRB1*15:011A A-type, heat-shock-DQB1*06:02) protein 1A B-type, and and COX heat-shock (HLAprotein-A*01:01 1A–like, respectively.-B*08:01-C*07:02-DRB1*03:01-DQB1:02:01, also named as 8.1 haplotype) in Caucasian have been provided by Sanger sequencing the bacterial artificial chromosome (BAC) usingcytoplasm. cell lines The frombags are consanguineous then pinched off individual as endocyt- [3]. originatingSubsequen fromtly, sequencethe pathogen of anotherare also diseaserouted into ic vesicles, and they fuse with primary lysosomes, the processing pathways. With the exception of jawed which are loaded with an assortment of proteolytic vertebrates, no organisms appear to make a distinction enzymes, to form endosomes.5 Within the endo- between peptides derived from their own (self) pro-10 somes, the ingested are degraded, first to teins and those derived from foreign (nonself) pro- peptides and then, in some vesicles, into amino acids teins. Jawed vertebrates, by contrast, use the peptides for reuse. derived from foreign (usually microbial) proteins to This mode of protein processing must have evolved mark infected cells for destruction. at an early stage in the history of living forms. Later, In an uninfected cell, “housekeeping” proteasomes the jawed vertebrates found a new use for it by en- continuously churn out self peptides, some of which listing it in the defense of the body against patho- are picked up by molecules called “transporters asso- gens. Normally, the proteins that undergo recycling ciated with antigen processing,” or TAPs, encoded by are the organism’s own, but in infected cells, proteins the TAP1 and TAP2 genes.5 The TAP1 and TAP2

Volume 343 Number 10 · 703

The New England Journal of Medicine Downloaded from nejm.org at NIH on May 14, 2011. For personal use only. No other uses without permission. Copyright © 2000 Massachusetts Medical Society. All rights reserved. associated MHC haplotype QBL (HLA-A*26:01-B*18:01-C*05:01-DRB1*03:01-DQB1*02:01) was provided by employing the same strategy [4]. To the end of 2008, sequence of eight different MHC haplotypes [5] were provided by the MHC haplotype project in total and the haplotype sequence of PGF has been introduced into the reference sequence of human genome (since NCBI35), and becomes the largest single haplotype sequence of the human genome so far. Pairwise comparison of these eight MHC haplotypes has revealed dramatic variants, showing the great polymorphism of this region and the great diverse of differences between different haplotypes.

1.1.1 Function of MHC region gene in human immune system As illustrated in the review paper [2], a man dies after he transplants a heart; a woman is crippled because of her rheumatoid arthritis; a child becomes a coma after she gets cerebral malaria; another child dies in an infection as he has immunodeficiency; an elderly man gets serious hepatic cirrhosis because of iron overload. Though these five different clinical situations are different, it all involves in human MHC system. The importance of MHC to human immune system main due to two classic gene clusters within the region: HLA class I and class II genes. The class I molecules are expressed on the surface of most human cells although the expression level were different between each tissue. While constitutive expression of class II molecules are normally only expressed in a specific group of immune cells called antigen presenting cells (APC) that includes dendritic cells, macrophages, B cells, activated T cells and thymic epithelial cells [6]. One of the most important functions of class I and class II MHC molecules are to present short, pathogen-derived peptides to T cells to initiate the human adaptive immune responses. Furthermore, class III region comprises important gene involved in innate immunity which including the complement genes and genes encoding for some of the inflammatory cytokines [1]. The key to a healthy immune system is its remarkable ability to distinguish between “self” and “non-self”. Among these process, MHC molecules play a fundamental role. Processing endogenous protein and loading the generated short peptides onto class I molecules are happening all the way in body’s cells to mark as “self” and avoid tricking immune reaction main by cytotoxic CD8+ T cell. Meanwhile, the processed exogenous proteins, presented as small peptides, will load onto class II molecules to mark as “non-self”, hence tricking the immune reaction by CD4+ T cell and B cell [7].

1.1.2 MHC related diseases Due to the importance of genes in this region with immune system, many disorders were discovered to relate with MHC (Fig. 2). According to their mechanism, they can be classified into following groups:

11

GG14CH14-Trowsdale ARI 26 July 2013 14:56

Examples of MHC- Gene annotations disease associations Strand Examples of the role of MHC genes in immune function

Psoriasis ZFP57 PSORS1 (HLA-C*0602) HLA-F

Myasthenia gravis HLA-G HLA-C (*0701) α2 α1 HIV/AIDS HLA-A HLA-B, HLA-C HCG9 MHC class I ZNRD1 molecules Ankylosing -AS1 ZNRD1 HLA-A, HLA-B, spondylitis RNF39 PPP1R11 HLA-B27 TRIM31 HLA-C β2 TRIM40 α3 microglobulin Malaria TRIM10 TRIM15 Virus HLA-B53 TRIM26 HCG17 Abacavir drug HCG18 TRIM39 hypersensitivity RPP21 HLA-B (*5701) CD8+ Immune regulation cytotoxic Dengue shock syndrome NFKBIL1, FKBPL T cell MICB HLA-E Sialidosis GNL1 PRR3 ABCF1 Immunoregulatory CD8 TAP NEU1 PPP1R10 MRPS18B functions DHX16 ATAT1 Complement NRM Peptide- MDC1 de"ciencies FLOT1 TUBB loading IER3 Peptide C2, C4A, C4B HCG20 In!ammation Golgi antigen complex apparatus Congenital adrenal ABCF1, AIF1, IER3, LST1, TIGD1L DDR1 LTA, LTB, NCR3, TNF ER hyperplasia GTF2H4 CYP21A2 SFTA2 VARS2 HCG21 DPCR1 Ehlers-Danlos MUC21 syndrome HCG22 TNXB CDSN PSORS1C2 CCHCR1 PSORS1C1 Sarcoidosis POU5F1 TCF19 PSORS1C3 Proteins requiring BTNL2 HCG27 chaperone Type 1 diabetes HLA-C Stress response DRB1*04–DQA1*0301– HSPA1A, HSPA1B, HSPA1L, Antigen processing HLA-B DQB1*0302, MICA, MICB and presentation DRB1*03–DQA1*0501– MICA DQB1*0201 HCP5 Hsp70 MICB BAT1 Rheumatoid arthritis ATP6V1G2 NFKBIL1 LTB TNF LTA LST1 HLA-DRB1 (*0401), NCR3 AIF1 BAT3 APOM Complement system C1C2 C4 HLA-DQA1 (*0301) BAT4 CSNK2B BAT5 LY6G6B C2, BF, C4A, C4B Systemic lupus LY6G6C LY6G6F Classical pathway Bacteria DDAH2 LY6G6D erythematosus CLIC1 MSH5 triggered by Ag-Ab complex Parasites VARS HLA-DRB1 (*0301) LSM2 HSPA1A HSPA1L HSPA1B Multiple sclerosis NEU1 SLC44A4 EHMT2 C2 Phagosome HLA-DRB1 (*0501) CFB Leukocyte maturation RDBP Endosomal/ Pemphigus vulgaris DOM3Z STK19 C4A LY6G5B, LY6G5C, LY6G6D, lysosomal HLA-DQB1 (*0301) C4B LY6G6E, LY6G6C, DDAH2 compartment CLIP CYP21A2 Leprosy TNXB ATF6B HLA-DRB1, DQA1 PRRT1 PPT2 AGPAT1 EGFL8 AGER RNF5 PBX2 by Universitat Zurich- Hauptbibliothek Irchel on 09/05/13. For personal use only. Narcolepsy NOTCH4 Immunoglobulin HLA-DQB1 (*0602) superfamily DM Ulcerative colitis C6orf10 AGER, BTNL2 CD4 Annu. Rev. Genom. Human Genet. 2013.14:301-323. Downloaded from www.annualreviews.org HLA-DRB1 (*1101) HCG23 BTNL2 CD4+ HLA-DRA Graves’ disease HLA-DRB9 T helper HLA-DRB1 (*0301), cell HLA-DRB5 1 1 HLA-DQA1 (*0501) HLA-DRB6 MHC class II β α HLA-DRB1 Celiac disease molecules HLA-DQA1 HLA-DQA1 (*0501), HLA-DQB1 HLA-DR, HLA-DQ, β2 α2 HLA-DQB1 (*0201) HLA-DP HLA-DQB3 HLA-DQA2 HLA-DQB2 Selective IgA de"ciency HLA-DOB TAP2 HLA-DQB1 (*0201) PSMB8 Bare lymphocyte TAP1 PSMB9 syndrome HLA-DMB Antigen processing TAP1, TAP2, TABP HLA-DMA BRD2 TAP1, TAP2, PSMB8, HLA-DOA HLA-DM, TAPBP Juvenile HLA-DPA1 myoclonic epilepsy HLA-DPA2 HLA-DPB1 COL11A2P HLA-DPB2 BRD2 HLA-DPA3 HCG24

Figure 2: Partial of diseases that found to be related with MHC region [8]. 308 Trowsdale Knight

1.1.2.1 Rejection after transplantation·

1.1.2.1.1 Graft and Patient Survival Despite significant advances in immunosuppression (IST) and post-operative patient management

12 after transplantation, the 5-year survival rates are still significantly low in lung transplant recipients (51%), and better rates are observed in heart (73% for male, 69% female); liver ~78% and Kidney ~89% (http://www.srtr.org/annual_reports). Graft survival records from the united network for organ sharing (UNOS) registry assessing the fraction of graft failures attributed to immunological or non-immunological factors shows that 10-year graft survival rates for living HLA-mismatched donors to be 66%. Analyses of the overall 10-year graft failure rate for cadaver donors (60%) showed that18% of failures were due to HLA factors, as observed through mismatched living donor grafts [9].

1.1.2.1.2 Delayed Graft Function Delayed Graft Function (DGF) is a commonly observed adverse post-transplant that impacts graft survival [10]. Risk factors for DGF include non-immunologic factors such as donor-age, cold ischemia time, and recipient ancestry. Immunologic factors have been shown to be associated with DGF including amino-acid differences at over 60 polymorphic sites of HLA-A in a prospective study involve ~700 kidney transplant recipients showed that combinations of mismatches at crucial HLA-A sites were related with DGF [11]. In European ancestry recipients, amino acid difference at position 62, 95 or 163, which known to play important function as located in the antigen recognition site were found to relate with increased DGF risk. Additionally, decreased risk for DGF was associated with mismatches at HLA-A sites (149, 184, 193 or 246), indicating that evolutionary features of HLA-A variants separating HLA-A families and lineages among D-R pairs may be intertwined with the strength of allogenicity underpinning DGF. While these findings indicate that amino acid variants at important function sites in the antigen recognition site of the HLA-A molecule have great relationship with DGF, there may also be additional common and rare variants with similar protective and deleterious influences in MHC region.

1.1.2.2 Primary immune deficiency Primary immune deficiency (PID), a kind of disorder that part of the body’s immune system is missing or impaired. Most primary immune deficiencies are genetic disorders. Some of selected diseases include:

1.1.2.2.1 IgA deficiency Selective IgA deficiency (IgAD) is the most common form of primary immunodeficiency disorders in the Caucasian population. It is defined as serum IgA levels <= 0.07 g/l with normal serum levels of IgG and IgM in individuals >= 4 years old [12]. It affects approximately 1 in 600 of Caucasians [13]. However, its prevalence varies among different ethnical populations ranging from 1:143 in Saudi Arabia [14] to 1:23,255 in Japan [15]. Though intense researches, the genetic basis of IgAD remains elusive. However, it is well known that IgAD is strongly associated with MHC, especially the 8.1 haplotype [16].

1.1.2.2.2 Complement deficiency Complement deficiency is a disorder caused by one of proteins involving the complement system is absent or does not work properly. There are a group of genes in the MHC class III region relate with human complement system. The most common complement deficiency was in C2. Other deficiencies were also found in C1q, C1r, C1s, C4 and also have been reported to link with immune disease [17].

1.1.2.2.3 MHC Class I deficiency

13

Major histocompatibility complex class I deficiency is a disorder with severe expression reduce of MHC class I genes in the cell surface. It is a rare disease with high biological heterogeneity and has a variable clinical phenotype. Mutations in gene TAP1 or TAP2 which coding for the transporter associated with antigen presentation in the MHC region have been found in patients with this disorder [18].

1.1.2.3 Autoimmune condition So far, autoimmune diseases comprise the largest number of diseases that found to be related with MHC region (Table 1). It characterize as the body’s immune responses directly against one’s own tissues. This may be restricted to specific organs or involve a particular tissue in different places. Some represented autoimmune diseases are given below.

1.1.2.3.1 Psoriasis Psoriasis is a common autoimmune and hyper proliferative skin disease, characterized by thick, silvery-scale patches. The prevalence of psoriasis is 0.1-0.3% in Asian and 2-5% in Caucasian [19]. Psoriasis has a strong hereditary component and the major determinant that defined before is PSORS1, which located in the MHC region. It could explains 35%-50% of psoriasis’ heritability [20].

1.1.2.3.2 Rheumatoid arthritis Rheumatoid arthritis (RA) is a chronic inflammatory disorder, mainly affects the joints of feet and hands. The prevalence of RA is approximately 1% worldwide [21]. Genetic studies reflect this disease is influenced by both genetic and environment factors [22]. The strongest genetic risk factor for RA is HLA-DRB1 gene that located in MHC region. Risk alleles that identified in MHC region include DR4, DR1, and B27.

1.1.2.3.3 System lupus erythematous Systemic lupus erythematous (SLE) is a chronic inflammatory and classical autoimmune disease in which the body’s immune system attacks its healthy tissue by mistake. It characterized by a diverse array of autoantibodies, immune complex deposition and complement activation, and influenced by both genetic and environmental factors. Most of the cases were women and frequently starting at childbearing age. The MHC class II gene DRB1 was reported to relate with SLE (Table 1).

1.1.2.3.3 Graves disease Grave’s disease (GD) is an autoimmune disease and characterized by lymphocytic infiltration of the thyroid gland and the production of thyrotropin receptor autoantibodies (TRAbs), leading to produce excess thyroxin, hence to hyperthyroidism. The incidence of GD shows a geographical variation as well as the seasonal trends, suggesting that environmental triggers such as infections might be an important contributor to the GD pathogenesis [23]. However, twin and sibling studies have reported an increased risk ratio for GD [24], indicating genetic plays an important rule on the development of this disease. The most genetic associated loci is the HLA-DRB1 (DRB1*03,DRB1*11,DRB1*01) gene that is locating in MHC region.

14

Table 1 Autoimmune disease related loci within MHC

Disease Related variants Ankylosing spondylitis B*27, DPA1*01:03, A*02:01, B*13:01 Autoimmune hepatitis rs2187668, DRB1*03:01/DRB1*0301, DRB1*0401 Autoimmune hepatitis type-1 DRB1*0301, DQA1*0101/DRB1*1301 Behçet's disease B*51/rs116799036, rs12525170, rs114854070, C*1602 Celiac disease DQA1*05,DQB1*02/rs2647044 Crohn disease A*33, B*45, C*3, A26, DRB1*03, DRB1*08/DRB1*07, DQB1*0303 Graves disease DRB1*03, DRB1*11, DRB1*01/DRB1*09:01, DRB1*12:02, DPB1*05:01, DQB1*03:02, DRB1*15:01, DRB1*16:02 / DRB1*0301, DRB1*0802, DRB1*1403, DRB1*0701, DRB1*1302 IgA nephropathy DRB1*03, DRB1*10/DRB1*04/rs660895, rs1794275, rs2523946 Multiple sclerosis DRB1*15:01, DRB1*03:01, DRB1*13:03, DRB1*04:04, DRB1*04:01, DRB1*14:01, A*02:01, rs9277489, rs2516489, B*37 Myasthenia Gravis DQA1*01, DQB1*0502 /DRB1*04, DRB1*03, DQB1*02, DQB1*03 / DQB1∗05:02 Narcolepsy DRB1*1501, DQB1*0602/rs2858884/DQB1*06:01, DQB1*03:02, rs3117242, DPB1*05:01 Primary biliary cirrhosis DRB1*0801, DRB1*11, DRB1*13/DRB1*0405, DRB1*0803/DRB1*08, DRB1*11, DRB1*14 /rs2856683, rs9275312, rs9275390, rs7775228, rs2395148, rs9277535, rs3806156, rs9357152, rs3135363, rs9277565, rs2281389, rs660895, rs9501626 Psoriasis C∗06:02, C∗12:03, B(AA 67 ,9), A(AA 95), DQA1(AA 53) Rheumatic heart disease B*51, C*04, DRB1*01, DQB1*08/DRB1*07/B*13, DRB1*01, DRB1*04, DRB1*07, DQB1*02, DRB1*13 Rheumatoid arthritis DRB1(AA 9), B(AA 9)/DRB1(AA 11,71,74), B(AA 9), DPB1(AA 9) Systemic lupus rs3131379, rs1270942/rs9271366, rs9275328/ DRB1*15:01, erythematous DRB1*13:02, DRB1*14:03 Systemic sclerosis rs6457617/rs9275224, rs6457617, rs9275245, rs3130573 /- DRB1*15:02, DRB5*01:02 Ulcerative colitis rs9268480, rs660895/rs2395185/DRB1(AA 11)/DRB1*13, DRB1*08 Vitiligo rs11966200, rs9468925/rs7758128/A*33:01, B*44:03, DRB1*07:01 / rs9468925

1.1.2.4 Infection disease Infection disease is one of the leading reasons for human morbidity and mortality, especially for children [25]. Genes involved in the immune response locate within regions that experience the most selective pressure. It is generally assumed that resistance to infection exerts selection pressure fuels the generation of MHC variation and, since its discovery, MHC has stood out as the leading candidates for infection disease susceptibility. However, currently, only several infection diseases have been clearly identified to be associated with this region (Table 2). Some of the selected infection diseases were given below.

15

Table 2 Infection disease related loci within MHC

Disease Relavent variants SARS A*0201 HIV-1 control AA_B_97 HIV-TB B*4006 HCV - sustained response to therapy B*04 Hepatitis C virus-associated hypertrophic cardiomyopathy DPB1*0901 Mycosis fungoides DQB1* 03 Malaria DRB1*04 Leprosy DRB1*15 Allergic bronchopulmonary aspergillosis DRB1*1501, DRB1*1503 Melioidosis DRB1*1602 AIDS rs2395029 Dengue shock syndrome rs3132468 Hepatitis B vaccine response rs3135363(C) Kaposi's sarcoma rs6902982(G) HCV-induced liver cirrhosis rs910049(A) Hepatitis B virus-related hepatocellular carcinoma rs9275319(G) HPV rs9357152(G)

1.1.2.4.1 Acquired immune deficiency syndrome Acquired immune deficiency syndrome (AIDS) is the final as well as the most serious stage after infected with human immunodeficiency virus (HIV). The HIV infection and HLA genes’ information of MHC region are one of the most convincing correlations of human infection and offer the opportunity to develop novel interventions and therapies. Gao et al. [26] reported specific HLA-B*35-Px subtypes as being responsible for the correlation between HLA-B*35 and rapid progression to AIDS, and they also have confirmed the known protective effect of the HLA-B*27 and B*57 subtypes against progression to AIDS.

1.1.2.4.2 Hepatitis virus infection Chronic hepatitis virus infection is an important healthy issue, especially in Asian region like China that cause huge public health concern. Hepatitis B and C virus infections are estimated to account for 70% of the global burden of liver disease [27]. The clinical outcomes after infection were different, from clearance of infection to hepatocellular carcinoma. Many researches proved that genetic information play important role in determining the clinical phenotypes. Most of these researches focused on the relationship between HLA genes of MHC region and infection and HLA class II heterozygote advantage has been demonstrated for clearance of hepatitis B virus (HBV) infection [28].

1.1.2.5 Cancer Human immune system and HLA genes in MHC region plays important role in the development of cancer but the detailed mechanism still poorly understood. Cells of cancer tissue could express some genes that the normal pairs do not, and the short peptides derives from these specific expressed genes will load to MHC class I molecules. These peptides may sever as the target for the immune system. One of the first cancer that reported to relate with MHC genes was Hodgkin’s lymphoma [29]. Many tumors could modulate MHC class I genes’ expression to escape the immune system’ recognition and attack. Cancers that have been showed to related with MHC

16 region so far were given below.

Table 3: Cancer related loci within MHC

Disease Related variants Abdominal aortic aneurysm DQA1*0102 Acute lymphoblastic leukemia DQB1*0201, DQB1*0501, DQB1*0301/DRB1*15/rs2227956, rs1043618, rs1061581 Basal cell carcinoma DRB1*07/DRB1*01 Breast cancer DRB1*1301/A*30,A*31, A*24 /DQA1*0301,DRB1*1303, DQA1*0505, DRB1*1301/DQB1* 02 Cervical cancer DRB1*1501/rs4282438/DRB1*13,DRB1*3,DRB1*09,DRB1*1201 Chronic lymphocytic leukemia A*02:01,DRB4*01:01 Diffuse large B cell Lymphoma DRB1*01,C*03 Esophageal squamous cell rs35399661, rs3763338, rs2844695 carcinoma Follicular lymphoma DQB1*05,DQB1*06/rs10484561, rs6457327/DPB1*03:01,rs2647012 /DRB1(AA 13)/rs17203612, rs3130437 Hepatocellular carcinoma DRB1*07, DRB1*12/ DQB1*0502, DQB1*0602/rs2596542, rs9275319 Hodgkin lymphoma rs6903608, rs2281389, DQA1*02:01 Large B-cell lymphoma DRB1*01,C*03 Lung adenocarcinoma DQA1*03 Lung cancer A*0201, A*2601, B*1518, B*3802, DRB1*0401, DRB1*0402, DRB1*1201, DRB1*1302/rs1052486, rs3117582 Nasopharyngeal carcinoma rs2517713, rs2975042, rs29232, rs3129055 , rs9258122/A*31, A*33, A*19, DRB1*01, DQB1*05, DQB1*02/ B∗46, A∗24/rs3869062 Ovarian cancer A*02 Papillary thyroid carcinoma B*51:01 Testicular germ cell tumors DRB1*0405, DQB1*0401/ B*41

1.1.2.6 Neurological disorder Immune system abnormal during brain development can cause serious effects on neuronal function and there are more and more evidences show that immune dysregulation is associated with neurodevelopmental disorders [30]. Other neurological disorders, including Alzheimer’s disease [31], Parkinson diseases [32], Schizophrenia [33] and Autism spectrum disorder (ASD) [30] were reported to related with MHC region, which indicated the importance of immune system with these disorders.

1.1.2.7 Others Remarkably, MHC is also found to be associated with some other conditions, such as smoking behavior [34] and drag sensitivity which includes abacavir induced hypersensitivity with HLA- B*57:01 [35] and carbamazepine sensitivity with B*15:02 and A*31:01 [36]. It has also been reported that MHC is related with mate choice [37].

17

1.2 High throughput sequencing technology DNA sequencing is a process of reading and determining the precise order of nucleotides in a DNA molecule. Sanger sequencing method has dominated the sequencing filed for decades since it first developed in 1977 and has paved the road for our understanding of the human genome as well as many other species. Sequencing technology has exploded since the completion of the Human Genome Project (HGP). The rapid development of sequencing methods, especially the arises of next generation sequencing (NGS) methods (Fig. 3) which enable to produce thousands or millions of sequences parallel has greatly accelerated the medical and biological research and discovery. The most advantage of NGS method is its high-throughput, thus lead to produce an enormous volume of data cheaply.

DNA sequencing with Real-time DNA Life Technologies SOLiD in 2008; chain-terminating sequencing using Life Technologies SOLiD 5500xl in 2010; inhibitors (1977) detection of Life Technologies Ion Torrent PGM pyrophosphate release in 314/316/318 chip in 2011; 1996 Life Technologies Ion Proton in 2012;

1950 1960 1970 1980 1990 2000 2010 2013

Helicos (2008) The structure of DNA sequencing by The Polony sequencing (2005) DNA was chemical degradation PacBio RS system (2010) established in 1953 (1977) Intelligent BioSystems (2006) Oxford Nanopore The first semi-automated Technologies (2012) DNA sequencing machine in 1986 454 GS 20 in 2005; 454 GS FLX in 2007; 454 GS FLX Titanium in 2009; Massively parallel 454 GS FLX+ in 2011; signature sequencing (MPSS) (1992) Illumina GA lanuch in 2006 Illumina GA II in 2007; Illumina GAIIx in 2009; Illumina Hiseq2000 in 2010; Illumina Miseq in 2011; Illumina Hiseq2500 in 2012;

Figure 3: Development of sequencing methods

Currently, the most popular NGS technology are Illumina’s Hiseq, Roche’s 454, Life’s Ion Torrent/Proton, PacBio’s SMRT sequencing and Complete Genome’s CG. Detailed review of the mechanism and comparison of these technologies have been published before [38-43]. Along with the emergence of massively parallel sequencing technology, the cost of sequencing a single human genome is dramatically decreased (Fig. 4) and there has been a drive to improve. The newest sequencers, named as ‘third generation’ or ‘single molecular’ sequencing method [44], are able to sequencing DNA directly. Although the throughputs are not comparable to current NGS methods, they have great potential and advantage in clinical usage currently, and may dominate the sequence filed in the future after persistent technical improvement.

18

Figure 4. Cost for sequencing one human genome [45].

1.2.1 Roche 454 Roche 454 DNA sequencer was the first next-generation sequencing method that has been commercialized (Fig. 3). The method was developed by 454 life Sciences and was acquired by Roche in 2007. The technology is derived from the technological convergence of pyrosequencing and emulsion PCR [46]. DNA was first sheared into small fragments and ligated with adapter. Then the adapter-ligated NDA fragments were fixed to beads in a water-in-oil emulsion and amplified by PCR. Finally, these DNA-bound beads were placed to a fiber chip and placed into the system for sequencing.

1.2.2 Illunina Solexa/Hiseq/Miseq This sequencing technology is based on sequencing by synthesis (SBS) and was first released in 2006 by Solexa [47]. It employs a modified dNTPs that contains a terminator to blocks further synthesis that similar to the Sanger sequencing. The whole sequence process constitutes by three major steps: 1) Sample preparation. DNA is sheared to an appropriate size, i.e., 500bp (according to the sequencing strategy). Then, the ends of fragmented DNA are polished and ligated with unique adapters. 2) Cluster Generation by Bridge PCR. The constructed DNA library is loaded to the flow cell. Using a specific designed process, DNA fragment ligate with the oligonucleotides of the flow cell surface and amplified. 3) Sequencing by synthesis. Each sequencing cycle consists of loading a single fluorescent nucleotide and imaging by a high-resolution camera. In 2007, Illumina acquired Solexa and have provided a series of next-generation sequencing machines based on the same theory which including Genome Analyzer IIx, Hiseq, Miseq and so on. As the development of this technology, the output of raw sequencing data increased from 1G (GA) at the first begin to near 600G per run (Hiseq2000) and read length increased from single-end 35bp to current pair-end 300bp (Miseq).

19

1.2.3 Life Technologies Ion Torrent PGM/Proton Ion Torrent semiconductor sequencing [48] also uses sequencing by synthesis technology, but has moved away from optics altogether. Library preparation involves coating beads in individual library fragments and each bead is then deposited in its own well of a semiconductor plate. The wells are flooded sequentially with each of the four DNA base pairs. If one is incorporated into the growing strand, the hydrogen ions released causes a change in pH and is recorded by the semiconductor chip at the bottom of each well. Homopolymer runs are detected as larger spikes in pH change. Ion Torrent runs are faster due to the absent of optics and lack of modified bases which requiring chemical alteration. Currently, Ion Torrent provides two different sequencing systems: Proton and PGM. The Ion Proton system enables the highest throughput with up to 10 Gb reads in 2-4 hours with read length up to 200bp. While for Ion PGM, although the throughput is only ~2G, the read length could be as long as 400bp.

1.2.3 Complete Genomics By brought diverse technologies together, Complete Genomics (CG) created a comprehensive platform [49-50] for large-scale human genomics studies. Like other NGS method, the whole sequencing process can also be divide into three major process: 1) Library Construction. Besides shearing DNA into an appreciated size, CG inserts four adaptors into each DNA fragment during the library construction. 2) DNA library amplification and loading to patterned surface. Unlike other sequencing methods which DNA amplification is performed on surfaces or in emulsions, CG’s amplification process performed in solution and produces amplified DNA clusters which named as DNA nano-balls (DNBs). These produced DNBs are loaded on the patterned surface. 3) Sequencing by hybridization (SBH). Pools of probes labeled with four different dyes were hybridized and ligated to the template to produce high accuracy reads.

1.2.4 PacBio The PacBio sequencing system provided by Pacific Biosciences is one of current next-generation sequencing technologies which uses a single molecule real time sequencing (SMRT) method [51]. This method is also SBS technology that based on real-time imaging of fluorescently marked nucleotides when synthesizing along the DNA template molecules. The DNA sequencing is performed on a chip that contains many zero-mode waveguide (ZMW). Inside each ZMW, a single active DNA polymerase with a single molecule DNA template is immobilized to the bottom through which light can penetrate and create a visualization chamber that allows monitoring of the activity of the DNA polymerase at a single molecule level. The signal from a phospho-linked nucleotide incorporated by the DNA polymerase is detected as the DNA synthesis proceeds which results in the DNA sequencing in real time. As a result, the read length is not uniform but has an approximately lognormal distribution. Meanwhile, sequencing error could be reduced after the process of circular consensus sequencing and assembly.

1.2.6 Oxford Nanopore The basic idea of Nanopore based sequencing method is threading the DNA molecule through a small hole and measuring the physical property which is related with each nucleotide that passed the hole. It has great potential to become a direct, fast and inexpensive sequencing technology.

20

Oxford Nanopore was founded in 2005. It translate the academic nanopore research into a commercial, electronics-based sensing technology [52] which could be used into analysis of single molecules, including DNA, RNA and proteins.

1.2.7 Others Besides these commercial sequencing methods showed above, there were some other high throughput sequencing methods that appeared and used before. Such as Applied Biosystems SOLiD system [53] which using 2-based encoding and ligation sequencing rather than SBS and Helios Biosciences’ Helicos Genetic Analysis System [54], which was the first commercial NGS method to use the principle of single molecule fluorescent sequencing. However, these technologies were gradually withdrawn from the sequencing market because of technical or commercial problems.

21

2 Aims

2.1 Current problem for MHC region related study 1) Choosing a reliable and user-friendly genotyping method is one of the most important issues for researches or studies which mainly rely on HLA typing information. Sanger sequencing, distinguished for its long read length and accuracy, has dominated high-quality sequencing for decades since it adapted as the gold standard of genotyping. However, Sanger sequencing based typing (SBT) is labor intensive and low throughput, hindering it being used as a routine method, especially in large-scale genotyping projects. 2) Despite the impressive cumulative list of MHC associations of different disorders during recent decades [8], there has been a surprising lacking of progress in identifying causal genes and variants behind these disease associations. This is likely to be due to the complex allelic and genetic structure of this region with an extended linkage disequilibrium (LD) and a high degree of polymorphisms [55]. 3) Effective imputation methods and large-scale reference panels are available to determine the high polymorphism HLA genes in silico and fine-map diseases associations within classical HLA genes [56–58]. However, adequately sized reference panels are still not available for many ethnic groups. Moreover, current available reference panels lack MHC-wide ascertainment of variation, and may thus not allow fine mapping of the causal mutation or even miss parts of disease associations. In addition, it has been shown that MHC haplotypes, polymorphisms and recombination rates are highly divergent among individuals and populations [59]. 4) Structural variation (SV) have been proven to be an important source of human genetic diversity and disease susceptibility [60-63]. differences arising from SVs occur on a significantly higher order (>100 fold) than point mutations [64-65], and data from the 1000 Genomes Project show population-specific patterns of SV prevalence [66]. The MHC is one of the most polymorphism regions on genome, there exists abundant SVs between different MHC haplotypes [5], [67]. However, due to its complexity, current methods could hardly reveal this important information.

2.2 Specific aims of this project Take the advantage of next generation sequencing methods together with advanced bioinformatics analysis to:

1) Develop novel, accurate and cost-effective HLA typing method for routinely usage, especially in large-scale genotype projects. 2) Develop novel tools that could efficiently obtain information of the whole MHC region which could be used to solve the problem that hard to identify the disease causal mutation due to the high polymorphism and linkage dis-equilibrium in genome wide association study (GWAS) for complex disease study in MHC region. 3) Construct a MHC reference panel database that contains comprehensive variants information, including SNPs, Indels and HLA typing, for Han Chinese population, which could serve as a basis for studies on a variety of MHC associated disorders. 4) Adopt new technology in revealing SVs of MHC region as well as in whole genome.

22

3 Results

3.1 Novel method for MHC region study

3.1.1 Paper I: High throughput HLA typing Method Sanger based PCR-SBT is the golden standard for genotyping but has been underused because of its low-throughput, labor intensive and high cost of sequencing. NGS methods provide high throughput and inexpensive sequencing, holding great promise for applying to PCR-SBT. However, the short read length is a major challenge. HLA gene typing is one of the most important applications of genotyping as high-resolution donor- recipient HLA matching is essential for the success of unrelated donor hematopoietic stem cell transplantation (HSCT). However the extremely polymorphic and complex of this region poses a great challenge for obtaining high accurate and resolution result. Given this situation, we developed a novel method RCHSBT (Fig. 5), a NGS based PCR-SBT assays that takes the advantage of massively parallel high-throughput and paired-end sequencing, which provides accurate HLA gene typing result using assembled haploid sequences. Utilizing this method, amplicons with length as long as Sanger sequencing reads can be readily genotyped.

Figure 5. Outline of RCHSBT procedures.

In order to develop a high throughput, high accurate and low cost HLA typing methods, we adapted steps as described below: 1) First, each amplicon was amplified with Index PCR. Then, amplicons from different sample were pooled together and send for NGS sequencing. By using the index information within each sequenced read, it enables us to group the sequenced reads into each sample’ corresponding amplicon.

23

Figure 6. Amplicon information of each target gene. (A) For HLA-A,-B and -C, exon 1 to exon 7 were amplified by four reactions. For DQB1, exon 2 and exon 3 were amplified together. For DRB1, Only part of exon 2 was amplified. Q stands for DQB1, R stands for DRB1; (B) Agarose gel showing amplicons from PCR. The amplicons ranged between 250 bp and 1,000 bp in length. Amplicon Q2 and Q3 were loaded in the same lane.

2) After align the sequenced reads to corresponding reference, we adopted a Maximum Allowable Mismatches (MAM) cutoff that used for filtering out the aligned low-quality reads to get accurate variants of each amplicon (Table 4). 3) Detected heterozygous variants were phased into haplotype by using reads’ pair-end information. 4) Re-construction of each amplicon’s two haplotype sequences. Sequenced reads were classified into two groups according to previous phased heterozygous markers and assembled into consensus sequences.

24

Table 4. Variants calling accuracy statistic

Sanger True Positive (TP) NO. False Positive (FP) False Negative Total Stage Gene variants NO. (%) NO.(%) (FN) NO.(%) variants NO. Raw A 4100 4086 (99.51) 20 (0.49) 0 (0.00) 4106 Filtered A 4100 4092 (99.71) 12 (0.29) 0 (0.00) 4104 Phasing A 4100 4090 (99.68) 10 (0.24) 3 (0.07) 4103 Raw B 4350 4513(97.73)/4339 (98.75)* 55 (1.19)/55 (1.25)* 50 (1.08)/0 (0.00)* 4618/4394* Filtered B 4350 4471 (97.75)/4350 (100.00)* 0 (0.00)/0 (0.00)* 103 (2.25)/0 (0.00)* 4574/4350* Phasing B 4350 4475(97.84)/4350(100%)* 0 (0.00)/0 (0.00)* 99 (2.16)/0 (0.00)* 4574/4350* Raw C 2822 2786 (73.43) 1008 (26.57) 0 (0.00) 3794 Filtered C 2822 2809 (99.43) 5 (0.18) 11 (0.39) 2825 Phasing C 2822 2819 (99.79) 5 (0.18) 1 (0.04) 2825 Raw DRB1 2427 2405 (98.28) 42 (1.72) 0 (0.00) 2447 Filtered DRB1 2427 2406 (98.45) 38 (1.55) 0 (0.00) 2444 Phasing DRB1 2427 2406 (98.36) 40 (1.64) 0 (0.00) 2446 Raw DQB1 3973 3960 (98.98) 41 (1.02) 0 (0.00) 4001 Filtered DQB1 3973 3972 (99.60) 16 (0.40) 0 (0.00) 3988 Phasing DQB1 3973 3973 (100.00) 0 (0.00) 0 (0.00) 3973 Variants calling accuracy of most frequently studied HLA exons of HLA-A,-B,-C,DQB1 and DRB1 by RCHSBT was evaluated before MAM filtering(raw), after MAM filtering(filtered) and after haploid assembly(phasing).

5) HLA gene typing by using haplotype sequence information.

All in all, the key to provide accurate HLA gene typing result is by taking the advantage of advanced bioinformatics analysis method to construct each genotyped site’s two-haplotype sequences using the pair-end information from NGS sequencing data. It fully takes the advantage of high throughput NGS sequencing technology which enable ultra-high sequence depth and high polymorphic of HLA genes to phase the haplotype sequence.

3.1.2 Paper II: Integrated tools to study whole MHC region Although previous studies have broaden the database of variations in MHC regions, detecting the variations in human MHC region remains a challenge due to the relatively high cost for whole genome sequencing or PCR-based Sanger sequencing. High throughput NGS technologies have decreased the cost for genome sequencing dramatically. But, sequencing a genome with high coverage (~30-fold, about 100 Gb in total) would still costs several thousands dollars. In addition to the cost for sequencing, analyzing and storage of these vast amounts sequencing data would place a substantial burden on a research center’s bioinformatics infrastructure. For researchers who only interesting in MHC region or HLA gene’s typing information only, targeted sequencing stands out as a promising approach. Target capture sequencing mainly used to enrich and sequence the interested region, such as exons of genes or the whole exome of entire human genome [68-69]. This strategy could be a cost- effective solution for analyzing MHC region as well. However, It was thought to be difficult to enrichment of large and high variable genomic region, as there exists ubiquitous repetitive sequence and diverse between different haplotypes. In 2011, Pröll et al [70] used microarray capture together with 454 sequencing to analyze the 3.5 million base pair MHC region of an acute myeloid leukemia patient who underwent unrelated HLA-matched allogeneic stem cell transplantation. It proved the feasibility of capture-based method in studying genomic region as complex as MHC. However, the low capture efficiency hinders it from routine usage.

25

Given this situation, taking the usage of the eight different MHC haplotype sequences in the hg19, together with NimbleGen, we sophisticated design probes that could efficiently capture the whole MHC region (Fig. 7).

Figure 7. Variant and probe density of MHC. The figure in middle indicates the variant density along MHC. In the figure on the bottom, a short black vertical line stands for a probe. Additionally, we align all the probes to the eight haplotypes of MHC and find all the regions with higher depth have more probes than other regions.

Based on the captured data, not only we could provide comprehensive variants information of the covered region, but also, can provide the typing result of the 26 most polymorphism HLA genes that recorded in the IMGT/HLA database. These enable researchers to study the relationship between genetic variants with relevant traits [8] at different levels and fine-map the disease causal variant (Fig. 8).

26

LETTERS

Table 1 Effect estimates for the five amino acids associated with risk of rheumatoid arthritis Unadjusted allele HLA-DRB1 amino acid at position Multivariate frequency 11 13 71 74 OR 95% CI Controls Cases Classical HLA-DRB1 alleles Val His Lys Ala 4.44 4.02–4.91 0.106 0.316 *04:01 Val His or Phe Arg Ala 4.22 3.75–4.75 0.056 0.141 *04:08, *04:05, *04:04, *10:01 Leu Phe Arg Ala 2.17 1.94–2.42 0.109 0.143 *01:02, *01:01 Pro Arg Arg Ala 2.04 1.59–2.62 0.013 0.012 *16:01 Val His Arg Glu 1.65 1.24–2.19 0.010 0.009 *04:03, *04:07 Asp Phe Arg Glu 1.65 1.29–2.10 0.011 0.013 *09:01 Val His Glu Ala 1.43 1.04–1.96 0.011 0.006 *04:02 Ser Ser Lys Ala 1.04 0.76–1.41 0.012 0.006 *13:03 Pro Arg Ala Ala 1.00 Reference 0.142 0.092 *15:01, *15:02 Gly Tyr Arg Gln 0.91 0.80–1.03 0.133 0.064 *07:01 Ser Ser or Gly Arg Ala 0.88 0.77–1.00 0.103 0.049 *11:01, *11:04, *12:01 Ser Ser Arg Glu 0.84 0.67–1.05 0.025 0.012 *14:01 Leu Phe Glu Ala 0.73 0.42–1.27 0.004 0.002 *01:03 Ser Gly Arg Leu 0.71 0.57–0.89 0.028 0.013 *08:01, *08:04 Ser Ser Lys Arg 0.63 0.54–0.73 0.128 0.083 *03:01 Ser Ser Glu Ala 0.59 0.51–0.68 0.112 0.041 *11:02, *11:03, *13:01, *13:02

HLA-B amino acid at position 9 Classical HLA-B allele Asp 2.12 1.89–2.38 0.118 0.130 *08 His, Tyr 1.00 Reference 0.882 0.870 *07, *13, *14, *15, *18, *27, *35, *37, *38, *39, *40, *41, *44, *45, *47, *49, *50, *51, *52, *53, *55, *56, *57, *58, *73 HLA-DPB1 amino acid at position 9 Classical HLA-DPB1 alleles Phe 1.40 1.31–1.50 0.728 0.799 *02:01, *02:02, *04:01, *04:02, *05:01, *16:01, *19:01, *23:01 His, Tyr 1.00 Reference 0.272 0.201 *01:01, *03:01, *06:01, *09:01, *10:01, *11:01, *13:01, *14:01, *15:01, *17:01, *20:01 Estimated effects for haplotypes of HLA-DRB1, HLA-B and HLA-DPB1. Classical alleles of HLA-DRB1 are grouped based on the amino acid residues present at positions 11 (or 13), 71 and 74 within DRB1. The classical shared epitope alleles are shown in bold. For each haplotype, the multivariate effect is given as an odds ratio (OR), taking the most frequent haplotype (ProArgAlaAla) in the control samples as the reference (that is, giving that haplotype an OR of 1). All effects are conditional on Asp9 in HLA-B and Phe9 in HLA-DPB1. Unadjusted haplotype frequencies are given for cases and controls. HLA-DRB1 haplotypes in aggregate explain 9.7% of the phenotypic variance of rheu- matoid arthritis. The multivariate effect sizes, allele frequencies and classical alleles corresponding to Asp9 in HLA-B and Phe9 in HLA-DPB1 are also listed. 95% CI, 95% confidence interval. of the classical HLA-B*08 allele (P > 0.68). Similar to positions 11, responds to Phe9 in HLA-DPB1 (OR = 1.40 relative to His9 or Tyr9;

© All rights reserved. 2012 Inc. Nature America, 71 and 74 in HLA-DRB1, position 9 in HLA-B is also located in the Table 1, Fig. 3 and Supplementary Table 4). The Phe9 effect was peptide-binding groove (Fig. 4). Many of the previously described significantly stronger than that of any two- or four-digit HLA-DPB1 associations across the MHC, including markers in the TNF region, classical allele; Phe9 is in LD with and is indistinguishable from the are in LD with Asp9 (ref. 10). Val8 allele in HLA-DPB1. Amino acid position 9 is within the binding As previously observed associations of HLA-B*08 to autoimmune groove of HLA-DP (Fig. 4). diseases, including to rheumatoid arthritis, have been attributed We observed no residual signals across the MHC after conditioning specifically to the long ancestral 8.1 haplotype, which contains on the effects of HLA-DRB1, HLA-B*08 Asp9 in HLA-B and Phe9 in HLA-B*08 on the HLA-DRB1*03 background9,11, we tested whether HLA-DPB1 (P > 3 × 10−6; Fig. 1d). We also did not observe any evi- the HLA-B*08-Asp9 effect is common to all HLA-DRB1 backgrounds. dence of epistatic interactions between known risk loci13,20,21 and any Because HLA-B*08 and HLA-DRB1*03 are not in perfect LD and of the HLA alleles described here (P > 0.0003; Supplementary Note). are both seen independent of the 8.1 hap- lotype, we were able to apply a conditional haplotype analysis to show that HLA-B*08- Asp9 increases disease risk roughly twofold HLA-DR HLA-B HLA-DP regardless of the HLA-DRB1 background 9 9 (Fig. 5). Therefore, this risk effect is not 11 restricted to the 8.1 haplotype. Risk alleles 13 for HLA-B and HLA-DRB1 contribute risk additively (on a log-odds scale) even though HLA-DR HLA-DP they are in strong (but incomplete) LD. 74 71 Conditioning on the effects of HLA-DRB1 Figure 4 Three-dimensional ribbon models for the HLA-DR, HLA-B and HLA-DP proteins. These and HLA-B, we observed theFigure most signifi 8. Three- structures-dimensional are based ribbonon Protein models Data Bank for entries the 3pdo, HLA 2bvp-DR, and HLA 3lqz,- Brespectively, and HLA with-DP a direct proteins that related cant association at HLA-DPB1with in the seropositive class II view Rheumatoidof the peptide-binding arthr groove.itis [56] Key .amino acid positions identified by the association analysis −20 25 HLA region (P < 10 ; Fig. 1 c), which cor- are highlighted. This figure was prepared using UCSF Chimera .

4 3.2 Construction Chinese MHC databaseADVANCE ONLINE PUBLICATION NATURE GENETICS It has been shown that MHC haplotypes, polymorphisms and recombination rates are highly divergent among individuals and populations [59], [71]. However, our knowledge of sequence features and diseases/traits associations in MHC region was mainly based on European population. To better understand the genomic feature and to fine mapping disease associated loci in MHC region, constructing a reference database based on each studied population is essential. Using our developed MHC capture array [72], we set out to deep sequence the entire MHC region which close to 5 Mb, from upstream of HLA-A to downstream of HLA-DP, in 9,906 normal Chinese donors in order to provide a population wide database (Table 5), which including SNP (Fig. 9), Indel, HLA gene type (Fig. 10) and MHC haplotype, that will serve as a basis for studies on a variety of MHC associated disorders.

Table 5 Basic data statistics. There are 9,906 samples across MHC 5M region. For example the number 1.35 in the formula “1.46±0.38” represents the mean value while the 0.38 represents the standard deviation. The average depth of MHC is about 55X and the coverage of MHC is about 94.67%. It’s adequate for us to do a credible result.

Capture sequencing data #of individuals 9906 Raw bases(G) 1.46±0.38 Mapped bases(G) 1.16±0.27 Mapped bases on target(G) 0.63±0.14 Capture specificity (%) 55±6.14 Average sequence depth(fold) 55.15±12.15 Fraction of target covered >= 1X (%) 94.67±0.58 Fraction of gene covered >= 1X (%) 97.98±0.12 Fraction of target covered >= 5X (%) 90.22±1.77 Fraction of gene covered >= 5X (%) 94.78±1.00 Fraction of target covered >= 10X (%) 84.51±2.66 Fraction of gene covered >= 10X (%) 90.52±5.88

27 a

b 140

120

100

80 synonymous

SNP number 60 nonsynonymous 40

20

0 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Minor allele frequency

Figure 9. Heterozygosity and functional catalog of common SNPs in the MHC region. (a) The fraction of variants and their contribution to heterozygosity at different allele frequencies. 27% variants are relatively low frequent (MAF<0.1), but 85% of heterozygous sites were due to common variants (MAF>0.05). (b) The number of synonymous SNPs and nonsynonmous SNPs as the increase of minor allele frequency. The nsSNPs have greater diversity among individuals than sSNPs, especially in low frequency variants.

28 a.

b.

c.

d.

e.

Figure 10. Allelic frequencies at five classic HLA sites. Allele frequency for HLA-A, HLA-B, HLA-C, HLA- DRB1, HLA-DQB1 in Chinese population were given in the figure. Only the top 20 alleles at each locus were showed in diagram. The common alleles (Frequency >5%) were showed in crimson pillars, the low- frequency alleles were showed in light blue pillars and the rare alleles were showed in green pillars.

29

3.3 Fine-mapping disease related mutation One of the most important usages of the constructed database is fine mapping of disease causing mutations in the MHC region. For example, rs2281388 (located 5.1 kb downstream of HLA-DPB1) were previously reported to be associated with Graves’ disease in the Han Chinese population [73] but were in insufficient LD (r2 <0.8) with currently known nsSNP variants. However, using our database, rs2281388 shows a strong LD (r2=0.91) with the nearby nsSNP rs11551421, which alters a Valine to a Methionine in the fourth exon (HLA-DPB1:NM_002121:exon4:c.G700A:p.V215M) of HLA-DPB1. Rs2281388 was an almost perfect predictor of the DPB1*0501 allele (tagging LD=0.96), suggesting that nsSNP rs11551421 in the DPB1*0501 allele is a potential or even causal functional variant underlying Graves’ disease (Fig. 11).

r2 0.8 − 1 1.2 r2 0.6 − 0.8 HLA−DPA1 SLC39A7 r2 0.4 − 0.6 HLA−DPB2 RXRB 2 r 0.2 − 0.4 HLA−DPB1 COL11A2 2 r 0 − 0.2 rs2281388 ●● 1.0 ● ● ●●● ●●●● ●● ● ●● rs11551421 ● ● ● ● ● ● ● ●

0.8 ● ● ● ● ● ● ● ●●●●● ●●● ●●● ●●●●●●●●● ●●●●●●●●●● ●

LD ● 0.6 ● ● ● ● ●● ● ● ●●●● ● ●● ● ● ● ● ● ● ●●●● ● ●●●● ● ● ● ●●●●●●●● ● ● ● ● ● ●● ●● ●● ●●●●● ● ●● ● ● ● ● ● ● ●●● ●● ● ● ● ●● ● ● ●●● ● ●●● ●● 0.4 ●● ● ● ● ● ●● ●● ●● ● ● ● ●● ●●●●●●● ● ● ●● ● ● ●●●●●● ● ● ●●●●●● ● ● ● ●●● ●● ●●●● ● ●●●●●● ● ●● ●●●●●●●●● ● ●●●●●● ● ● ●●●●●●●● ● ●●● ● ●●●●●●● ● ● ●● ●● ● ●●●● ●● ● ● ● ●●● ●● ● ● ●● ● ●● ●●●● ● ● ●●● ● ● ●● ●● 0.2 ●●●● ● ● ●●●●●●●●●●●●●●●●●●●● ●●●● ●●●● ● ● ●●●●● ● ●●● ●●● ● ● ●●●●●● ● ●●● ● ● ●● ● ●● ● ● ●● ●● ●●●● ●● ●●●● ● ● ● ●●●●●●●●●● ●● ● ● ● ● ●● ● ● ●●● ●●●●● ● ● ●● ● ● ● ●● ●● ● ●●●●●● ● ● ●●●● ●●●● ●●●● ● ●●●●●●●●●●●●●●●●●●●●● ●● ● ●●●●●●● ● ●●●●● ●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●● ●●●●● ●●●● ●● ●● ● ● ● ●● ●● ●● ● ● ● ●● ● ●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●● ●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●● ●●● ● ●● ●● ● ● ● ●●●●● ● ●●●●●●● ● ●●●●● ●● ● ● ●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●● ● ● ● ●●●●●●●● ●●●●●●●●●● ●● ●●● ●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0.0

33.1M 33.2M 33.3M 33.4M

Position Figure 11. Fine map disease causal variants using constructed database. The SNP rs2281388 was reported to be related with Grave disease in a previous study, based on our constructed databased, this snp were in high LD (r2>0.8) with a functional non-synonymous SNP in HLA-DPB1.

3.4 Paper III: Novel method and technology in genome study As the development of NGS sequencing technology, it’s much easier and faster to obtain information of a single [46-47], [53-54], [74] or population’s genome [66], [75-76] information than before [77-78]. The well-developed bioinformatics [79–85] could provide each genome’s variants accurately. However, due to the disadvantage of all currently NGS technology – short read length, most of the analyses were focus on the small variants including SNP, Indel as well as copy number variation (CNV) which main takes the usage of depth information. It has little power to detect and analysis the large SVs of human genome, especially for complex region like MHC, T cell receptor alpha/beta (TRA/TRB), Killer cell immunoglobulin-like receptor (KIR), and immunoglobulin heavy/light locus (IGH/IGL) region. Optical mapping and new genome mapping technologies based on nicking enzymes provide low resolution but long range genomic information and has been successfully used for assisting the genome de novo assembly [86] and for detecting large-scale structural variants and rearrangements

30

[67] that cannot be detected using the paired end information provide by current NGS sequencing methods. Thus, taking the advantage of a high throughput optical map technique – Irys platform that based on Nano-channel, we provided the comprehensive large SVs information in the human genome. Moreover, it also gave out valuable information for the most complex genomic regions (Fig. 12).

Figure 12. Consensus genome maps compared to hg19 in the MHC region [87]. The green bars represent the hg19 in silico motif map; the blue bars represent consensus genome maps. Large SVs can be seen in the RCCX, HLA-D and HLA-A regions. The Cox and PGF genome maps are shown below for the HLA-A region. HLA: human leukocyte antigen; RCCX: RP-C4-CYP21-TNX module.

31

4 Conclusions

The demand of developing novel and advanced methods along with emerged new technology is the eternal motivation for genomic study as well as other fields. NGS methods, which is well known for its high throughput, and hence, low sequencing cost, has great potential utility in the sequence field ever since their emergence. However, the relatively shorter read length and higher sequence errors, compare to Sanger sequencing, are two major challenges before using them as a routine method.

Due to the importance of human MHC to immune system, it's one of the most intensive studied genomic regions. However, due to the limitation of previous methods, there was a lot of room for improvement. SBT was the golden standard for HLA gene typing. Sanger based sequencing method, well known for its long read length and accuracy, has dominated this fields for decades. The High throughput NGS sequencing technologies have great potential of usage in genome analyze, especially in genotyping which has been dominated by Sanger sequencing method. However, the two major challenges NGS method need to be solved before it could be applied in genotyping.

Based on sophisticate designed bioinformatics analysis method and making the usage of specific feature of HLA genes – extremely high polymorphism, we achieved nearly the same accuracy as Sanger sequencing based method for HLA genotyping. Situation was the same for the entail MHC and genome region analysis. Results of the studies showed above fully demonstrate the great advantage of new technology and advanced bioinformatics analysis method in genomic related analysis and applications.

32

5 Future Perspective

Being able to obtain accurate genotyping information of HLA gene, MHC as well as the entail genome was just the beginning. The ultimate goal for genomic research is to reveal the relationship between genotype and phenotype, then utilize this information in personalized medicine, including disease diagnosis, organ transplantation guidance, disease risk prediction, particularly, for complex diseases. To reach this aim, a lot of work still needed to do in the future which includes, but not limited to, novel technology development, new bioinformatics method conception and new biological mechanism discovery.

33

6 References

[1] The MHC sequencing consortium, “Complete sequence and gene map of a human major histocompatibility complex,” Nature, vol. 401, no. October, pp. 921–923, 1999.

[2] J. Klein, A, Sato, “Advances in Immunology, The HLA system,” N Engl J Med, vol. 343, pp. 702-709, September. 2000.

[3] C. A. Stewart, R. Horton, R. J. N. Allcock, J. L. Ashurst, A. M. Atrazhev, P. Coggill, I. Dunham, S. Forbes, K. Halls, J. M. M. Howson, S. J. Humphray, S. Hunt, A. J. Mungall, K. Osoegawa, S. Palmer, A. N. Roberts, J. Rogers, S. Sims, Y. Wang, L. G. Wilming, J. F. Elliott, P. J. de Jong, S. Sawcer, J. a Todd, J. Trowsdale, and S. Beck, “Complete MHC haplotype sequencing for common disease gene mapping.,” Genome Res., vol. 14, no. 6, pp. 1176–87, Jun. 2004.

[4] J. a Traherne, R. Horton, A. N. Roberts, M. M. Miretti, M. E. Hurles, C. A. Stewart, J. L. Ashurst, A. M. Atrazhev, P. Coggill, S. Palmer, J. Almeida, S. Sims, L. G. Wilming, J. Rogers, P. J. de Jong, M. Carrington, J. F. Elliott, S. Sawcer, J. a Todd, J. Trowsdale, and S. Beck, “Genetic analysis of completely sequenced disease-associated MHC haplotypes identifies shuffling of segments in recent human history.,” PLoS Genet., vol. 2, no. 1, p. e9, Jan. 2006.

[5] R. Horton, R. Gibson, P. Coggill, M. Miretti, R. J. Allcock, J. Almeida, S. Forbes, J. G. R. Gilbert, K. Halls, J. L. Harrow, E. Hart, K. Howe, D. K. Jackson, S. Palmer, A. N. Roberts, S. Sims, C. A. Stewart, J. a Traherne, S. Trevanion, L. Wilming, J. Rogers, P. J. de Jong, J. F. Elliott, S. Sawcer, J. a Todd, J. Trowsdale, and S. Beck, “Variation analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype Project.,” Immunogenetics, vol. 60, no. 1, pp. 1–18, Jan. 2008.

[6] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. a R. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. W. de Bakker, M. J. Daly, and P. C. Sham, “PLINK: a tool set for whole-genome association and population-based linkage analyses.,” Am. J. Hum. Genet., vol. 81, no. 3, pp. 559–75, Sep. 2007.

[7] M. Kronenberg and A. Rudensky, “Regulation of immunity by self-reactive T cells.,” Nature, vol. 435, no. 7042, pp. 598–604, Jun. 2005.

[8] J. Trowsdale and J. C. Knight, “Major Histocompatibility Complex Genomics and Human Disease,” 2013.

[9] Terasaki PI, Deduction of the fraction of immunologic and non-immunologic failure in cadaver donor transplants, Clin Transpl. 2003:449052.

[10] J. M. Nogueira, A. Haririan, S. C. Jacobs, M. R. Weir, H. A. Hurley, H. S. Al-qudah, M. Phelan, C. B. Drachenberg, S. T. Bartlett, M. Cooper, and J. M. Nogueira, “The Detrimental Effect of Poor Early Graft Function After Laparoscopic Live Donor Nephrectomy on Graft Outcomes,” no. 53, pp. 337–347, 2009.

34

[11] M. Kamoun, J. H. Holmes, A. K. Israni, J. D. Kearns, V. Teal, W. Peter, S. E. Rosas, M. M. Joffe, H. Li, and H. I. Feldman, “HLA-A amino acid polymorphism and delayed kidney allograft function,” vol. 105, no. 48, pp. 18883–18888, 2008.

[12] L. D. Notarangelo, A. Fischer, R. S. Geha, J.-L. Casanova, H. Chapel, M. E. Conley, C. Cunningham-Rundles, A. Etzioni, L. Hammartröm, S. Nonoyama, H. D. Ochs, J. Puck, C. Roifman, R. Seger, and J. Wedgwood, “Primary immunodeficiencies: 2009 update.,” J. Allergy Clin. Immunol., vol. 124, no. 6, pp. 1161–78, Dec. 2009.

[13] Q. Pan-Hammarström and L. Hammarström, “Antibody deficiency diseases.,” Eur. J. Immunol., vol. 38, no. 2, pp. 327–33, Feb. 2008.

[14] R. A. Al-attas and A. H. S. Rahi, “Primary Antibody Deficiency in Arabs : First Report from Eastern Saudi Arabia,” vol. 18, no. 5, pp. 1996–1999, 2000.

[15] Kanoh T, Mizumoto T, Yasuda N, Koya M, Ohno Y, Uchino H, Yoshimura K, Ohkubo Y, Yamaguchi H. Selective IgA deficiency in Japanese blood donors: frequency and statistical analysis. Vox Sang. 1986;50(2):81-6.

[16] R. C. Ferreira, Q. Pan-Hammarström, R. R. Graham, G. Fontán, A. T. Lee, W. Ortmann, N. Wang, E. Urcelay, M. Fernández-Arquero, C. Núñez, G. Jorgensen, B. R. Ludviksson, S. Koskinen, K. Haimila, L. Padyukov, P. K. Gregersen, L. Hammarström, and T. W. Behrens, “High-density SNP mapping of the HLA region identifies multiple independent susceptibility loci associated with selective IgA deficiency.,” PLoS Genet., vol. 8, no. 1, p. e1002476, Jan. 2012.

[17] C. Liu, S. Manzi, A. H. Kao, J. S. Navratil, M. J. Ruffing, and J. M. Ahearn, “Reticulocytes Bearing C4d as Biomarkers of Disease Activity for Systemic Lupus Erythematosus,” vol. 52, no. 10, pp. 3087–3099, 2005.

[18] S. D. G. A. D. O. La, H. T. M. E. Nc, J. T. Ale, W. L. G. R. Oss, V. C. E. R. U. N. D. Olo, and B. Bramstedt, “IMMUNODEFICIENCY REVIEW TAP deficiency syndrome,” 2000.

[19] R. Parisi, D. P. M. Symmons, C. E. M. Griffiths, and D. M. Ashcroft, “Global epidemiology of psoriasis: a systematic review of incidence and prevalence.,” J. Invest. Dermatol., vol. 133, no. 2, pp. 377–85, Mar. 2013.

[20] F. O. Nestle, D. H. Kaplan, J. Barker, “Mechanisms of disease, Psoriasis,” N Engl J Med, vol. 361, pp. 496–509, 2009.

[21] D. L. Scott, F. Wolfe, and T. W. J. Huizinga, “Rheumatoid arthritis.,” Lancet, vol. 376, no. 9746, pp. 1094–108, Sep. 2010.

[22] J. E. Oliver and A. J. Silman, “Risk factors for the development of rheumatoid arthritis,” no. 13, pp. 169–174, 2006.

[23] B. S. Prabhakar, R. S. Bahn, and T. J. Smith, “Current perspective on the pathogenesis of Graves’ disease and ophthalmopathy.,” Endocr. Rev., vol. 24, no. 6, pp. 802–35, Dec. 2003.

35

[24] E. M. Jacobson, A. Huber, and Y. Tomer, “The HLA gene complex in thyroid autoimmunity: from epidemiology to etiology.,” J. Autoimmun., vol. 30, no. 1–2, pp. 58–62, 2008.

[25] D. Burgner, S. E. Jamieson, and J. M. Blackwell, “Genetic susceptibility to infectious diseases : big is beautiful , but will bigger be even better ?,” vol. 6, no. October, pp. 653–663, 2006.

[26] S. Applications and I. C. Frederick, “Effect of a single amino acid change in MHC class I molecules on the rate of progression to AIDS,” vol. 344, no. 22, pp. 1668–1675, 2001.

[27] R. Singh, R. Kaul, A. Kaul, and K. Khan, “A comparative review of HLA associations with hepatitis B and C viral infections across global populations,” vol. 13, no. 12, pp. 1770–1787, 2007.

[28] M. R. Thursz, H. C. Thomas, B. M. Greenwood, A. V.S. Hill, “Heterozygote advantage for HLA class-II type in hepatitis B virus infection.” Nat. Genet, vol. 17, pp. 11-12, 1997

[29] R. Küppers, “The biology of Hodgkin’s lymphoma.,” Nat. Rev. Cancer, vol. 9, no. 1, pp. 15– 27, Jan. 2009.

[30] L. a Needleman and a K. McAllister, “The major histocompatibility complex and autism spectrum disorder.,” Dev. Neurobiol., vol. 72, no. 10, pp. 1288–301, Oct. 2012.

[31] F. Listì, G. Candore, C. R. Balistreri, M. P. Grimaldi, V. Orlando, S. Vasto, G. Colonna- Romano, D. Lio, F. Licastro, C. Franceschi, and C. Caruso, “Association between the HLA- A2 allele and Alzheimer disease.,” Rejuvenation Res., vol. 9, no. 1, pp. 99–101, Jan. 2006.

[32] Y. Guo, X. Deng, W. Zheng, H. Xu, Z. Song, H. Liang, J. Lei, X. Jiang, Z. Luo, and H. Deng, “HLA rs3129882 variant in Chinese Han patients with late-onset sporadic Parkinson disease.,” Neurosci. Lett., vol. 501, no. 3, pp. 185–7, Sep. 2011.

[33] M. Debnath, D. M. Cannon, and G. Venkatasubramanian, “Variation in the major histocompatibility complex [MHC] gene family in schizophrenia: associations and functional implications.,” Prog. Neuropsychopharmacol. Biol. Psychiatry, vol. 42, pp. 49–62, Apr. 2013.

[34] G. J. Arason, J. Kramer, C. Szalai, G. Fu, Y. Yang, E. K. Chung, B. Zhou, C. A. Blanchong, M. Lokki, S. Bo, G. Thorgeirsson, C. Y. Yu, and M. Kova, “Genetic basis of tobacco smoking : strong association of a specific major histocompatibility complex haplotype on chromosome 6 with smoking behavior,” vol. 1940, no. May, 2004.

[35] P. T. Illing, J. P. Vivian, N. L. Dudek, L. Kostenko, Z. Chen, M. Bharadwaj, J. J. Miles, L. Kjer-Nielsen, S. Gras, N. a Williamson, S. R. Burrows, A. W. Purcell, J. Rossjohn, and J. McCluskey, “Immune self-reactivity triggered by drug-modified HLA-peptide repertoire.,” Nature, vol. 486, no. 7404, pp. 554–8, Jun. 2012.

[36] J. J. Farrell, D. Kasperavi, D. Ph, M. Carrington, M. Molokhia, B. Ch, M. R. Johnson, D. Phil, E. L. Heinzen, N. Walley, M. Pandolfo, W. Pichler, B. K. Park, C. Depondt, S. M.

36

Sisodiya, and D. B. Goldstein, “HLA-A*3101 and Carbamazepine-Induced Hypersensitivity Reactions in Europeans,” 2011.

[37] C. Cao and P. Donnelly, “Is Mate Choice in Humans MHC-Dependent ?,” vol. 4, no. 9, pp. 1–5, 2008.

[38] M. L. Metzker, “Sequencing technologies — the next generation,” Nat. Rev. Genet., vol. 11, no. 1, pp. 31–46, 2009.

[39] E. R. Mardis, “Next-Generation DNA Sequencing Methods,” 2008.

[40] J. Shendure and H. Ji, “Next-generation DNA sequencing,” vol. 26, no. 10, pp. 1135–1145, 2008.

[41] K. V Voelkerding, S. A. Dames, and J. D. Durtschi, “Next-Generation Sequencing : From Basic Research to Diagnostics CONTENT : SUMMARY :,” vol. 658, pp. 641–658, 2009.

[42] N. J. Loman, R. V Misra, T. J. Dallman, C. Constantinidou, S. E. Gharbia, J. Wain, and M. J. Pallen, “Performance comparison of benchtop high-throughput sequencing platforms,” vol. 30, no. 5, 2012.

[43] M. A. Quail, M. Smith, P. Coupland, T. D. Otto, S. R. Harris, T. R. Connor, A. Bertoni, H. P. Swerdlow, and Y. Gu, “A tale of three next generation sequencing platforms : comparison of Ion Torrent , Pacific Biosciences and Illumina MiSeq sequencers,” BMC Genomics, vol. 13, no. 1, p. 1, 2012.

[44] B. Division, C. Biology, and S. Cruz, “Characterization of individual polynucleotide molecules using a membrane channel,” vol. 93, no. November, pp. 13770–13773, 1996.

[45] E. C. Hayden, "Technology: The $ 1,000 genome," Nature, vol. 507, pp. 294-295, March. 2014.

[46] D. A. Wheeler, M. Srinivasan, M. Egholm, Y. Shen, L. Chen, A. Mcguire, W. He, Y. Chen, V. Makhijani, G. T. Roth, X. Gomes, K. Tartaro, F. Niazi, C. L. Turcotte, G. P. Irzyk, J. R. Lupski, C. Chinault, X. Song, Y. Liu, Y. Yuan, L. Nazareth, X. Qin, D. M. Muzny, M. Margulies, G. M. Weinstock, R. A. Gibbs, and J. M. Rothberg, “The complete genome of an individual by massively parallel DNA sequencing,” vol. 452, no. April, pp. 872–877, 2008.

[47] S. Fig and T. G. Analyzer, “Accurate whole human genome sequencing using reversible terminator chemistry,” vol. 456, no. November, 2008.

[48] D. Rank, P. Baybayan, B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark, R. Dalal, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, P. Lundquist, C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy, R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener, D. Wu, A. Yang, D. Zaccarin, P. Zhao, F. Zhong, J. Korlach, and S. Turner, “Single Polymerase Molecules,” no. January, pp. 133–138, 2009.

37

[49] R. Drmanac, K. P. Pant, J. Baccash, A. P. Borcherding, A. Brownley, R. Cedeno, L. Chen, D. Chernikoff, A. Cheung, R. Chirita, B. Curson, J. C. Ebert, C. R. Hacker, R. Hartlage, B. Hauser, S. Huang, Y. Jiang, V. Karpinchyk, M. Koenig, C. Kong, T. Landers, C. Le, J. Liu, C. E. Mcbride, M. Morenzoni, R. E. Morey, K. Mutch, H. Perazich, K. Perry, B. A. Peters, J. Peterson, C. L. Pethiyagoda, K. Pothuraju, C. Richter, A. M. Rosenbaum, S. Roy, J. Shafto, U. Sharanhovich, K. W. Shannon, C. G. Sheppy, M. Sun, J. V Thakuria, A. Tran, D. Vu, A. W. Zaranek, X. Wu, D. G. Ballinger, G. M. Church, and C. A. Reid, “Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays Radoje Drmanac,” vol. 78, no. 2009, 2010.

[50] P. Carnevali, J. Baccash, A. L. Halpern, I. Nazarenko, G. B. Nilsen, K. P. Pant, J. C. Ebert, A. Brownley, M. Morenzoni, V. Karpinchyk, B. Martin, D. G. Ballinger, and R. Drmanac, “Computational Techniques for Human Genome Resequencing Using Mated Gapped Reads,” vol. 19, no. 3, pp. 279–292, 2012.

[51] M. J. Levene, J. Korlach, S. W. Turner, M. Foquet, H. G. Craighead, and W. W. Webb, “Zero-mode waveguides for single-molecule analysis at high concentrations.,” Science, vol. 299, no. 5607, pp. 682–6, Jan. 2003.

[52] D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs, X. Huang, S. B. Jovanovich, P. S. Krstic, S. Lindsay, X. S. Ling, C. H. Mastrangelo, A. Meller, J. S. Oliver, Y. V Pershin, J. M. Ramsey, R. Riehn, G. V Soni, V. Tabard-cossa, M. Wanunu, and M. Wiggin, “The potential and challenges of nanopore sequencing,” vol. 26, no. 10, pp. 1146–1153, 2008.

[53] J. Shendure, M. D. Wang, K. Zhang, R. D. Mitra, and G. M. Church, “Genome Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” vol. 1728, no. 2005, 2012.

[54] D. Pushkarev, N. F. Neff, and S. R. Quake, “Single-molecule sequencing of an individual human genome,” Nat. Biotechnol., vol. 27, no. 9, pp. 847–850, 2009.

[55] P. I. W. de Bakker and S. Raychaudhuri, “Interrogating the major histocompatibility complex with high-throughput genomics.,” Hum. Mol. Genet., vol. 21, no. R1, pp. R29–36, Oct. 2012.

[56] S. Raychaudhuri, C. Sandor, E. a Stahl, J. Freudenberg, H.-S. Lee, X. Jia, L. Alfredsson, L. Padyukov, L. Klareskog, J. Worthington, K. a Siminovitch, S.-C. Bae, R. M. Plenge, P. K. Gregersen, and P. I. W. de Bakker, “Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis.,” Nat. Genet., vol. 44, no. 3, pp. 291–6, Mar. 2012.

[57] N. a Patsopoulos, L. F. Barcellos, R. Q. Hintzen, C. Schaefer, C. M. van Duijn, J. a Noble, T. Raj, P.-A. Gourraud, B. E. Stranger, J. Oksenberg, T. Olsson, B. V Taylor, S. Sawcer, D. a Hafler, M. Carrington, P. L. De Jager, and P. I. W. de Bakker, “Fine-mapping the genetic association of the major histocompatibility complex in multiple sclerosis: HLA and non- HLA effects.,” PLoS Genet., vol. 9, no. 11, p. e1003926, Nov. 2013.

[58] Y. Sheng, X. Jin, J. Xu, J. Gao, X. Du, D. Duan, B. Li, J. Zhao, W. Zhan, H. Tang, X. Tang, Y. Li, H. Cheng, X. Zuo, J. Mei, F. Zhou, B. Liang, G. Chen, C. Shen, H. Cui, X. Zhang, C.

38

Zhang, W. Wang, X. Zheng, X. Fan, Z. Wang, F. Xiao, Y. Cui, Y. Li, J. Wang, and S. Yang, “new susceptibility loci for psoriasis,” Nat. Commun., vol. 5, pp. 1–6, 2014.

[59] P. I. W. de Bakker, G. McVean, P. C. Sabeti, M. M. Miretti, T. Green, J. Marchini, X. Ke, A. J. Monsuur, P. Whittaker, M. Delgado, J. Morrison, A. Richardson, E. C. Walsh, X. Gao, L. Galver, J. Hart, D. a Hafler, M. Pericak-Vance, J. a Todd, M. J. Daly, J. Trowsdale, C. Wijmenga, T. J. Vyse, S. Beck, S. S. Murray, M. Carrington, S. Gregory, P. Deloukas, and J. D. Rioux, “A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC.,” Nat. Genet., vol. 38, no. 10, pp. 1166–72, Oct. 2006.

[60] E. Tuzun, A. J. Sharp, J. A. Bailey, R. Kaul, V. A. Morrison, L. M. Pertz, E. Haugen, H. Hayden, D. Albertson, D. Pinkel, M. V Olson, and E. E. Eichler, “Fine-scale structural variation of the human genome,” vol. 37, no. 7, pp. 727–732, 2005.

[61] L. Feuk, A. R. Carson, and S. W. Scherer, “Structural variation in the human genome,” vol. 7, no. February, pp. 85–97, 2006.

[62] J. M. Kidd, G. M. Cooper, W. F. Donahue, H. S. Hayden, N. Sampas, T. Graves, N. Hansen, B. Teague, C. Alkan, F. Antonacci, E. Haugen, T. Zerr, N. A. Yamada, Z. Cheng, H. M. Ebling, N. Tusneem, R. David, P. Tsang, T. L. Newman, E. Tu, W. Gillett, K. A. Phelps, M. Weaver, D. Saranga, A. Brand, W. Tao, E. Gustafson, K. Mckernan, L. Chen, M. Malig, J. D. Smith, J. M. Korn, S. A. Mccarroll, D. A. Altshuler, D. A. Peiffer, M. Dorschner, J. Stamatoyannopoulos, D. Schwartz, D. A. Nickerson, J. C. Mullikin, R. K. Wilson, L. Bruhn, M. V Olson, R. Kaul, D. R. Smith, and E. E. Eichler, “Mapping and sequencing of structural variation from eight human genomes,” vol. 453, no. May, 2008.

[63] Y. Li, H. Zheng, R. Luo, H. Wu, H. Zhu, R. Li, H. Cao, B. Wu, S. Huang, H. Shao, H. Ma, F. Zhang, S. Feng, W. Zhang, H. Du, G. Tian, J. Li, X. Zhang, S. Li, L. Bolund, K. Kristiansen, A. J. De Smith, A. I. F. Blakemore, L. J. M. Coin, H. Yang, J. Wang, and J. Wang, “A n a ly s ly i s s Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly,” Nat. Biotechnol., vol. 29, no. 8, pp. 725–732, 2011.

[64] Z. Cheng, M. Ventura, X. She, P. Khaitovich, T. Graves, K. Osoegawa, M. Rocchi, E. E. Eichler, D. Church, P. Dejong, R. K. Wilson, and S. Pa, “ARTICLES A genome-wide comparison of recent chimpanzee and human segmental duplications,” vol. 437, no. September, 2005.

[65] J. R. Lupski, “Genomic rearrangements and sporadic disease,” vol. 39, no. July, pp. 43–47, 2007.

[66] The 1000 Genomes Project Consortium, “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, pp. 1061-1073, October, 2010.

[67] E. T. Lam, A. Hastie, C. Lin, D. Ehrlich, S. K. Das, M. D. Austin, P. Deshpande, H. Cao, N. Nagarajan, M. Xiao, and P.-Y. Kwok, “Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly.,” Nat. Biotechnol., vol. 30, no. 8, pp. 771–6, Aug. 2012.

39

[68] A. Gnirke, A. Melnikov, J. Maguire, P. Rogov, E. M. Leproust, W. Brockman, T. Fennell, G. Giannoukos, S. Fisher, C. Russ, S. Gabriel, D. B. Jaffe, E. S. Lander, and C. Nusbaum, “Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing,” vol. 27, no. 2, pp. 182–189, 2009.

[69] A. E. Shearer, A. P. Deluca, M. S. Hildebrand, K. R. Taylor, J. Gurrola, and S. Scherer, “Comprehensive genetic testing for hereditary hearing loss using massively parallel sequencing,” vol. 107, no. 49, pp. 21104–21109, 2010.

[70] M. A. Danzer, S. T. Stabentheiner, N. O. Niklas, C. H. Hackl, P. E. Hufnagl, C. H. Gu, H. A. Hauser, O. T. T. O. Krieger, and C. H. Gabriel, “Sequence Capture and Next Generation Resequencing of the MHC Region Highlights Potential Transplantation Determinants in HLA Identical Haematopoietic Stem Cell Transplantation,” no. May, pp. 201–210, 2011.

[71] P.-A. Gourraud, P. Khankhanian, N. Cereb, S. Y. Yang, M. Feolo, M. Maiers, J. D. Rioux, S. Hauser, and J. Oksenberg, “HLA diversity in the 1000 genomes dataset.,” PLoS One, vol. 9, no. 7, p. e97282, Jan. 2014.

[72] H. Cao, J. Wu, Y. Wang, H. Jiang, T. Zhang, X. Liu, Y. Xu, D. Liang, P. Gao, Y. Sun, B. Gifford, M. D’Ascenzo, X. Liu, L. C. a M. Tellier, F. Yang, X. Tong, D. Chen, J. Zheng, W. Li, T. Richmond, X. Xu, J. Wang, and Y. Li, “An Integrated Tool to Study MHC Region: Accurate SNV Detection and HLA Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing,” PLoS One, vol. 8, no. 7, p. e69388, Jan. 2013.

[73] X. Chu, C.-M. Pan, S.-X. Zhao, J. Liang, G.-Q. Gao, X.-M. Zhang, G.-Y. Yuan, C.-G. Li, L.- Q. Xue, M. Shen, W. Liu, F. Xie, S.-Y. Yang, H.-F. Wang, J.-Y. Shi, W.-W. Sun, W.-H. Du, C.-L. Zuo, J.-X. Shi, B.-L. Liu, C.-C. Guo, M. Zhan, Z.-H. Gu, X.-N. Zhang, F. Sun, Z.-Q. Wang, Z.-Y. Song, C.-Y. Zou, W.-H. Sun, T. Guo, H.-M. Cao, J.-H. Ma, B. Han, P. Li, H. Jiang, Q.-H. Huang, L. Liang, L.-B. Liu, G. Chen, Q. Su, Y.-D. Peng, J.-J. Zhao, G. Ning, Z. Chen, J.-L. Chen, S.-J. Chen, W. Huang, and H.-D. Song, “A genome-wide association study identifies two new risk loci for Graves’ disease.,” Nat. Genet., vol. 43, no. 9, pp. 897–901, Sep. 2011.

[74] J. Wang, W. Wang, R. Li, Y. Li, G. Tian, L. Goodman, W. Fan, J. Zhang, J. Li, J. Zhang, Y. Guo, B. Feng, H. Li, Y. Lu, X. Fang, H. Liang, Z. Du, D. Li, Y. Zhao, Y. Hu, Z. Yang, H. Zheng, I. Hellmann, M. Inouye, J. Pool, X. Yi, J. Zhao, J. Duan, Y. Zhou, J. Qin, L. Ma, G. Li, Z. Yang, G. Zhang, B. Yang, C. Yu, F. Liang, W. Li, S. Li, D. Li, P. Ni, J. Ruan, Q. Li, H. Zhu, D. Liu, Z. Lu, N. Li, G. Guo, J. Zhang, J. Ye, L. Fang, Q. Hao, Q. Chen, Y. Liang, Y. Su, a San, C. Ping, S. Yang, F. Chen, L. Li, K. Zhou, H. Zheng, Y. Ren, L. Yang, Y. Gao, G. Yang, Z. Li, X. Feng, K. Kristiansen, G. K.-S. Wong, R. Nielsen, R. Durbin, L. Bolund, X. Zhang, S. Li, H. Yang, and J. Wang, “The diploid genome sequence of an Asian individual.,” Nature, vol. 456, no. 7218, pp. 60–5, Nov. 2008.

[75] The Genome of the Netherlands Consortium, “Whole-genome sequence variation, population structure and demographic history of the Dutch population.,” Nat. Genet., Jun. 2014.

[76] J. Kaye, M. Hurles, H. Griffin, J. Grewal, M. Bobrow, N. Timpson, C. Smee, P. Bolton, R. Durbin, S. Dyke, D. Fitzpatrick, K. Kennedy, A. Kent, D. Muddyman, F. Muntoni, L. F.

40

Raymond, R. Semple, and T. Spector, “Managing clinically significant findings in research: the UK10K example.,” Eur. J. Hum. Genet., vol. 22, no. 9, pp. 1100–4, Sep. 2014.

[77] International Human Genome Sequencing Consortium, “Initial sequencing and analysis of the human genome,” Nature, vol. 409, no. February, 2001.

[78] International Human Genome Sequencing Consortium, “Finishing the euchromatic sequence of the human genome,” Nature, pp. 931–945, 2004.

[79] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program.,” Bioinformatics, vol. 24, no. 5, pp. 713–4, Mar. 2008.

[80] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. a DePristo, “The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.,” Genome Res., vol. 20, no. 9, pp. 1297–303, Sep. 2010.

[81] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform.,” Bioinformatics, vol. 25, no. 14, pp. 1754–60, Jul. 2009.

[82] R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang, “SNP detection for massively parallel whole-genome resequencing.,” Genome Res., vol. 19, no. 6, pp. 1124–32, Jun. 2009.

[83] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools.,” Bioinformatics, vol. 25, no. 16, pp. 2078–9, Aug. 2009.

[84] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.,” Bioinformatics, vol. 25, no. 21, pp. 2865–71, Nov. 2009.

[85] K. Chen, J. W. Wallis, M. D. McLellan, D. E. Larson, J. M. Kalicki, C. S. Pohl, S. D. McGrath, M. C. Wendl, Q. Zhang, D. P. Locke, X. Shi, R. S. Fulton, T. J. Ley, R. K. Wilson, L. Ding, and E. R. Mardis, “BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.,” Nat. Methods, vol. 6, no. 9, pp. 677–81, Sep. 2009.

[86] A. R. Hastie, L. Dong, A. Smith, J. Finklestein, E. T. Lam, N. Huo, H. Cao, P.-Y. Kwok, K. R. Deal, J. Dvorak, M.-C. Luo, Y. Gu, and M. Xiao, “Rapid genome mapping in nanochannel arrays for highly complete and accurate de novo sequence assembly of the complex Aegilops tauschii genome.,” PLoS One, vol. 8, no. 2, p. e55864, Jan. 2013.

[87] H. Cao, A. R. Hastie, D. Cao, E. T. Lam, Y. Sun, H. Huang, X. Liu, L. Lin, W. Andrews, S. Chan, S. Huang, X. Tong, M. Requa, T. Anantharaman, A. Krogh, H. Yang, H. Cao, and X. Xu, “Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology,” vol. 3, no. 1, pp. 1–11, 2014.

41

7 Appendix

I. Hongzhi Cao et, al. A Short-Read Multiplex Sequencing Method for Reliable, Cost-Effective and High-Throughput Genotyping in Large-Scale Studies. Hum Mutat. 2013 Dec; 34(12): 1715-20. Doi: 10.1002/humu.22439..

II. Hongzhi Cao et, al. An integrated Tool to study MHC Region: Accurate SNV Detection and HLA Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing. PLoS ONE, 2013. 8(7): e69388: doi:10.1371/journal.pone.0069388.

III. Hongzhi Cao et, al. Rapid detection of structural variation in a human genome using nanochannel- based genome mapping technology. GigaScience 2014, 3:34

42

Paper I:

A Short-Read Multiplex Sequencing Method for Reliable, Cost- Effective and High-Throughput Genotyping in Large-Scale Studies

By Hongzhi Cao1,2, Yu Wang1,3, Wei Zhang1, Xianghua Chai1, Xiandong Zhang1 ,Shiping Chen1, Fan Yang1, Caifen Zhang1, Yulai Guo1, Ying Liu1, Zhoubiao Tang1, Caifen Chen1, Yaxin Xue1, Hefu Zhen1, Yinyin Xu1, Bin Rao1, Tao Liu1, Meiru Zhao1, Wenwei Zhang1, Yingrui Li1, Xiuqing Zhang1, Laurent C. A. M. Tellier1,2, Anders Krogh2, Karsten Kristiansen2, Jun Wang1,2,4,5 and Jian Li1,2

1BGI-Shenzhen, Shenzhen, China; 2Department of Biology, University of Copenhagen, Copenhagen, Denmark; 3School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, China; 4King Abdulaziz University, Jeddah, Saudi Arabia; 5The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Publication details Published in Human mutation 35(2):270 (2014)

43

METHODS

OFFICIAL JOURNAL A Short-Read Multiplex Sequencing Method for Reliable, Cost-Effective and High-Throughput Genotyping in www.hgvs.org Large-Scale Studies

Hongzhi Cao,1,2 ∗ † Yu Wang,1,3 † Wei Zhang,1 † Xianghua Chai,1 † Xiandong Zhang,1 † Shiping Chen,1 Fan Yang,1 Caifen Zhang,1 Yulai Guo,1 Ying Liu,1 Zhoubiao Tang,1 Caifen Chen,1 Yaxin Xue,1 Hefu Zhen,1 Yinyin Xu,1 Bin Rao,1 Tao Liu,1 Meiru Zhao,1 Wenwei Zhang,1 Yingrui Li,1 Xiuqing Zhang,1 Laurent C. A. M. Tellier,1,2 Anders Krough,2 Karsten Kristiansen,2 Jun Wang,1,2,4,5 and Jian Li1,2 ∗ † 1BGI-Shenzhen, Shenzhen, China; 2Department of Biology, University of Copenhagen, Copenhagen, Denmark; 3School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, China; 4King Abdulaziz University, Jeddah, Saudi Arabia; 5The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark Communicated by Ian N. M. Day Received 17 April 2013; accepted revised manuscript 29 August 2013. Published online 6 September 2013 in Wiley Online Library (www.wiley.com/humanmutation). DOI: 10.1002/humu.22439

 ABSTRACT: Accurate genotyping is important for genetic Hum Mutat 00:1–6, 2013. C 2013 Wiley Periodicals, Inc. testing. Sanger sequencing-based typing is the gold stan- KEY WORDS: short-read next-generation sequencing; high- dard for genotyping, but it has been underused, due to throughput; genotyping; HLA its high cost and low throughput. In contrast, short-read sequencing provides inexpensive and high-throughput se- quencing, holding great promise for reaching the goal of cost-effective and high-throughput genotyping. However, the short-read length and the paucity of appropriate geno- Introduction typing methods, pose a major challenge. Here, we present Genetic testing plays an important role in physician decision RCHSBT—reliable, cost-effective and high-throughput making [Carrington and O’Brien, 2003; Napolitano et al., 2005], sequence based typing pipeline—which takes short se- and population genetic screening for common inherited single- quence reads as input, but uses a unique variant call- gene disorders is an economical and effective way to reduce the ing, haploid sequence assembling algorithm, can accurately incidence of disease [Bell et al., 2011; Castellani et al., 2009; Liao genotype with greater effective length per amplicon than et al., 2005]. One of the most common types of genetic testing even Sanger sequencing reads. The RCHSBT method was is genotyping. A variety of genotyping tests are available, includ- tested for the human MHC loci HLA-A, HLA-B, HLA-C, ing Restriction Fragment Length Polymorphism, PCR-sequence- HLA-DQB1,andHLA-DRB1, upon 96 samples using Il- specific oligonucleotides [Ng et al., 1993], PCR-sequence-specific lumina PE 150 reads. Amplicons as long as 950 bp were primers [Natio et al., 2001], reverse dot-blot [Gravitt et al., 1998], readily genotyped, achieving 100% typing concordance and PCR-sequencing based typing (PCR-SBT) [Bettinotti et al., between RCHSBT-called genotypes and genotypes pre- 1997]. Among these methods, PCR-SBT based on Sanger sequenc- viously called by Sanger sequence. Genotyping through- ing offers the highest resolution, and is the accepted genotyping put was increased over 10 times, and cost was reduced reference method [Danzer et al., 2007, 2008]. over five times, for RCHSBT as compared with Sanger Sanger sequencing, well known for its accuracy and long-read sequence genotyping. We thus demonstrate RCHSBT to length (800–1000bp), has dominated high-quality sequencing for be a genotyping method comparable to Sanger sequencing- about 30 years since its introduction as the gold standard of geno- based typing in quality, while being more cost-effective, typing. But high cost and low throughput has made it underused, and higher throughput. particularly for large-scale genotyping projects. Currently popu- lar small-size next-generation sequencing (NGS) technology, com- prising massively parallel single-molecule sequencing, can gener- ate billions of bases from amplified single DNA molecules within Additional Supporting Information may be found in the online version of this article. days, holding great promise for achieving cost-effective and high- †These authors contributed equally to this work. throughput genotyping. However, short-read length and error- ∗Correspondence to: Jian Li, BGI-Shenzhen. Main Building, Beishan Industrial Zone, prone reading, are significant weaknesses of the method [Metzker, Yantian District, Shenzhen 518083, China. E-mail: [email protected]; Hongzhi Cao, BGI- 2009; Voelkerding et al., 2009]. Shenzhen. Main Building, Beishan Industrial Zone, Yantian District, Shenzhen 518083, 454 sequencing—a sequencing technique developed by Roche China. E-mail: [email protected] Nimblegen with the advantage of longer reads—was the first NGS Contract grant sponsors: National High Technology Research and Development used for genotyping. Several groups have reported successful use of Program of China – 863 Program (2012AA02A201); The Guangdong Enterprise Key 454 sequencing for HLA genotyping [Bentley et al., 2009; Danzer Laboratory of Human Disease Genomics, ShenZhen Engineering Laboratory for Clinical et al., 2013; Gabriel et al., 2009; Galan et al., 2010; Holcomb Molecular Diagnostic et al., 2011; Lank et al., 2010]. Similar in strategy, they combine

C 2013 WILEY PERIODICALS, INC. multiplex identifier sequence (MIDs) tagged PCR with direct paral- (PE) sequencing, effective variant phasing, and haploid sequence as- lel deep sequencing of the pooled amplicons. Genotypes are called sembling pipeline. To rigorously assess its performance, we tested it by comparing sorted sequencing data to allele sequences in the gene for HLA-A (MIM #142800), HLA-B (MIM #142830), HLA-C (MIM database. Deep sequencing minimizes the importance of random #142840), HLA-DQB1 (MIM #604305), and HLA-DRB1 (MIM sequencing errors, but this is still not much of an improvement over #142857), upon 96 previously diagnosed samples using Illumina Sanger SBT in cost and throughput. At the same time, the amplicons PE 150 reads. We demonstrate RCHSBT to be a genotyping method involved are restricted to having a length within the maximum-read comparable to Sanger sequencing based typing in quality, while length of the sequencing machine, which increases the complexity being more cost effective, and higher throughput. and number of routes to failure in the method. Few reports about the success or failure of short-read NGS based genotyping methods have been published, though many short-read Material and Methods NGS platforms, such as Illumina sequencing, have lower sequencing costs and much higher sequencing throughput compared with 454 Samples sequencing [Bamshad et al., 2011]. The short sequence-read length, and the lack of a coherent genotyping methodology, is a major A total of 96 purified DNA samples with known HLA types from challenge. Currently, most short-read NGS methods have a read volunteers were selected. Written consent forms were collected from length of no more than 150 bp (sequence with paired-end read, all volunteers who were involved in this study. These samples had 150 × 2), so applying these methods to genotyping with the strategy been extensively characterized by a variety of HLA testing methods, mentioned above can only involve amplicons shorter than 300 bp, including Sanger SBT.The DNA sample information is listed inSupp. otherwise many ambiguous genotypes will be obtained. For target Table S1. DNA fragments longer than 300 bp, multiple iterations of PCR will RCHSBT follows four steps (Fig. 1A and B): be needed to control the amplicon length, increasing the complexity and the cost of the test. (1) Amplification, shearing, and sequencing No previously reported NGS-based genotyping method has so 15 HLA Class I and Class II exonic regions were amplified far showed significant improvement in cost and throughput, when by tag PCR (Supp. Fig. S1a). In this assay, both forward and compared with Sanger SBT. reverse PCR primers respectively contain an identifier 14 PCR It is for this reason that we have developed RCHSBT: an NGS- amplifications were performed for each of the 96 samples (Supp. based genotyping method that uses a massively parallel paired-end Fig. S1b). PCR products of these 96 samples were then pooled

Figure 1. Schematic representation of RCHSBT pipeline. A: The illustration of RCHSBT. Amplicons with read length comparable to Sanger sequencing reads, tagged with MIDs at both ends. They are pooled together and sheared, in pools with fragments ranging from NGS sequencing read length, to complete amplicon length, which are then recovered and sequenced by PE sequencing. Amplicons from different samples have different MIDs, and sequencing data are grouped to the specific loci of individual samples. Consensus sequences of amplicons are assembled by mapping grouped reads to the reference sequence, and used for genotyping. B: An overview of the bioinformatics pipeline of RCHSBT. After excluding low quality reads, sequencing reads are grouped to specific loci of individual samples. Grouped reads are mapped to the reference, and variants are called. Variants are phased, and used for constructing haploid sequences. The genotypes of amplicons are obtained by comparing haploid sequences to the gene database. The final genotype can then be obtained by logical deduction, as illustrated.

2 HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013 together. The pooled amplicons were fragmented, recovered specific samples. 0.59% had a MID/genomic primer mis-pairing, and used for library construction. The library was sequenced likely caused by template switching during PCR cycles [Odelberg with PE 150 bp reads on Illumina Miseq (See Supp. Methods et al., 1995]. The remaining 28.96% reads, without MIDs or genomic for detail). primer sequences, could not be readily utilized. (2) Grouping The poor amplification and amplification failure of some PCR DNA sequence reads of low quality were filtered out. The output can be confirmed by grouped reads. Groups with apparently cleaned PE reads were grouped to individual samples, based low or entirely absent reads will be excluded from the following on the MIDs, and grouped to the specific loci, by genomic analysis. Grouped reads were aligned to reference (Supp. Table S2) primer sequence. at HLA-A, HLA-B, HLA-C, HLA-DQB1,andHLA-DRB1 by BWA (3) Mapping and haploid sequence assembling [Li and Durbin, 2009; Li et al., 2008]. As the alignment parameters The MIDs and genomic primer sequence were trimmed before set were not strict (See Supp. Methods for detail), reads with many grouped PE reads were aligned locally to the reference sequence sequencing errors could still be mapped to the reference. 93.05% of (Supp. Table S2) by BWA. Variants were called by using SAM- all the grouped PE reads, with both ends intact, were successfully tools. The aligned PE reads with a number of mismatches larger mapped to the reference sequence, under the initial alignment pa- than its maximum allowable mismatches (MAM) cutoff (see rameters. The unaligned and single-end aligned reads were purged Supp. Methods for detail) was filtered. After filtering, the vari- from mapping. ants were called again by SAMtools. The number of aligned PE Variants were called by SAMtools [Li et al., 2009], in diploid reads supporting linkage of any two particular, heterozygous mode (See Supp. Methods for detail). The ratios of each ampli- SNPs in the amplicon region was investigated (Fig. 2A,I). All con’s heterozygous variants’ two alleles were analyzed. 94.16% had the reliable blocks were integrated to phase the heterozygous a ratio above 1:6 (the limit of detection for Sanger sequencing) on SNPs of the amplicon. The alignments were sorted into two average. We evaluated variant calling accuracy of RCHSBT at this parts (heterozygous alleles) or one part (homozygous allele), step by comparing consistency of variants on all 96 samples de- according to each given pair of complete SNP haploids. Consen- tected by RCHSBT and Sanger SBT (Supp. Table S3). The rate of sus sequences were assembled for each part by using SAMtools false positive and false negative variant calling was 6.21% and 0%, in haploid mode. Pairs of haploid sequence were used for the respectively—loci with allele amplification failure were not included genotyping which ensued. in this statistic. (4) Genotyping Based on the high accuracy of the initial variant calling, we used Genotypes were obtained by comparing the thus determined the MAM strategy to individually filter aligned PE reads (See Supp. haploid sequences to allele sequences already described in the Methods for detail). After this filtering, 18.38% of the alignments gene database (IMGT, http://www.ebi.ac.uk/imgt/hla/). Utiliz- with mismatch counts larger than MAM were removed from variant ing RCHSBT, amplicons with the length of Sanger sequencing calling. reads were readily genotyped by short PE sequencing. The rate of heterozygous variants with allele ratios above 1:6 was increased to 98.87%, and the rate of false positive calling reduced to 0.48%, but the rate of false negative calling was increased to 0.08%, Results as the result of this filtering. This increase in the rate of heterozygous variants with an allele ratio above 1:6, and this concurrent reduction PCR Amplification and Miseq Sequencing in false positives, mainly came from amplicons of HLA-C (partic- ularly C4). Before MAM filtering, 21.10% of heterozygous variants A total of 96 purified DNA samples with known HLA type were from amplicons of HLA-C had allele ratios below 1:6, and the rate amplified with a unique set of PCR primers at five loci (HLA-A, of false positive variant calling was 26.57%. After MAM filtering, HLA-B, HLA-C, HLA-DQB1,andHLA-DRB1), using tag PCR. The the ratio of heterozygous variants from amplicons of HLA-C with amplicons ranged from 290 to 950 bp in length. For most PCR, only an allele ratio below 1:6 was reduced to 0.22% and the rate of false a single band was observed following amplification and gel elec- positive variant calling was reduced to 0.18% (Supp. Table S3). trophoresis. Multiple bands, apparently weak bands, or no bands Depth of each site, for each amplicon of each of the 96 samples, were observed by gel electrophoresis when there was nonspecific was plotted in Supp. Figure S2. Average depth was 789× for all amplification, poor amplification, or amplification failure. All am- amplicons. For the exonic region, 99.9% have depths above 100×, plicons of the 96 samples were pooled, and randomly sheared. and 86.67% above 700×. For all grouped PE reads, because they have Sequencing libraries with fragments including library adapters at least one end containing MIDs and genomic primer sequences, between 400 and 700 bp in length were recovered by gel slicing. The all amplicons appeared to have the highest depth at the two ends, length range of the recovered DNA fragments was confirmed by use except for HLA-DRB1, which had highest depth in the middle. This of DNA bioanalyzer 2100. Sequencing was performed on Illumina was because HLA-DRB1 was the only amplicon shorter than 300 bp, Miseq with PE 150 bp reads, in a single run. A total of 1.89 Gb of a length that was within the sequence length of the Miseq PE 150 read data, 6.30 M PE reads was generated. 97.08% and 85.23% of the read. sequence fragments had an average quality value over Q20 and Q30, respectively. The average percentage of GC content was 59.94%. Variant Phasing and Haploid Sequence Assembly Sequencing Data Grouping and Variants Calling In order to further improve variant calling accuracy, we chose to Of the original 1.89 Gb of raw data, 1.86 Gb of high quality refine sequence reads aligned to the reference sequence. We con- sequence data remained for analysis after all low quality PE reads structed a series of blocks for every amplicon, using the PE reads. were purged (See Supp. Methods for detail). For the cleaned PE Each block reflected the linkage pattern of two particular heterozy- reads, 70.45% contained MIDs and genomic primer sequence in gous variants. We integrated all of those blocks to phase heterozy- at least one end, and could be grouped to specific amplicons of gous variants of the amplicons (Fig. 2A).

HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013 3 Figure 2. Schematic representation of variant phasing and haploid sequence assembly. A: Schematic of variant phasing: A,I: Record linkage relationships between bases located at heterozygous variant sites, according to paired-end reads. A,II: Blocking: All types of linkage relationships between heterozygous variants (blocks) are determined, as supported by PE reads. A,III: Phasing: Haploid variants are obtained, by integrating all the blocks to phase. B: Haploid sequence assembly: B,I: Before sorting, reads from both alleles are mixed. B,II: Sorting: According to the haploid variants, reads from the same alleles are sorted and clustered. B,III: Haploid sequences of the alleles are assembled according to their assigned read cluster.

For amplicons shorter than 770 bp, 99.87% were completely mine exonal genotypes. For each individual sample, the genotypes phased. 0.09% had one phasing breakpoint, 0.04% had two phasing of the most frequently studied exons (See Supp. Methods for detail) breakpoints. For amplicons longer than 770 bp, few amplicons were of the given HLA gene were defined as the candidate HLA allele completely phased, due to there not being enough long DNA frag- genotype, and used for method comparison. ments sequenced to support the linkage of heterozygous variants. The genotypes of other, less frequently studied exons were used We analyzed the length distribution of the DNA fragments, using to filter the candidate genotypes. After filtering, in cases where there the aligned PE reads (Supp. Fig. S3). Over 85.02% of the recovered were more than two possible HLA allele genotypes, two of the prob- fragments were within a length range of 150–400 bp, and only 2.52% able allele types were randomly paired. The genotypes, as defined had a length over 400 bp. The incomplete haploid pairs within each by these paired alleles, were then confirmed by a logical algorithm. amplicon were cross matched with each other to construct complete By logical deduction, if there was any ambiguity in the choice of haploid pairs. With pairs of haploid variants, we sorted the aligned genotype, information was inferred from intron variants to resolve sequence reads of every amplicon into two parts, and constructed the ambiguity. consensus haploid sequences for each portion of the sorted sequence Of the 480 genotypes from all 96 samples considered, assign- reads (Fig. 2B). ment was possible in 97.50% of the cases, and of those 97.50%, Variants were called by SAMtools for each haploid sequence, 100% of assignments were concordant with the original Sanger running in the haploid mode (See Supp. Methods for detail). We genotypes attached to the samples. Notably, among the assigned evaluated the improvement of variant calling accuracy by variant genotypes, there were several InDel genotypes, such as C∗04:82, phasing and haploid sequence assembly (Supp. Table S3). Through which has nine bases inserted in exon 5, compared with the ref- variant phasing and haploid sequence assembly, the rate of false erence. The assigned genotypes for HLA-A, HLA-B, HLA-C,and positive and false negative variant calling was reduced to 0.41% HLA-DQB1 of all samples were 100% unambiguous (Table 1). and 0.02%, respectively. Pairs of haploid variants were used for the Though one of the assigned genotypes for HLA-B of sample L002 ensuing genotyping. was still ambiguous after logical judgment (B∗15:02:01/15:01:01:01, B∗15:25:01/15:15), the ambiguity was resolved by inference, using Genotyping heterozygous variant information of intron 2 from the same gene. Because we only amplified parts of exon 2 of HLA-DRB1 for geno- The resulting haploid variants of each exon were compared with typing, the unambiguous assignment rate of HLA-DRB1 was only the local HLA variants database (Supp. Table S4), in order to deter- 48.96%.

4 HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013 Table 1. Genotyping Consistence Statistic Result of HLA Genes for 96 Samples

HLA-A HLA-B HLA-C HLA-DQB1 HLA-DRB1

Number of sample with unambiguous genotype assignment 94 92 93 96 47 Number of sample with ambiguous genotype assignment 0 0 0 0 46 Number of sample with no genotype assignment 2 4 3 0 3 Consistence of assigned genotype with Sanger result (%) 100 100 100 100 100

We performed an analysis to discover the reason why some alleles dent of allele database and allele frequency information, so, it can be were not assigned genotypes. For the 12 unassigned alleles, half of used for the genotyping of genes without allele database and allele them were caused by nonspecific amplification, 1/3 of them were frequency information, opening a wider application than previous caused by allele amplification failure, and the rest were caused by reported methods. Consequently, RCHSBT can be run indepen- insufficient data, resulting from poor PCR amplification efficiency. dently of allele database and frequency information, in order to genotype new and rare alleles. Discussion As other NGS-based genotyping methods mainly adopt the 454 sequencing platform, the short-read NGS based RCHSBT has a Using RCHSBT, whichtakes short sequencing reads as input, but substantive advantage in terms of sequencing costs and sequenc- uses a unique variant calling, haploid sequence assembling algo- ing throughput. RCHSBT capacity to genotype long amplicons (up rithm, we can accurately genotype with greater effective length per to 1.6 kb for Illumina) with short reads enables genotyping with amplicon than even Sanger sequencing reads. In the study, ampli- fewer amplicons than current 454 based genotyping methods, re- cons as long as 950 bp—over sixfold the length of the sequence-read ducing cost and improving throughput. A multisite study, using length were readily genotyped, using Illumina PE 150 sequence high-resolution HLA genotyping by 454 GS FlX, reported that they reads. According to the principle of RCHSBT, the longest sequence could genotype 20 samples for 10 HLA genes (17 exons involved) assembled can be two times the length of the longest PE sequencing in a single sequencing run, with reagent costs approximately equiv- insert. As the maximum DNA length in Illumina sequencing flow- alent to Sanger SBT [Holcomb et al., 2011]. In our pilot study, we cells is ca. 800 bp, the longest amplicon that can be assembled by genotyped 96 samples on 5 HLA genes (24 exons involved) in a RCHSBT is 1,600 bp with a single amplification. This high maxi- single Illumina Miseq sequencing run, enabling us to cut HLA typ- mum length could greatly increase the flexibility of primer design, ing costs by 80%, and to improve HLA typing throughput by 10× and consequently reduce the complexity of the assay. compared with Sanger SBT. Besides comparatively short-read length, a high rate of sequencing Compared with other short-read NGS based genotyping meth- error is another major difficulty for NGS-based genotyping when ods, RCHSBT constitutes an immense improvement. In a previously compared with Sanger, especially for clinical genotyping, where reported Illumina genotyping study [Wang et al., 2012], encoded se- resultant diagnosis errors are poorly tolerated. This is why the re- quencing adaptors were used to identify the source of sequence reads liability of RCHSBT, reliant on more accurate methods for variant for each sample, requiring library construction for each sample. In calling, variant phasing, and haploid assembly, is important. In this study, both primer MIDs and encoded sequencing adaptors RCHSBT, the MAM strategy effectively filters out both reads with were used to trace the source of sequence reads for each sample. excess sequencing errors, and reads which are mapped improperly. Amplicons with different primer MIDs were pooled together when This filtering facilitates accurate variant calling. constructing library, which greatly reduced the number of libraries In the pilot study, 18.38% of aligned reads with mismatch counts constructed. Although the initial synthesis of MID primers may be larger than their MAM were removed from variant calling, by MAM expensive, the cost of primer synthesis for each test will be compar- filtering. Using each sample’s known genotype corresponding vari- atively lower in large-scale projects using RCHSBT, and the number ants as a reference, overall rates of false positive variant calling were of primers synthesized can be adjusted according to the scale of the reduced from 6.21% to 0.48%, while overall rates of false negative project. The previous study reported genotyping four HLA genes variant calling increased from 0% to 0.08%—this latter increase due (25 exons in total) in 180 samples, for a single lane of Illumina Hiseq to MAM filtering removing reads with both an extraordinarily high 2000. With the same rate of output (30G), we estimate that at least error rate, but also utile information content, particularly impact- 1,500 samples, genotyping on five HLA genes, can be processed at ing variant calling in low depth regions (Supp. Table S3). However, the same time in a single Hiseq 2000 lane using RCHSBT, based on the majority of false negatives can be reduced by using the haploid the result of our pilot study. In unpublished data, we have randomly assembly process, hence the only moderate increase in FN. In the picked half of the 1.89 Gb raw data of the pilot study for genotyp- study, the overall rate of false negative variant calling was reduced ing, run RCHSBT on that data, and the results were comparable from 0.08% to 0.02%, after application of the haploid assembly pro- to results obtained by running RCHSBT on the whole set of raw cess. Besides this, the overall rate of false positive variant calling was data. This implies potential for further improvement of cost and reduced to 0.41% (Supp. Table S3). throughput of genotyping. Beyond this—and without considering nonspecific amplifica- Although the performance of RCHSBT was obtained from testing tion, amplification failure, and poor amplification factors—the final of HLA genes, similar result can be predicted when testing nonpoly- variant calling was 100% consistent with the previous Sanger SBT morphic genes. As few variants will be called for nonpolymorphic results, as previously described. Combined with deep sequencing genes, variants phasing and haploid sequence assembly will have lit- (789× on average), the analysis pipeline provides extreme reliabil- tle effect on genotyping accuracy. But as each sequence reads contain ity, effectively comparable to the best-of-class, as assigned genotypes few SNPs, their initial mapping are accurate, and deep sequencing were 100% concordant with the Sanger results. help remove the influence of sequencing error, people can still geno- ThekeypointofRCHSBTistoconstructthehaploidsequence type long amplicons accurately by RCHSBT using short sequence accurately. The haploid sequence construction process is indepen- read.

HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013 5 RCHSBT can meet the requirements of both medium-size geno- Bettinotti MP, Mitsuishi Y, Bibee K, Lau M, Terasaki PI. 1997. Comprehensive method typing projects, and large-scale genotyping projects. Combined with for the typing of HLA-A, B, and C alleles by direct sequencing of PCR products one of the very rapid NGS platforms, such as Illumina Miseq, hun- obtained from genomic DNA. J Immunother 20(6):425–430. Carrington M, O’Brien SJ. 2003. The influence of HLA genotype on AIDS. Annu Rev dreds of samples can be genotyped in under a week, at levels suitable Med 54:535–551. for clinical genotyping. When reporting time is not very impor- Castellani C, Picci L, Tamanini A, Girardi P, Rizzotti P, Assael BM. 2009. Association tant, people can choose high-throughput NGS platforms, such as between carrier screening and incidence of cystic fibrosis. JAMA 302(23):2573– Illumina Hiseq 2000. Tens of thousands of samples can be geno- 2579. ¨ ¨ typed in a single Hiseq 2000 run, at costs lower than Miseq, making DanzerM,NiklasN,StabentheinerS,HoferK,Proll J, Stuckler C, Raml E, Polin H, Gabriel C. 2013. Rapid, scalable and highly automated HLA genotyping using next- this method very suitable for large-scale genotyping projects, such generation sequencing: a transition from research to diagnostics. BMC Genomics as HLA genotyping for global scale bone marrow registries, gene 14(1):221. screening for common monogenic disease like thalassemia, and the Danzer M, Polin H, Hofer K, Proll J, Gabriel C. 2008. Characterisation of two novel ∗ ∗ like. HLA alleles, HLA-Cw 0429 and HLA-DRB3 0223. Tissue Antigens 72(5):498– 499. Although the advantages of RCHSBT were obvious, the limita- Danzer M, Polin H, Proll J, Hofer K, Fae I, Fischer GF, Gabriel C. 2007. High-throughput tions must be considered. The longer the amplicon, the less per- sequence-based typing strategy for HLA-DRB1 based on real-time polymerase centage of raw data can be used. PE reads having no MIDs and chain reaction. Hum Immunol 68(11):915–917. genomic primer sequence in at least one end are useless, as they Gabriel C, Danzer M, Hackl C, Kopal G, Hufnagl P, Hofer K, Polin H, Stabentheiner S, ¨ cannot be grouped to specific samples for the following analysis. Proll J. 2009. Rapid high-throughput human leukocyte antigen typing by massively parallel pyrosequencing for high-resolution allele identification. Hum Immunol In unpublished data, we tested amplicons with a length of 1.4 kb 70(11):960–964. using RCHSBT, and we found only about 40% of the raw data were Galan M, Guivier E, Caraux G, Charbonnel N, Cosson J-F. 2010. A 454 multiplex useful. Also, as we found the longer the amplicon, the lower the sequencing method for rapid and reliable genotyping of highly polymorphic genes depth in the middle of amplicon was, much more data are needed in large-scale studies. BMC Genomics 11(1):296. Gravitt PE, Peyton CL, Apple RJ, Wheeler CM. 1998. Genotyping of 27 human papil- for genotyping long amplicons using RCHSBT. Another limitation lomavirus types by using L1 consensus PCR products by a single-hybridization, is about SNP phasing. For SNPs within the same amplicons, we reverse line blot detection method. J Clin Microbiol 36(10):3020–3027. can phase them well, and the maximum length of amplicon can be Holcomb CL, Hoglund¨ B, Anderson MW, Blake LA, Bohme¨ I, Egholm M, Ferriola D, two times the length of the longest PE sequencing insert. For SNPs Gabriel C, Gelber SE, Goodridge D, Hawbecker S, Klein R, et al. 2011. A multi- within different amplicons, we can phase them well only in the case site study using high-resolution HLA genotyping by next generation sequencing. Tissue Antigens 77(3):206–217. of the two amplicons are overlapped and they have common SNPs Lank SM, Wiseman RW, Dudley DM, O’Connor DH. 2010. A novel single cDNA between the two SNPs we are going to phase. amplicon pyrosequencing method for high-throughput, cost-effective sequence- In this assay, most of our time and labor was spent on PCR am- based HLA class I genotyping. Hum Immunol 71(10):1011–1017. plification. We know that the key to scaling PCR-based sequence Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760. enrichment involves automation, miniaturization, and multiplex- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, ing of reactions. There are commercially available platforms which Durbin R. 2009. The Sequence Alignment/Map format and SAMtools. Bioinfor- enable miniaturized PCR by using microfluidics (Fluidigm). The matics 25(16):2078–2079. combination of RCHSBT and microfluidic PCR can further im- Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads and calling variants prove genotyping throughput and cost, making it highly suitable using mapping quality scores. Genome Res 18(11):1851–1858. LiaoC,MoQH,LiJ,LiLY,HuangYN,HuaL,LiQM,ZhangJZ,FengQ,ZengR, for studies involving large numbers of amplifications, or large num- Zhong HZ, Jia SQ, et al. 2005. Carrier screening for alpha- and beta-thalassemia in bers of genes for each sample. pregnancy: the results of an 11-year prospective program in Guangzhou Maternal and Neonatal hospital. Prenat Diagn 25(2):163–171. Metzker ML. 2009. Sequencing technologies—the next generation. Nat Rev Genet Acknowledgments 11(1):31–46. Naito H, Hayashi S, Abe K. 2001. Rapid and specific genotyping system for hepatitis B The authors thank the members of BGI-Shenzhen genome sequencing team virus corresponding to six major genotypes by PCR using type-specific primers. J for help with sequencing. Clin Microbiol 39(1):362–364. Disclosure statement: The authors declare no conflict of interest. Napolitano C, Priori SG, Schwartz PJ, Bloise R, Ronchetti E, Nastoli J, Bottelli G, Cerrone M, Leonardi S. 2005. Genetic testing in the long QT syndrome: development and validation of an efficient approach to genotyping in clinical practice. JAMA 294(23):2975–2980 References NgJ,HurleyCK,Baxter-LoweLA,ChopekM,CoppoPA,HeglandJ,KuKurugaD, Monos D, Rosner G, Schmeckpeper B, Yang SY, Dupont B, et al. 1993. Large-scale Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J. oligonucleotide typing for HLA-DRB1/3/4 and HLA-DQB1 is highly accurate, 2011. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev specific, and reliable. Tissue Antigens 42(5):473–479. Genet 12(11):745–755. Odelberg SJ, WeissRB, Hata A, White R. 1995. Template-switching during DNA synthe- Bell CJ, Dinwiddie DL, Miller NA, Hateley SL, Ganusova EE, Mudge J, Langley RJ, Zhang sis by Thermus aquaticus DNA polymerase I. Nucleic Acids Res 23(11):2049–2057. L, Lee CC, Schilkey FD, Sheth V,Woodward JE, et al. 2011. Carrier testing for severe Voelkerding KV, Dames SA, Durtschi JD. 2009. Next-generation sequencing: from basic childhood recessive diseases by next-generation sequencing. Sci Transl Med 3(65): research to diagnostics. Clin Chem 55(4):641–658. 65ra4. Wang C, Krishnakumar S, Wilhelmy J, Babrzadeh F, Stepanyan L, Su LF, Levin- Bentley G, Higuchi R, Hoglund B, Goodridge D, Sayer D, Trachtenberg EA, Erlich son D, Fernandez-Vina MA, Davis RW, Davis MM, Mindrinos M. 2012. High- HA. 2009. High-resolution, high-throughput HLA genotyping by next-generation throughput, high-fidelity HLA genotyping with deep sequencing. Proc Natl Acad sequencing. Tissue Antigens 74(5):393–403. Sci USA 109(22):8676–8681.

6 HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013 Paper II

An integrated Tool to study MHC Region: Accurate SNV Detection and HLA Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing.

By Hongzhi Cao1,3, Jinghua Wu1, Yu Wang1, Hui Jiang1,3, Tao Zhang1, Xiao Liu1,3, Yinyin Xu1, Dequan Liang1, Peng Gao1, Yepeng Sun1, Benjamin Gifford2, Mark D’Ascenzo2, Xiaomin Liu1, Laurent C. A. M. Tellier1,3, Fang Yang1, Xin Tong1, Dan Chen1, Jing Zheng1, Weiyang Li1, Todd Richmond2, Xun Xu1, Jun Wang1,3,4,5, Yingrui Li1

1Science and Technology Department, BGI-Shenzhen, Shenzhen, China, 2Roche NimbleGen, Inc., Madison, Wisconsin, United States of America, 3Department of Biology, University of Copenhagen, Copenhagen, Denmark, 4King Abdulaziz University, Jeddah, Saudi Arabia, 5The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Publication details Published in Plos One 8(7):e69388 (2013)

An Integrated Tool to Study MHC Region: Accurate SNV Detection and HLA Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing

Hongzhi Cao1,3., Jinghua Wu1., Yu Wang1., Hui Jiang1,3., Tao Zhang1., Xiao Liu1,3, Yinyin Xu1, Dequan Liang1, Peng Gao1, Yepeng Sun1, Benjamin Gifford2, Mark D’Ascenzo2, Xiaomin Liu1, Laurent C. A. M. Tellier1,3, Fang Yang1, Xin Tong1, Dan Chen1, Jing Zheng1, Weiyang Li1, Todd Richmond2, Xun Xu1, Jun Wang1,3,4,5*, Yingrui Li1* 1 Science and Technology Department, BGI-Shenzhen, Shenzhen, China, 2 Roche NimbleGen, Inc., Madison, Wisconsin, United States of America, 3 Department of Biology, University of Copenhagen, Copenhagen, Denmark, 4 King Abdulaziz University, Jeddah, Saudi Arabia, 5 The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Abstract The major histocompatibility complex (MHC) is one of the most variable and gene-dense regions of the human genome. Most studies of the MHC, and associated regions, focus on minor variants and HLA typing, many of which have been demonstrated to be associated with human disease susceptibility and metabolic pathways. However, the detection of variants in the MHC region, and diagnostic HLA typing, still lacks a coherent, standardized, cost effective and high coverage protocol of clinical quality and reliability. In this paper, we presented such a method for the accurate detection of minor variants and HLA types in the human MHC region, using high-throughput, high-coverage sequencing of target regions. A probe set was designed to template upon the 8 annotated human MHC haplotypes, and to encompass the 5 megabases (Mb) of the extended MHC region. We deployed our probes upon three, genetically diverse human samples for probe set evaluation, and sequencing data show that ,97% of the MHC region, and over 99% of the genes in MHC region, are covered with sufficient depth and good evenness. 98% of genotypes called by this capture sequencing prove consistent with established HapMap genotypes. We have concurrently developed a one-step pipeline for calling any HLA type referenced in the IMGT/HLA database from this target capture sequencing data, which shows over 96% typing accuracy when deployed at 4 digital resolution. This cost-effective and highly accurate approach for variant detection and HLA typing in the MHC region may lend further insight into immune-mediated diseases studies, and may find clinical utility in transplantation medicine research. This one-step pipeline is released for general evaluation and use by the scientific community.

Citation: Cao H, Wu J, Wang Y, Jiang H, Zhang T, et al. (2013) An Integrated Tool to Study MHC Region: Accurate SNV Detection and HLA Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing. PLoS ONE 8(7): e69388. doi:10.1371/journal.pone.0069388 Editor: Xi (Erick) Lin, Emory Univ. School of Medicine, United States of America Received April 1, 2013; Accepted June 7, 2013; Published July 24, 2013 Copyright: ß 2013 Cao et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: The study was supported by the National High Technology Research and Development Program of China (NO.2012AA02A201), and the funder’s website is: http://www.863.gov.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: This manuscript was finished by collaboration between Roche NimbleGen and BGI, in which Roche NimbleGen was in charge of the design and production of probes and BGI was in charge of the selection of target region and test of the probes. The collaboration has been published as news in website of Roche NimbleGen (http://www.roche.com/media/media_releases/med_dia_2011-11-28.htm) and the website of BGI (http://www.genomics.cn/en/ news/show_news?nid = 98954). Three persons in the author list are employee of Roche NimbleGen, they are Benjamin Gifford, who was the primary designerof the probes; Mark D’Ascenzo, who was in charge of design optimization; and Todd Richmond, who was consulted on the design. Roche NimbleGen has the right to produce and sale the product freely and this does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials. * E-mail: [email protected] (YL); [email protected] (J. Wang) . These authors contributed equally to this work.

Introduction reported to be associated with proteins coded by the genes in MHC region [4]. The MHC region, one of the most gene-dense regions of the In addition to its high gene density, the MHC region is also one human genome, is located on the short arm of human of the most complex regions in the human genome, due to the chromosome 6. It covers over 200 genes, 128 of which are extremely high density of polymorphism and linkage disequilib- predicted to be expressed [1]. Most MHC genes play a rium (LD). This inherent complexity has made identification of the fundamental role in immunity, and show a close relationship underlying, causative variants contributing to disease phenotypes with immune-mediated diseases [2,3]. Over 30 years, numerous by genome-wide association study (GWAS) a challenge. The single studies have demonstrated association between some MHC nucleotide variations (SNV) which distinguish PGF and COX, the alleles and some disease susceptibilities. Today, more than 100 two most significant MHC haplotypes in the European popula- diseases, - with relation to infection, inflammation, autoimmu- tion, are at variant densities of ,3.46 10–3, higher than the nity, drug sensitivity, and transplantation medicine -have been estimated average heterozygosity between any other two haplo-

PLOS ONE | www.plosone.org 1 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region types in the human genome [5]. Recognizing the importance of Results fully informative polymorphism and haplotype maps of the MHC region, as pertaining to MHC-related-diseases, the MHC Haplo- Evaluation of Target Capture Sequencing, in Terms of type Consortium has conducted the MHC Haplotype Project Coverage between 2000 and 2006, and provided the sequence and We generate sequence libraries from genomic DNA (gDNA) of annotations of eight different HLA-homozygous-typing haplotypes the three samples mentioned in the method. After filtering reads (PGF, COX, QBL, APD, DBB, MANN, MCF and SSTO) [6]. with low sequence quality or sequencing adaptor, the purged data The Single Nucleotide Polymorphism (SNP) count in dbSNP for are mapped to the human genome reference sequence hg19, and the MHC region has increased rapidly from 69,076 SNPs in more than 60% of the mapped reads are proved to align to the database version 130 (Apr 30, 2009) to 169,279 in database MHC region (Table 1). Although the probes themselves cover only version 135 (Oct 13, 2011), due in large part to the progress of the 72.8% of the defined MHC region (3,620,871 bp of 1000 genome project. Although next generation sequencing (NGS) 4,970,558 bp), the alignment result proves that 97% of the has made the cost of human genome sequencing decrease defined MHC region is covered by at least one read, and over dramatically, the detection and analysis of all variants in the 95% by at least four reads (Table 1). human MHC region for a large human cohort by whole genome Several hundred samples have been captured and sequenced sequencing (WGS) is still a challenge for researchers. Particularly using this method, and the coverage of the MHC region is over for researchers who focus on genomic variations in MHC region, 96% for all samples (data not shown). We investigated the 3–4% of and for researchers who use genomic information to do HLA uncovered regions and found that more than 99% of the typing, targeted region sequencing stands out as a promising and uncovered bases were located in a long repeating region with neotypal approach. length .2000 base per repeat (data not shown), which was difficult The human leukocyte antigen (HLA) genes, the most studied to solve with capture method due to NGS reads typically being too genes in the MHC region, encode cell-surface proteins responsible short to span the breadth of the long repeat region. The MHC for antigen peptide presentation in adaptive immune response. region depth distribution showed a similar Poisson distribution for Considering the key role of HLA genes in immunology, HLA all three samples (Figure 1), indicating an even enrichment of the typing is widely used for matching donors and recipients in organ MHC region. and hematopoietic stem cell transplantation in the clinical context For further evaluation of the coverage of the genes in the MHC [7,8], particularly for variants at the four gene loci HLA-A, -B, -C region, RefSeq genes are downloaded from the UCSC table and -DRB1. High resolution HLA matching is required for browser (Hg19) [14], of which 213 genes and 380 transcripts are unrelated donor hematopoietic cell transplants [9]. Although annotated in the defined MHC region. The average coverage of precise HLA matching improves overall transplant survival and the longest transcripts of the 213 genes in the MHC region was reduces the risk of rejection, graft-versus-host disease (GVHD) 99.69% (Table S1). As to the coverage of the exons, the average remains a significant and potentially life-threatening complication coverage of all exons was 99.94% and 210 of 213 genes were after hematopoietic cell transplantation (HCT), even when using 100% covered for all exons for all three samples (Table S1). This HLA-matched transplants. Besides the five classic HLA genes high coverage of the genes in MHC region makes it possible to (HLA-A, -B, -C, -DRB1 and -DQB1), additional genes in the detect all variations therein. MHC region, encoding unidentified transplantation antigens, are proposed to be responsible for GVHD [10]. Pro¨ll et al. found Evaluation of Variation Detection Accuracy 3,025 small variations between recipients and donors undergoing Use of the capture sequencing method for mutation discovery is unrelated HLA-matched allogeneic stem cell transplantation, and critically dependent on the accurate detection of polymorphisms proposed that various differences causing non-synonymous amino and genotypes. Based on the target capture sequencing data, using acid exchanges may lead to GVHD [11]. Therefore, choosing an GATK tools (v1.43) for YH, NA18532 and NA18555, respective- HLA matched unrelated donor with the smallest number of ly, we detected 19,002, 19,007 and 20,022 SNPs, and 2,332, differences to the recipient may help to reduce risk of rejection and 2,077, and 2,513 insertions and deletions (InDels). increase odds of transplantee survival. When analyzing the distribution of variations, we found that Target region capture sequencing was developed mainly to most of those SNPs were located in HLA-A, -B, -C, -DR, -DP and enrich and sequence specific regions of particular interest, such as -DQ loci (Figure 2), leading to high degree of polymorphism specific exons or entire exomes [12,13]. Target Region Sequenc- within those genes. An average 3% of the SNPs identified in our ing (TRS) may be a cost-effective solution for studying MHC study were determined to be novel, defined by their not being region; however, successful enrichments of large, contiguous, present in dbSNP132, or not being extant in the 1000 Genomes highly variable genomic regions has previously been thought to be project (1000 Genomes Project Consortium, http://www. problematic, because of ubiquitous repetitive regions and other- 1000genomes.org, released on 23 November 2010). Using the wise high diversity between haplotypes. In this study, a probe set ANNOVAR bioinformatics toolset (http://www. was carefully designed for the target regions, using sequences of openbioinformatics.org/annovar) to functionally annotate the eight MHC haplotypes, to capture DNA fragments from the genetic variants in MHC region for the three case samples, we human MHC region, and then processed via high throughput showed that about 2% of all detected variants were within exonic sequencing. Using this probe set, we are able to sequence ,97% of regions, 16% within intronic regions, and 70% within regions the defined MHC region with high SNV calling accuracy by this defined as intergenic (Table S2). Further investigation showed that capture sequencing technology. We have also developed a HLA most of the nonsynonymous variants resulting in changes to amino typing pipeline to type all of the genes in the IMGT/HLA acid sequences - and hence probable functional differences in the database (http://www.ebi.ac.uk/imgt/hla,Release 3.9.0) with protein - were also located in HLA-A, -B, -C, -DR, -DP and -DQ high accuracy, using this capture sequencing data or whole genes, putatively resulting in different peptide-binding interactions genome sequencing data. This toolset will improve and extend the with T cell receptors. usage of both capture sequencing and whole genome sequencing In order to evaluate the accuracy of SNP detection, we data in related studies which rely on HLA typing. compared our genotype result with the Illumina 2.5 M BeadChip

PLOS ONE | www.plosone.org 2 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region

Figure 1. Distribution of per-base coverage depths in the MHC region for three samples. X-axis denotes coverage depth, Y-axis indicates percentage of total target region with a given sequencing depth. The fraction of target bases with zero coverage is not shown in the figure. doi:10.1371/journal.pone.0069388.g001 genotype result for YH, and with HapMap genotype result for HapMap data for NA18532 and NA18555. This count shows that NA18555 and NA18532. There were 6,099 sites on the Illumina most of the sites had equal support reads for the non-reference 2.5 M BeadChip which are within the MHC region, and 13,920 allele and reference allele, by which we infer that our probes and 14,036 genotypes in the HapMap database for NA18532 and display good enrichment balance for the two haplotypes (Figure 3). NA18555 respectively. 6,066 (99.5%), 13,775 (99.0%) and 13,895 In order to validate the authenticity of the novel SNPs and (99.0%) of these genotypes are identified by target capture InDels, two already proved fosmids covering the HLA–B gene sequencing data with an accuracy of 98.83%, 97.82% and locus were generated, based on our previous project using YH 98.26%, respectively, for the three study samples (Table S3, S4 gDNA. Both of the selected fosmids covered the MHC region of and S5). The rate of divergence (about 2%) of our genotype results chr6 :31,317,939 – 31,344,876 (hg19), but they each are from the HapMap genotypes is calculated to be higher than the templated on different haplotypes. Using the sequence data from estimated average error rate of HapMap genotypes of 0.5% [15]. fosmids, we validated the accuracy of 379 of the 382 SNPs Two-thirds of the divergent loci show missing data at one allele in (99.21%) (including 14 of the 15 novel SNPs) and 35 of the 37 the HapMap genotype, which indicating that these inconsistent InDels (94.59%) (data not shown). sites might result from failure of amplification, caused by the high background heterozygosity of the MHC region. Due to this high Description of HLA Typing Method degree of divergence between any given pair of MHC haplotypes, Since the target capture sequencing data can be expected to biased enrichment data may likewise lead to an erroneous result. cover more than 99% of genes in MHC region, it has the potential In order to evaluate the bias of our reads on two different MHC to give all MHC HLA gene types. Here, based on the above cited haplotypes, we therefore count the support read number for each TRS data, we have developed an HLA typing pipeline which allele at the heterozygous locus in the BeadChip for YH, and enables typing all of the HLA genes in the IMGT/HLA database until the date of publishing. The pipeline takes the aligned BAM files as input, and outputs the most reliable HLA types for each Table 1. Data production and mapping results for the three gene (Figure 4). samples used. A brief summary of the pipeline was as follow: (1) Read preparation: Paired reads with either end mapped to target gene region, and unmapped paired reads with high Samples YH NA18532 NA18555 quality, are categorized by gene, and extracted from the BAM. (2) Target region ( bp) 4970558 4970558 4970558 Quality control and error correction: The collected reads of Raw reads 7988210 5377408 5457998 each gene are aligned to the IMGT/HLA database, which Raw data (Mb) 719 484 491 contains the sequence of all currently known types, using Basic Local Alignment Search Tool (BLAST, version 2.2.24). Reads Mapped reads 7766430 4875514 4991960 with a maximum of two discordant (SNP/InDel) to the most Uniquely mapped reads 7317777 4559340 4672521 similar reference sequence in the database are kept and corrected Reads uniquely mapped to target 4794113 3023307 3100770 to the closest reference sequence, while maintaining and adding to Capture specificity 65.51% 66.31% 66.36% a record of the discordant SNP/InDel variants. (3) Haplotype Mean fold coverage depth (x) 87.32 55.09 56.50 sequence assemble: For each exon, reads are randomly assembled, with a minimum overlap region length of 10 bp for Percent bases with coverage $1x 97.29 96.95 97.20 reads with no difference allowed in the overlap region, so as to Percent bases with coverage $4x 96.52 95.66 95.95 construct haplotypes. Artificial haplotypes that do not exist in Percent bases with coverage $10x 95.50 93.84 94.13 IMGT/HLA database are filtered out, and remaining haplotypes are given quality scores as defined below (see method). (4) Typing Target regions here refer to contiguous MHC regions, rather than to the region actually covered by the designed probes. Capture specificity is defined as the according to assembled haplotype sequence: The final percentage of uniquely mapped reads aligning to the target region. haplotype of the sample is determined by rank of quality score (See doi:10.1371/journal.pone.0069388.t001 method).

PLOS ONE | www.plosone.org 3 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region

Figure 2. Single nucleotide variation distribution across the whole MHC region for three samples. The MHC region was split into 4971 parts, with 1000 bp in each part. X-axis denotes the 4971 parts and Y-axis indicates the number of SNPs in each part. doi:10.1371/journal.pone.0069388.g002

Evaluation of HLA Typing Accuracy (data not shown). This result indicates the high performance and Firstly, we test our HLA typing pipeline on simulated data. reliability of our targeted region pipeline. Based on the sequence of eight known haplotypes, sequenced by Sanger’s MHC haplotype project in 2008 with a foreknown Discussion HLA type result, we simulated regional sequence data for a In our study, we have developed an efficient approach for diploid sample MHC, and ran the HLA typing pipeline upon it. targeted region sequencing of the MHC region of the human For each sample, we typed 26 HLA genes, as recorded in genome, focusing on detection of minor variants and HLA types, IMGT/HLA database. The simulated data showed the accuracy in a high-coverage, high-reliability and low-cost framework. of our method at 2 and 4 digital resolutions, and at sequencing Beyond the 3.4 Mb of the standard MHC region, we designed depths ranging from 206 to 1006 with sequencing error rates probes for a continuous 5 Mb genomic region, to include further ranging from 0–2% (Table S6). The results of this typing were information from the extended high linkage disequilibrium area of then validated against the HLA type given by the MHC the broader MHC region. haplotype project. Our data presents 97% coverage for the broader MHC region We then used this pipeline on three real human genome and 99.69% for the transcripts of the crucial MHC genes, in all samples selected for this study. For each sample, we typed the three testing samples. Compared with the method of Proll [11], same 26 HLA genes recorded in the IMGT/HLA database. In our study showed a significantly higher coverage of the MHC order to estimate the accuracy of the HLA typing result, we typed region. Moreover, the accuracy of SNVs identified by our target the nine HLA genes with the highest diversity in the IMGT/HLA capture sequencing data was in particularly high concordance with database using the gold standard: Polymerase Chain Reaction genotype data established by the most reliable methodology for (PCR) based Sanger sequencing, at 4 digital resolution. When our this purpose, indicating the high degree of reliability of our pipeline was compared with this standard, 100% of the 54 HLA approach. alleles typed by our method were found consistent with those of To overcome the challenge of accurately genotyping the the Sanger standard at 4 digital resolution (Table S7). extremely high density of polymorphisms and linkage disequilibria In a second test project, 190 human samples which had already in the human MHC region, our method optimized the standard been typed by traditional SSO or PCR based Sanger sequencing target capture sequencing pipeline. Unlike traditional probes, method (the same gold standard) were captured and sequenced which are designed only based on one of the haplotypes of the using our pipeline, comparing in particular on five genes (HLA-A, human reference sequence, the probe design for the MHC region -B, -C, -DRB1 and -DQB1). The comparison showed that the in this study is based on all eight major MHC haplotype concordance of our method with the traditional gold standard was sequences. This enables capture of less common haplotypes in 96.4% at 4 digital resolution, and 98.8% at 2 digital resolution this region, ultimately reducing capture bias and failure. We also

PLOS ONE | www.plosone.org 4 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region

Figure 3. Capture bias evaluation using heterozygous genotypes in the Beadchip for three samples. X-axis denotes the support reads number of the reference allele and Y-axis denotes the support reads number of the non-reference allele. doi:10.1371/journal.pone.0069388.g003 shear DNA fragments to ,500 bp, increasing the coverage of the higher resolution in typing, as well the potential for the detection MHC region compared to the shorter fragments length around of novel types (a not infrequent occurrence in this highly ,200 bp (Table S8) which is more typically used in conventional polymorphic region), and an important clinical clue in the event exon capture libraries. This is particularly useful in the MHC of tissue rejection and GVHD. High coverage data allows us to because probes must be annealed to unique regions, but some detect all variants and type all HLA genes in the MHC region in repetitive regions, which are not covered by the probes, can be one-step with low cost. Some researchers suggest that the enriched by flanking to unique genome regions close to them, by transplant barrier is defined not only by classical HLA genes, using a longer and more unique genome fragment. but also by non-HLA genetic variation within the MHC [10], but Based on this high-coverage, low-bias capture data, we achieve many previously dominant methods have been unable to gather exceptionally accurate SNP and InDel results, and exceptionally data of sufficient breadth to verify or refute this contention. accurate HLA gene type results in MHC region, at a level which One limitation of our target region method, as well as most before could only be achieved by a more specialized and expensive other methods, is the inability to fully cover long repeated regions, experimental method, or based on a large database. The and present 100% coverage of MHC region. This difficulty may traditionally utilized HLA typing methods - such as sequence later be overcome by shearing gDNA to longer fragments specific oligonucleotide (SSO), sequence specific primer (SSP) [16], (5000 bp, or longer), capturing using probes of this size, and PCR based capillary sequencing and next generation sequencing sequencing on the third generation sequencing platforms. A method [17] - only detect variants in two or three exons, and it is solution to this difficulty is important and may open the possibility very possible for new alleles in other regions of the gene to be of phasing long MHC haplotype blocks at low cost, ultimately missed when using these traditional methods. Another forwar- reducing the risk of transplantation rejection and GVHD for those dlooking method has been reported for the imputation classical less able to afford a more expensive solution. HLA alleles and corresponding amino acid based on the SNP information in the MHC region [18], but it demands a huge well Conclusion studied reference panel. Our data covers essentially all regions of The MHC region harbors the highest density of immunity- known HLA genes, and has the potential for more accuracy and crucial genes in the human genome, and is arguably the region

PLOS ONE | www.plosone.org 5 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region

Figure 4. The workflow of HLA typing method. doi:10.1371/journal.pone.0069388.g004 most central to immunogenetics and tissue transplantation. The corresponding to the genomic region from chr6:28,477,797 to ability to accurately call variants and HLA types is very important chr6:33,448,354 in the human reference genome (NCBI release to immunology related studies in general, and to clinical GRCh37, UCSC release hg19), encompassing a total of application in particular. In this paper, we outline our develop- 4,970,558 bp. ment of a novel, low cost pipeline for the identification of variants Unlike previous studies of the MHC region, we also include the and typing of HLA alleles in the MHC region in a one step other seven MHC haplotypes aside from PGF - namely COX, protocol with high accuracy. This pipeline has great potential for QBL, APD, DBB, MANN, MCF and SSTO - in our probes use in the study of immune disease and transplantation medicine. design so as to improve the coverage of the MHC region and minimize the effect of the regionally high polymorphism density. A Materials and Methods maximum allowance of greater than 5 close matches is set to get optimal capture efficiency of such high polymorphic regions. In Design of MHC Capture Probes total, 3,620,871 bp (about 72.8% of the 5 Mb MHC region) was The classical MHC region is defined as a ,3.4 Mb region covered by the capture probes for PGF, and the coverage rose to stretching between the gene C6orf40 (ZFP57) and the gene 4150331 bp (83.5%) when probes extended by 100 bp at both HCG24 [19]. In this study, due to extreme linkage disequilibrium end. The design name is 110729_HG19_MHC_L2R_D03_EZ of the region, and due to the presence of MHC-relevant genes and design file can be downloaded from Roche NimbleGen nearby [19], we extended the MHC region to the telomeric gene website (http://www.nimblegen.com/products/seqcap/ez/ GPX5 and the centromeric gene ZBTB9 in the probes design, designs/index.html) in the product ‘‘human MHC design’’. Five

PLOS ONE | www.plosone.org 6 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region capture arrays were initially produced for testing the uniformity of detected using GATK for the regions with at least 46 sequencing coverage, and the regions with extremely low or high sequencing depth as described previously [24,25]. depth were rebalanced by adjusting the number of probes - probe count for regions with extremely low sequencing depth was HLA Genes Haplotype Sequences Assemble increased, and for regions with extremely high sequencing depth As showed in Figure 4, De novo assembling method was used to was decreased. Target region sequencing of the MHC was construct the haploid sequence for each exon by using overlapped conducted for three samples using the set of rebalanced probes. reads. The assembled sequence was randomly paired if the whole exon was broken by low depth region. We filtered those sequences Samples Collection that did not exist in the IMGT/HLA database. As for the Three samples were used to evaluate the performance of the remaining haplotype sequences, we scored them according to the MHC region capture strategy. One was YH, which had previously following formula: been whole genome deep sequenced in 2008 [20], and genotyped using Illumina 2.5 M BeadChip. The other two samples were cell Score~C2|N= log (R) line DNA, NA18532 and NA18555, which were purchased from Coriell Institute, and which had previously been analyzed using high-throughput SNP genotyping in the HapMap project [21]. C: Coverage of this assembled sequence to its corresponding Nine HLA allele (HLA-A, HLA–B, HLA–C, HLA-DRB1, exon; HLA-DQB1, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-G) N: Number of reads used to construct the assembled sequence; types for these samples were already detected using the gold R: Reliability of this assembled sequence. It was calculated using standard sequence based typing method, in which the exonic the following formula: regions of HLA genes were initially amplified by PCR, and the amplicons were subsequently sequenced using Sanger method. P hi ðÞXi{X 27X Whole Genome Shotgun Library Construction R~ : logðÞdf High molecular weight gDNA was extracted using DNeasy Blood & Tissue Kits (QIAGEN, 69581). Shotgun libraries were generated from 3 micrograms ( mg) of genomic DNA followed by Xi: The depth of each base of the assembled sequence; the manufacturer’s instruments (Illumina). gDNA in Tris-EDTA X: The mean depth of the assembled sequence; was sheared into 400–600 bp fragments using a Covaris S2 df: The degrees of freedom of the assembled sequence. (Covaris). The overhangs were then converted to blunt ends, using df = N21, N equals the length of assembled sequence. T4 Polynucleotide Kinase, T4 DNA polymerase and Klenow polymerase. The fragments were then A-tailed using Klenow (39– Typing According to Assembled Haplotype Sequence 59 exo-). Next, Illumina sequencing adapters with a single ‘‘T’ base We combined the assembled sequences (called haplotypes) of all overhang were linked to the A-tailed sample using T4 DNA exons together to produce a group of types, and then scored them Ligase. Finally, the fragments with adapters were enriched via four using the following formula: cycles of PCR.

MHC Region Capture and High Throughput Sequencing TScore~N|S One mg of prepared sample library was hybridized to the capture probes following the manufacturer’s protocols (Roche NimbleGen). The hybridization mixture was incubated for 70 h at N: The number of reads that support the haplotype. 65uC. After the hybridization was finished, the target fragments S: The score of the haplotype. were captured with M-280 Streptavidin Dynabeads (Invitrogen), Based on the assumption that number of reads mapped to the and then the captured sample was washed twice at 47uC and three correct haplotype is larger than that mapped to the incorrect more times at room temperature using the manufacturer’s buffers. haplotype, we first select the two types with the highest TScore, The captured target fragments were amplified using PlatinumH which can explain the highest number of the sequence reads for all Pfx DNA Polymerase (Invitrogen) at the following condition: 94uC exons as candidate types, and then we divide the sequence reads for 2 min, followed by 15 cycles of 94uC for 15 s, 58uC for 30 s, into two corresponding groups. We judge whether a HLA gene 72uC for 50 s and a final extension of 72uC for 5 min. The PCR would be a homozygous or heterozygous, depending on the ratio products were purified and sequenced with standard 26 91 of unique supporting reads number of the sub-optimal type, paired-end reads on the Illumina HiSeq2000 sequencer following compared to the supporting reads number of the best type. The manufacturer’s instructions. threshold of a heterozygote is 0.1. If the ratio is smaller then 0.1, the gene is defined as a homozygote, and the type with the largest Sequencing Read Mapping and Variant Calling score is chosen as the dominant type. Otherwise, the typed gene is Low quality and adapter contaminated reads were filtered out defined as heterozygote, and the two types with the highest score to get the clean reads and then they were aligned to the reference are chosen as the final type result. Finally, we assess the reliability assembly (UCSC Hg19) with haplotypes sequence removed using of the final type by comparing it with the closed type using AScore: Burrows-Wheeler Aligner (BWA, version 0.5.9) with the param- eters -o 1 -e 63 -i 15 -L -k 2 -l 31 -q 10–I [22]. PCR duplicates S0{S1 2 ~ ðÞ were removed by Samtools (version 0.1.17) [23]. Base qualities AScore 2 2 ðÞS0{S1 zðÞS0{S2 were recalibrated and reads were realigned around potential insertions and/or deletions using The Genome AnalysisToolkit (GATK) software (version 1.43). SNPs and InDel were also

PLOS ONE | www.plosone.org 7 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region

S0: The full score of the all assembled haplotypes of the typed Table S3 Comparison of MHC capture SNPs and gene; Illumina 2.5 M genotyping alleles for sample YH. We S1: The score of the closed type; classified both the MHC capture alleles and the alleles that were S2: The score of the final type. called by genotyping into three categories: (1) Hom ref On the other hand, all the discordant information recorded in (homozygotes where both alleles are identical to the reference); the process of ‘error correction’ was judged using variant calling (2) Hom mut (homozygotes where both alleles differ from the methods to check whether there is a novel SNP or InDel reference); (3) Het ref (heterozygotes where only one allele is comparing to an extant type. If so, the relationship of the novel identical to the reference); The number of MHC capture variant to the given type was also described. sequencing sites that are consistent with genotyping at both alleles, at one allele, or that are inconsistent at both alleles were Data Simulation and HLA Typing categorized as 2, 1, and 0, respectively. Basing on the sequence of eight MHC haplotypes given by the (XLSX) MHC haplotype project in 2008, we randomly simulated one Table S4 Comparison of MHC capture genotypes and diploid sample’s MHC region sequence data using wgsim (v 0.2.3) provided by the Samtools package. First, we randomly selected HapMap genotypes for sample NA18532. two from the eight haploids, at various sequencing depths and (XLSX) various sequencing error rates using parameters –e and –N. The Table S5 Comparison of MHC capture genotypes and insertion size was set to 500 bp, which was the same as our target HapMap genotypes for sample NA18555. capture sequencing data, and the SD was 10. For each of the given (XLSX) conditions, we simulated 36 human MHC region data sets. We then worked out each simulated sample’s 26 HLA gene’s type Table S6 The accuracy of HLA typing method using result, using the method we describe above. The statistical simulated data at different sequencing depth with accuracy result is given in Table S6. different sequencing error rate. (XLSX) Data Access Table S7 HLA alleles typed by PCR based Sanger The raw target capture sequencing data of MHC region for sequence method and target capture sequence method. sample YH, NA18526 and NA18555 has been deposited at the (XLSX) NCBI Sequence Read Archive (SRA) under accession no. SRA065846. The HLA typing pipeline is implemented in Perl Table S8 Coverage of the MHC region using 200 bp script and can be freely downloaded at http://soap.genomics.org. insert size library. cn/SOAP-HLA.html. (XLSX)

Supporting Information Author Contributions Conceived and designed the experiments: XX J. Wang YL. Performed the Table S1 Coverage of whole gene body and exome of the experiments: J. Wu HJ Xiao Liu JZ WL. Analyzed the data: HC J. Wu YW MHC genes by capture sequencing data. TZ YX DL PG Xiaomin Liu FY XT DC. Wrote the paper: J. Wu. (XLSX) Designed and produced the probe: HC HJ BG MD’A TR. Developed Table S2 Distribution of variations within different HLA typing pipeline: HC TZ. Assisted in writing manuscript: YW TZ. genomics functional regions for three samples. Revised the manuscript: HC HJ Xiao Liu YS LCAMT. (XLSX)

References 1. The MHC Sequencing Consortium (1999) Complete sequence and gene map of 10. Petersdorf EW, Malkki M, Gooley TA, Spellman SR, Haagenson MD, et al. a human major histocompatibility complex. Nature 401: 921–923. (2012) MHC-Resident Variation Affects Risks After Unrelated Donor 2. Fernando MM, Stevens CR, Walsh EC, De Jager PL, Goyette P, et al. (2008) Hematopoietic Cell Transplantation. Sci Transl Med 4: 144ra101. Defining the role of the MHC in autoimmunity: a review and pooled analysis. 11. Proll J, Danzer M, Stabentheiner S, Niklas N, Hackl C, et al. (2011) Sequence PLoS Genet 4(4): e1000024. capture and next generation resequencing of the MHC region highlights 3. Rioux JD, Goyette P, Vyse TJ, Hammarstrom L, Fernando MM, et al. (2009) potential transplantation determinants in HLA identical haematopoietic stem Mapping of multiple susceptibility variants within the MHC region for 7 cell transplantation. DNA Res 18(4): 201–210. immune-mediated diseases. Proc Natl Acad Sci USA 106(44): 18680–18685. 12. Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, et al. (2009) Solution 4. Trowsdale J (2011) The MHC, disease and selection. Immunol Lett 137: 1–8. hybrid selection with ultra-long oligonucleotides for massively parallel targeted 5. Stewart CA, Horton R, Allcock RJ, Ashurst JL, Atrazhev AM, et al. (2004) sequencing. Nat Biotechnol 27(2): 182–189. Complete MHC haplotype sequencing for common disease gene mapping. 13. Shearer AE, DeLuca AP, Hildebrand MS, Taylor KR, Gurrola J, et al. (2010) Genome Res 14(6): 1176–1187. Comprehensive genetic testing for hereditary hearing loss using massively 6. Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ, et al. (2008) Variation parallel sequencing. Proc Natl Acad Sci U S A 107(49): 21104–21109. analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype 14. Pruitt KD, Tatusova T, Maglott DR (2007) NCBI reference sequences (RefSeq): Project. Immunogenetics 60(1): 1–18. a curated non-redundant sequence database of genomes, transcripts and 7. Morishima Y, Sasazuki T, Inoko H, Juji T, Akaza T, et al. (2002) The clinical proteins. Nucleic Acids Res 35(Database issue): D61–65. significance of human leukocyte antigen (HLA) allele compatibility in patients 15. The International HapMap Consortium (2007) A second generation human receiving a marrow transplant from serologically HLA-A, HLA–B, and HLA- haplotype map of over 3.1 million SNPs. Nature 449(7164): 851–861. DR matched unrelated donors. Blood 99(11): 4200–4206. 16. Dunckley H (2012) HLA typing by SSO and SSP methods. Methods Mol Biol 8. Lee SJ, Klein J, Haagenson M, Baxter-Lowe LA, Confer DL, et al. (2007) High- 882: 9–25. resolution donor-recipient HLA matching contributes to the success of unrelated 17. Erlich RL, Jia X, Anderson S, Banks E, Gao X, et al. (2011) Next-generation donor marrow transplantation. Blood 110(13): 4576–4583. sequencing for HLA typing of class I loci. BMC Genomics 12: 42. 9. Bray RA, Hurley CK, Kamani NR, Woolfrey A, Muller C, et al. (2008) National 18. de Bakker PI, McVean G, Sabeti PC, Miretti MM, Green T, et al. (2006) A marrow donor program HLA matching guidelines for unrelated adult donor high-resolution HLA and SNP haplotype map for disease association studies in hematopoietic cell transplants. Biol Blood Marrow Transplant 14(9 Suppl): 45– the extended human MHC. Nat Genet 38(10): 1166–1172. 53. 19. Horton R, Wilming L, Rand V, Lovering RC, Bruford EA, et al. (2004) Gene map of the extended human MHC. Nat Rev Genet 5: 889–899.

PLOS ONE | www.plosone.org 8 July 2013 | Volume 8 | Issue 7 | e69388 An Integrated Tool to Study MHC Region

20. Wang J, Wang W, Li R, Li Y, Tian G, et al. (2008) The diploid genome 24. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. (2010) The sequence of an Asian individual. Nature 456(7218): 60–65. Genome Analysis Toolkit: a MapReduce framework for analyzing next- 21. The International HapMap Consortium (2003) The International HapMap generation DNA sequencing data. Genome Res 20(9): 1297–1303. Project. Nature 426(6968): 789–796. 25. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, et al. (2011) A 22. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows- framework for variation discovery and genotyping using next-generation DNA Wheeler transform. Bioinformatics 25: 1754–1760. sequencing data. Nat Genet 43(5): 491–498. 23. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078–2079.

PLOS ONE | www.plosone.org 9 July 2013 | Volume 8 | Issue 7 | e69388 Paper III

Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology

By Hongzhi Cao1,3,4, Alex R Hastie2, Dandan Cao1,3, Ernest T Lam2, Yuhui Sun1,5, Haodong Huang1,5, Xiao Liu1,4, Liya Lin1,5, Warren Andrews2, Saki Chan2, Shujia Huang1,5, Xin Tong1, Michael Requa2, Thomas Anantharaman2, Anders Krogh4, Huanming Yang1,3, Han Cao2 and Xun Xu1,3

1BGI-Shenzhen, Shenzhen 518083, China. 2BioNano Genomics, San Diego, California 92121, USA. 3Shenzhen Key Laboratory of Transomics Biotechnologies, Shenzhen 518083, China. 4Department of Biology, University of Copenhagen, Copenhagen 2200, Denmark. 5School of Bioscience and Biotechnology, South China University of Technology, Guangzhou 511400, China.

Publication details Published in GigaScience 3:34 (2014)

Cao et al. GigaScience 2014, 3:34 http://www.gigasciencejournal.com/content/3/1/34

RESEARCH Open Access Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology Hongzhi Cao1,3,4†, Alex R Hastie2†, Dandan Cao1,3†, Ernest T Lam2†, Yuhui Sun1,5, Haodong Huang1,5, Xiao Liu1,4, Liya Lin1,5, Warren Andrews2, Saki Chan2, Shujia Huang1,5, Xin Tong1, Michael Requa2, Thomas Anantharaman2, Anders Krogh4, Huanming Yang1,3, Han Cao2* and Xun Xu1,3*

Abstract Background: Structural variants (SVs) are less common than single nucleotide polymorphisms and indels in the population, but collectively account for a significant fraction of genetic polymorphism and diseases. Base pair differences arising from SVs are on a much higher order (>100 fold) than point mutations; however, none of the current detection methods are comprehensive, and currently available methodologies are incapable of providing sufficient resolution and unambiguous information across complex regions in the human genome. To address these challenges, we applied a high-throughput, cost-effective genome mapping technology to comprehensively discover genome-wide SVs and characterize complex regions of the YH genome using long single molecules (>150 kb) in a global fashion. Results: Utilizing nanochannel-based genome mapping technology, we obtained 708 insertions/deletions and 17 inversions larger than 1 kb. Excluding the 59 SVs (54 insertions/deletions, 5 inversions) that overlap with N-base gaps in the reference assembly hg19, 666 non-gap SVs remained, and 396 of them (60%) were verified by paired-end data from whole-genome sequencing-based re-sequencing or de novo assembly sequence from fosmid data. Of the remaining 270 SVs, 260 are insertions and 213 overlap known SVs in the Database of Genomic Variants. Overall, 609 out of 666 (90%) variants were supported by experimental orthogonal methods or historical evidence in public databases. At the same time, genome mapping also provides valuable information for complex regions with haplotypes in a straightforward fashion. In addition, with long single-molecule labeling patterns, exogenous viral sequences were mapped on a whole-genome scale, and sample heterogeneity was analyzed at a new level. Conclusion: Our study highlights genome mapping technology as a comprehensive and cost-effective method for detecting structural variation and studying complex regions in the human genome, as well as deciphering viral integration into the host genome. Keywords: Genome mapping, Structural variation, Repeat units, Epstein-Barr virus (EBV) integration

* Correspondence: [email protected]; [email protected] †Equal contributors 2BioNano Genomics, San Diego, California 92121, USA 1BGI-Shenzhen, Shenzhen 518083, China Full list of author information is available at the end of the article

© 2014 Cao et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Cao et al. GigaScience 2014, 3:34 Page 2 of 11 http://www.gigasciencejournal.com/content/3/1/34

Background power for comprehensiveness and are heavily biased against A structural variant (SV) is generally defined as a region repeats and duplications because of short-read mapping of DNA 1 kb and larger in size that is different with ambiguity and assembly collapse [9,10,26]. David C. respect to another DNA sample [1]; examples include Schwartz’s group promoted optical mapping [27] as an inversions, translocations, deletions, duplications and in- alternative to detect SVs along the genome with restriction sertions. Deletions and duplications are also referred to mapping profiles of stretched DNA, highlighting the use as copy number variants (CNVs). SVs have proven to be of long single-molecule DNA maps in genome analysis. an important source of human genetic diversity and However, since the DNA is immobilized on glass surfaces disease susceptibility [2-6]. Base pair differences arising and stretched, the technique suffers from low throughput from SVs occur on a significantly higher order (>100 and non-uniform DNA stretching, resulting in imprecise fold) than point mutations [7,8], and data from the 1000 DNA length measurement and high error rate, hindering Genomes Project show population-specific patterns of its utility and adoption [24,27-29]. Thus, an effective SV prevalence [9,10]. Also, recent studies have firmly method to help detect comprehensive SVs and reveal established that SVs are associated with a number of complex genomic regions is needed. human diseases ranging from sporadic syndromes and The nanochannel-based genome mapping technology, Mendelian diseases to common complex traits, particu- commercialized as the “Irys” platform, automatically im- larly neurodevelopmental disorders [11-13]. Chromosomal ages fluorescently labeled DNA molecules in a massively aneuploidies, such as trisomy 21 and monosomy X have parallel nanochannel array, and was introduced as an ad- long been known to be the cause of Down’s and Turner vanced technology [30] compared to other restriction syndromes, respectively. A microdeletion at 15q11.2q12 mapping methods because of high-throughput data col- has been shown causal for Prader-Willi syndrome [14], lection and its robust and highly uniform linearization of and many submicroscopic SV syndromes have been re- DNA in nanochannels. This technology has previously vealed since then [15]. In addition, rare, large de novo been described and used to map the 4.7-Mb highly vari- CNVs were identified to be enriched in autism spectrum able human major histocompatibility complex (MHC) disorder (ASD) cases [16], and other SVs were described region [31], as well as for de novo assembly of a 2.1-Mb as contributing factors for other complex traits including region in the highly complex Aegilops tauschii genome cancer, schizophrenia, epilepsy, Parkinson’s disease and [32], lending great promise for use in complete genome immune diseases, such as psoriasis (reviewed in [11] and sequence analysis. Here, we apply this rapid and high- [12]). With the increasing recognition of the important throughput genome mapping method to discern genome role of genomic aberrations in disease and the need wide SVs, as well as explore complex regions based on for improved molecular diagnostics, comprehensive the YH (first Asian genome) [33] cell line. The workflow characterization of these genomic SVs is vital for, not only for mapping a human genome on Irys requires no library differentiating pathogenic events from benign ones, but construction; instead, whole genomic DNA is labeled, also for rapid and full-scale clinical diagnosis. stained and directly loaded into nanochannels for im- While a variety of experimental and computational aging. With the current throughput, one can collect approaches exist for SV detection, each has its distinct enough data for de novo assembly of a human genome biases and limitations. Hybridization-based approaches in less than three days. Additionally, comprehensive SV [17-19] are subject to amplification, cloning and detection can be accomplished with genome mapping hybridization biases, incomplete coverage, and low dy- alone, without the addition of orthogonal technologies namic range due to hybridization saturation. Moreover, or multiple library preparations. Utilizing genome map- detection of CNV events by these methods provides no ping, we identified 725 SVs including insertions/dele- positional context, which is critical to deciphering their tions, inversions, as well as SVs involved in N-base gap functional significance. More recently, high-throughput regions that are difficult to assess by current methods. next generation sequencing (NGS) technologies have been For 50% of these SVs, we detected a signal of variation by heavily applied to genome analysis based on alignment/ re-sequencing and an additional 10% by fosmid sequence- mapping [20-22] or de novo sequence assembly (SA) [23]. based de novo assembly whereas the remainder had no Mapping methods include paired-end mapping (PEM) signal by sequencing, hinting at the intractability of detec- [20], split-read mapping (SR) [21] and read depth analysis tion by sequencing. Detailed analyses showed most of the (RD) [22]. These techniques can be powerful, but are tedi- undetected SVs (80%, 213 out of 270) could be found ous and biased towards deletions owing to typical NGS overlapped in the Database of Genomic Variant (DGV) short inserts and short reads [24,25]. De novo assembly database indicating their reliability. Genome mapping methods are more versatile and can detect a larger range also provides valuable haplotype information on complex of SV types and sizes (0 ~ 25 kb) by pair-wise genome regions, such as MHC, killer cell Immunoglobulin-like re- comparison [23-25]. All such NGS-based approaches lack ceptor (KIR), T cell receptor alpha/beta (TRA/TRB) and Cao et al. GigaScience 2014, 3:34 Page 3 of 11 http://www.gigasciencejournal.com/content/3/1/34

immunoglobulin light/heavy locus (IGH/IGL), which can molecules have, on average, one label every 9 kb and were help determine these hyper-variable regions’ sequences up to 1 Mb in length. A total of 932,855 molecules larger and downstream functional analyses. In addition, with than 150 kb were collected for a total length of 223 Gb long molecule labeling patterns, we were able to accurately (~70-fold average depth) (Table 1). Molecules can be map the exogenous virus’ sequence that integrated into aligned to a reference to estimate the error rates in the sin- the human genome, which is useful for the study of the gle molecules. Here, we estimated the missing label rate is mechanism of how virus sequence integration leads to 10%,andtheextralabelrateis17%.Mostoftheerrorasso- serious diseases like cancer. ciated with these reference differences are averaged out in the consensus de novo assembly. Distinct genetic features Data description intractable to sequencing technologies, such as long arrays High-molecular weight DNA was extracted from the YH of tandem repeats were observed in the raw single mole- cell line, and high-quality DNA was labeled and run on cules (Additional file 1: Figure S1). the Irys system. After excluding DNA molecules smaller than 100 kb for analysis, we obtained 303 Gb of data giv- De novo assembly of genome maps from single-molecule ing 95× depth for the YH genome (Table 1). For subse- data quent analyses, only molecules larger than 150 kb (223 Single molecules were assembled de novo into consensus Gb, ~70X) were used. De novo assembly resulted in a set genome maps using an implementation of the overlap- of consensus maps with an N50 of 1.03 Mb. We per- layout-consensus paradigm [37]. An overlap graph was formed “stitching” of neighboring genome maps that constructed by an initial pairwise comparison of all mol- were fragmented by fragile sites associated with nick ecules >150 kb, by pattern matching using commercial sites immediately adjacent to each other. After fragile software from BioNano Genomics. Thresholds for the site stitching, the N50 improved to 2.87 Mb, and the as- alignments were based on a p-value appropriate for the sembly covered 93.0% of the non-N base portion of the genome size (thresholds can be adjusted for different human genome reference assembly hg19. Structural vari- genome sizes and degrees of complexity) to prevent ation was classified as a significant discrepancy between spurious edges. This graph was used to generate a draft the consensus maps and the hg19 in silico map. Further consensus map set that was improved by alignment of analyses were performed for highly repetitive regions, single molecules and recalculation of the relative label complex regions and Epstein-Barr virus (EBV) integra- positions. Next, the consensus maps were extended by tion. Supporting data is available from the GigaScience aligning overhanging molecules to the consensus maps database, GigaDB [34-36]. and calculating a consensus in the extended regions. Finally, the consensus maps were compared and merged Analyses where patterns matched (Figure 1). The result of this de Generation of single-molecule sequence motif maps novo assembly is a genome map set entirely independent Genome maps were generated for the YH cell line by puri- of known reference or external data. In this case, YH was fying high-molecular weight DNA in a gel plug and label- assembled with an N50 of 1.03 Mb in 3,565 maps and an ing at single-strand nicks created by the Nt.BspQI nicking N50 of 2.87 Mb in 1,634 maps after stitching fragile sites endonuclease. Molecules were then linearized in nano- (Additional file 1: Figure S2 and Additional file 1: Table channel arrays etched in silicon wafers for imaging [31,32]. S1). These genome maps define motif positions that occur From these images, a set of label locations on each DNA on every 9 kb on average, and these label site positions molecule defined an individual single-molecule map. Single have a resolution of 1.45 kb. The standard deviation for interval measurements between two labels varies with Table 1 Molecule collection statistics under different length. For example, for a 10 kb interval, the standard de- length thresholds viation (SD) is 502 bp, and for a 100 kb interval, it is Length No. Molecules Total length (Gb) Estimated 1.2 kb. Consensus genome maps were aligned to an in cutoff (kb) depth (X)* silico Nt.BspQI sequence motif map of hg19. Ninety-nine 100 1,568,969 303 95 percent of the genome maps could align to hg19 and 120 1,313,832 275 86 they overlap 93% of the non-gap portion of hg19. 150 932,855 223 70 Structural variation analysis 180 659,149 179 56 Using the genome map assembly as input, we performed 200 523,146 153 48 structural variation detection (Figure 1), and the genome 300 170,432 68 21 maps were compared to hg19. Strings of intervals be- 500 22,944 14 4 tween labels/nick motifs were compared and when they *Estimated depth based on 3.2 Gb genome size. diverged, an outlier p-value was calculated and SVs were Cao et al. GigaScience 2014, 3:34 Page 4 of 11 http://www.gigasciencejournal.com/content/3/1/34

Figure 1 Flowchart of consensus genome map assembly and structural variant discovery using genome mapping data. called at significant differences (See Methods for details), insertions/deletions (Figure 2) while 12 were inversions generating a list of 725 SVs including 59 that overlapped (Additional file 2, Spreadsheet 1 & 2). Out of the 654 in- with N-base gaps in hg19 (Additional file 2, Spreadsheet sertions/deletions, 503 were defined as insertions and 3). Based on the standard deviation of interval measure- 151 were deletions, demonstrating an enrichment of ments, 1.5 kb is the smallest insertion or deletion that insertions for this individual with respect to the hg19 can be confidently measured for an interval of about reference (Figure 2). Of the 59 SV events that span N- 10 kb if there is no pattern change. However, if label pat- gap regions, 5 of them were inversions. Of the remaining terns deviate from the reference, SVs with a net size dif- 54 events, 51 were estimated to be shorter than indicated ference less than 1.5 kb can be detected. Additional file and 3 longer. These gap-region related SVs indicate a spe- 1: Figure S1 shows three mapping examples (one dele- cific structure of gap regions of the YH genome compared tion, one insertion, and one inversion) of gap region SVs. to the hg19 reference. We present these 59 events separately although technic- In order to validate our SVs, we first cross-referenced ally, in those cases, genome mapping detected structural them with the public SV database DGV (http://dgv.tcag. differences between the genome maps and reference re- ca/dgv/app/home) [38]. For each query SV, we required gions. For the remaining 666 SVs, 654 of them were 50% overlap with records in DGV. We found that the

Figure 2 Size distribution of total detected large insertions (green) and deletions (purple) using genome mapping. The comparative histogram bars in red and blue respectively represent deletions and insertions supported by NGS. NGS: next-generation sequencing. Cao et al. GigaScience 2014, 3:34 Page 5 of 11 http://www.gigasciencejournal.com/content/3/1/34

majority of the SVs (583 out of 666; 87.5%) could be variation as described in the current reference assembly. found (Additional file 2, Spreadsheet 1 & 2), confirming Furthermore, we have found a very large peak of approxi- their reliability. Next, we applied the NGS discordant mately 2.5-kb repeats in YH (male, 691 copies) but not in paired-end mapping and read depth-based methods, as NA19878 (female, 36 copies; Figure 3). This was further well as fosmid-based de novo assembly (See Methods for supported by additional genome mapping in other males detail), and as a result, detected an SV signal in 396 and females demonstrating a consistent and significant (60%, Figure 2) out of 666 SVs by at least one of the two quantity of male-specific repeats of 2.5 kb (unpublished). methods (Figure 2, Additional file 2, Spreadsheet 1 & 2). As an example, Additional file 1: Figure S3 shows a raw For the remaining 270 SVs, 79% (213 out of 270, Additional image of an intact long molecule of 630 kb with two tracts file 2, Spreadsheet 1 & 2) were found in the DGV database. of at least 53 copies and at least 21 copies of 2.5-kb tan- Overall, 91% (609 out of 666, Additional file 2, Spreadsheet dem repeats (each 2.5-kb unit has one nick label site, cre- 1 & 2) of SVs had supporting evidence by retrospectively ating the evenly spaced pattern) physically linked by applied sequencing-based methods or database entries. another label-absent putative tandem repeat spanning We wanted to determine if SVs revealed by genome over 435 kb, and Additional file 1: Figure S4 shows con- mapping, but without an NGS supported signal, had vincing mapping information. Unambiguously elucidating unique properties. We firstly investigated the distribu- the absolute value and architecture of such complex re- tion of NGS-supported SVs and NGS-unsupported SVs peat regions is not possible with other short fragment or in repeat-rich and segmental duplication regions. How- hybridization-based methods. ever we did not find significant differences between them (data not shown) which was in concordance with Complex region analysis using genome mapping previous findings [27]. We also compared the distribu- Besides SV detection, genome mapping data also provide tion of insertions and deletions of different SV categories abundant information about other complex regions in and found that SV events that were not supported by se- the genome. For complex regions that are functionally quencing evidence were 97% (260 out of 268) insertions; important, an accurate reference map is critical for precise in contrast, the SVs that were supported by sequencing evi- sequence assembly and integration for functional analysis dence were only 61% (243 out of 396, Figure 2, Additional [40-43]. We analyzed the structure of some complex file 2, Spreadsheet 1) insertions showing insertion en- regions of the human genome. They include MHC richment (p = 2.2e-16 Chi-squared test, Figure 2) in also called Human leukocyte antigen (HLA), KIR, SVs without sequencing evidence. In addition, we fur- IGL/IGH, as well as TRA/TRB [44-48]. In the highly ther investigated the novel 57 SVs without either se- variable HLA-A and –C loci, the YH genome shared quencing evidence or database supporting evidence. one haplotype with the previously typed PGF genome We found that the genes they covered had important (used in hg19) and also revealed an Asian/YH-specific functions, such as ion binding, enzyme activating and variant on maps 209 and 153 (Additional file 1: Figure so forth, indicating their important role in cellular bio- S5), respectively. In the variant haplotype (Map ID chemical activities. Some of the genes like ELMO1, 153), there is a large insertion at the HLA-A locus HECW1, SLC30A8, SLC16A12, JAM3 are reported to while at the HLA-D and RCCX loci, YH had an Asian/ be associated with diseases like diabetic nephropathy, YH-specific insertion and a deletion. In addition to the lateral sclerosis, diabetes mellitus and cataracts [39], MHC region, we also detected Asian/YH-specific struc- providing valuable foundation for clinical application tural differences in KIR (Additional file 1: Figure S6), (Additional file 2, Spreadsheet 1 & 2). IGH/IGL (Additional file 1: Figure S7), and TRA/TRB (Additional file 1: Figure S8), compared to the reference Highly repetitive regions of the human genome genome. Highly repetitive regions of the human genome are known to be nearly intractable by NGS because short External sequence integration detection using genome reads are often collapsed, and these regions are often re- mapping fractory to cloning. We have searched for and analyzed External viral sequence integration detection is import- one class of simple tandem repeats (unit size ranging ant for the study of diseases such as cancer, but current from 2-13 kb) in long molecules derived from the ge- high-throughput methods are limited in discovering nomes of YH (male) and CEPH-NA12878 (female). The integration break points [49-51]. Although fiber fluores- frequencies of these repeating units from both genomes cence in situ hybridization (FISH) was used to discrimin- were plotted in comparison with hg19 (Figure 3). We ate between integration and episomal forms of virus found repeat units across the entire spectrum of sizes in utilizing long dynamic DNA molecules [52], this method YH and NA12878 while there were only sporadic peaks in was laborious, low-resolution and low-throughput. Thus, hg19, implying an under representation of copy number long, intact high-resolution single-molecule data provided Cao et al. GigaScience 2014, 3:34 Page 6 of 11 http://www.gigasciencejournal.com/content/3/1/34

Figure 3 A plot of repeat units in two human genomes as seen in single molecules. A repeat unit is defined as five or more equidistant labels. Total units in bins are normalized to the average coverage depth in the genome. by genome mapping allows for rapid and effective analysis structural variation and haplotype differences in the of which part of the virus sequence has been integrated human MHC region, has been adopted to capture the into the host genome and its localization. We detected genome-wide structure of a human individual in the EBV integration into the genome of the cell line sample. current study. Evidence for over 600 SVs in this indi- TheEBVvirusmapwasassembledde novo during the vidual has been provided. Despite the difficulty of SV whole genome de novo assembly of the YH cell line detection by sequencing methods, the majority of gen- genome. We mapped the de novo EBV map to in silico ome map-detected SVs were retrospectively found to maps from public databases to determine the strain that have signals consistent with the presence of an SV, was represented in the cell line. We found that the YH validating genome mapping for SV discovery. Approxi- strain was most closely related, although not identical, mately 75% of the SVs discovered by genome mapping to strain B95-8 (GenBank: V01555.2). To detect EBV in- tegration, portions of the aligned molecules extending beyond the EBV map were extracted and aligned with hg19 to determine potential integration sites (Additional file 1: Figure S9). There are 1,340 EBV integration events across the genome (Figure 4). We found that the frequency of EBV integration mapping was significantly lower than the average coverage depth (~70X), implying the DNA sample derived from a clonal cell population is potentially more diverse than previously thought, and that this method could reveal the heterogeneity of a very complex sample population at the single-molecule level. Also, the integrated portion of the EBV genome sequence was detected with a larger fraction towards the tail (Additional file 1: Figure S10). Besides integration events, we also found EBV episome molecules whose single-molecule map could be mapped to the EBV genome, free of flanking human genomic regions.

Discussion Structural variants are increasingly frequently shown to play important roles in human health. However, available technologies, such as array-CGH, SNP array and NGS Figure 4 Circos plot of distribution of integration events are incapable of cataloguing them in a comprehensive throughout YH genome. Thegenomewasdividedinto and unbiased manner. Genome mapping, a technology non-overlapping windows of 200 kb. The number of molecules with successfully applied to the assembly of complex evidence of integration in each window is plotted with each concentric grey circle representing a two-fold increment in virus detection. regions of a plant genome and characterization of Cao et al. GigaScience 2014, 3:34 Page 7 of 11 http://www.gigasciencejournal.com/content/3/1/34

were insertions; this interesting phenomenon may be a PCR steps or NGS approaches [50]. All in all, we dem- method bias or a genuine representation of the additional onstrated advantages and strong potential of the genome content in this genome of Asian descent that is not mapping technology based on nanochannel arrays to present in hg19, which was compiled based on genomic help overcome problems that have severely limited our materials presumably derived from mostly non-Asians. understanding of the human genome. Analysis of additional genomes is necessary for compari- In addition to the advantages this study reveals about son. Insertion detection is refractory to many existing the genome mapping technology, aspects that need to be methodologies [24,25], so to some extent, genome map- improved also are highlighted. As genome mapping tech- ping revealed its distinct potential to address this chal- nology generates sequence-specific motif-labeled DNA lenge. Furthermore, functional annotation results of the molecules and analyzes these motif maps using an overlap- detected SVs show that 30% of them (Additional file 2, layout-consensus algorithm, subsequent performance and Spreadsheet 1 & 2) affect exonic regions of relevant genes resolution largely depends on motif density (any individual which may cause severe effects on gene function. Gene event endpoints can only be resolved to the nearest restric- ontology (GO) analysis demonstrates that these SVs are tion sites). For example, the EBV integration analysis in this associated with genes that contribute to important bio- study was more powerful in the high-density regions logical processes (Additional file 2, Spreadsheet 1 & 2 and (Additional file 1: Figure S10). Hence, higher density label- Additional file 1: Figure S11), reflecting that the SVs de- ing methods to increase the information density that may tected here are likely to affect a large number of genes and promote even higher accuracy and unbiased analysis of ge- may have a significant impact on human health. Genome nomes are currently being further developed. When data mapping provides us with an effective way to study the from genome mapping is combined with another source of impact of genome-wide SV on human conditions. Some information, one can achieve even higher resolution for N-base gaps are estimated to have longer or shorter length each event. In addition, reducing random errors like extra or more complex structurally compared to hg19, demon- restriction sites, missing restriction sites and size measure- strating that genome mapping is useful for improving the ment is important for subsequent analysis. Finally, improve- human and other large genome assemblies. We also ments to the SV detection algorithm will provide further present a genome-wide analysis of short tandem repeats discovery potential, and balanced reciprocal translocations in individual human genomes and structural information can be identified in genome maps generated from cancer and differences for some of the most complex regions in model genomes (personal communication, Michael Rossi). the YH genome. Independent computational analysis has The throughput and speed of a technology remains been performed to discern exogenous viral insertions, as one of the most important factors for routine use in well as exogenous episomes. All of these provide invaluable clinical screening as well as scientific research. At the insights into the capacity of genome mapping as a promis- time of manuscript submission, genome mapping of a ing new strategy for research and clinical application. human individual could be accomplished with fewer The basis for the genome mapping technology that en- than three nanochannel array chips in a few days. It is ables us to effectively address shortcomings of existing anticipated that a single nanochannel chip would cover methodologies is the use of motif maps derived from ex- a human size genome in less than one day within tremely long DNA molecules hundreds of kb in length. 6 months, facilitating new studies aimed at unlocking Using these motif maps, we are able to also access the inaccessible portions of the genome. In this way, challenging loci where existing technologies fail. Firstly, genome mapping has an advantage over the use of mul- global structural variations were easily and quickly de- tiple orthogonal methods that are often used to detect tected. Secondly, evidence for a deletion bias which is global SVs. Thus, it is now feasible to conduct large commonly observed with both arrays and NGS technol- population-based comprehensive SV studies efficiently ogy, is absent in genome mapping. In fact, we observe on a single platform. more insertions than deletions in this study. Thirdly, for the first time, we are able to measure the length of re- Methods gions of the YH genome that represent gaps in the High-molecular weight DNA extraction human reference assembly. Fourthly, consensus maps High-molecular weight (HMW) DNA extraction was could be assembled in highly variable regions in the YH performed as recommended for the CHEF Mammalian genome which are important for subsequent functional Genomic DNA Plug Kit (BioRad #170-3591). Briefly, analysis. Finally, both integrated and un-integrated EBV cells from the YH or NA12878 cell lines were washed molecules are identified, and potential sub-strains differ- with 2x with PBS and resuspended in cell resuspension entiated, and the EBV genome sequence that integrated buffer, after which 7.5 × 105 cells were embedded in into the host genome was obtained directly. This infor- each gel plug. Plugs were incubated with lysis buffer and mation was previously inaccessible without additional proteinase K for four hours at 50°C. The plugs were Cao et al. GigaScience 2014, 3:34 Page 8 of 11 http://www.gigasciencejournal.com/content/3/1/34

washed and then solubilized with GELase (Epicentre). contained no more than 5 unaligned end labels. If these The purified DNA was subjected to four hours of criteria were satisfied, the two genome maps would be drop dialysis (Millipore, #VCWP04700) and quantified joined together with the intervening label patterns taken using Nanodrop 1000 (Thermal Fisher Scientific) and/ from the reference in silico map. or the Quant-iT dsDNA Assay Kit (Invitrogen/Mo- lecular Probes). Structural variation detection DNA labeling Alignments between consensus genome maps and the DNA was labeled according to commercial protocols hg19 in silico sequence motif map were obtained using a using the IrysPrep Reagent Kit (BioNano Genomics, dynamic programming approach where the scoring Inc). Specifically, 300 ng of purified genomic DNA was function was the likelihood of a pair of intervals being nicked with 7 U nicking endonuclease Nt.BspQI (New similar [53]. Likelihood is calculated based on a noise England BioLabs, NEB) at 37°C for two hours in NEB model which takes into account fixed sizing error, sizing Buffer 3. The nicked DNA was labeled with a error which scales linearly with the interval size, mis- fluorescent-dUTP nucleotide analog using Taq polymer- aligned sites (false positives and false negatives), and ase (NEB) for one hour at 72°C. After labeling, the nicks optical resolution. Within an alignment, an interval or were ligated with Taq ligase (NEB) in the presence of range of intervals whose cumulative likelihood for dNTPs. The backbone of fluorescently labeled DNA was matching the reference map is worse than 0.01 percent stained with YOYO-1 (Invitrogen). chance is classified as an outlier region. If such a region − occurs between highly scoring regions (p-value of 10e 6), Data collection an insertion or deletion call is made in the outlier re- The DNA was loaded onto the nanochannel array of gion, depending on the relative size of the region on the BioNano Genomics IrysChip by electrophoresis of DNA. query and reference maps. Inversions are defined if adja- Linearized DNA molecules were then imaged automatic- cent match-groups between the genome map and refer- ally followed by repeated cycles of DNA loading using ence are in reverse relative orientation. the BioNano Genomics Irys system. The DNA molecules backbones (YOYO-1 stained) and locations of fluorescent labels along each molecule were Signals refined by re-sequencing and de novo assembly detected using the in-house software package, IrysView. based methods The set of label locations of each DNA molecule defines In order to demonstrate the capacity of genome mapping an individual single-molecule map. for the detection of large SVs, we tested the candidate SVs using whole-genome paired-end 100 bp sequencing De novo genome map assembly (WGS) data with insert sizes of 500 bp and fosmid se- Single-molecule maps were assembled de novo into con- quence based de novo assembly result. SVs were tested sensus maps using software tools developed at BioNano based on the expectation that authentic SVs would be Genomics. Briefly, the assembler is a custom implemen- supported by abnormally mapped read pairs, and that tation of the overlap-layout-consensus paradigm with a deletions with respect to the reference should have lower maximum likelihood model. An overlap graph was gen- mapped read depth than average [20,22,23]. We per- erated based on pairwise comparison of all molecules as formed single-end/(paired-end + single-end) reads ratio input. Redundant and spurious edges were removed. (sp ratio) calculations at the whole-genome level to assign The assembler outputs the longest path in the graph and an appropriate threshold for abnormal regions as well as consensus maps were derived. Consensus maps are fur- depth coverage. We set sp ratio and depth cutoff thresh- ther refined by mapping single-molecule maps to the olds based on the whole genome data to define SV signals. consensus maps and label positions are recalculated. Re- Insertions with aberrant sp ratio and deletions with either fined consensus maps are extended by mapping single sp ratio or abnormal depth were defined to be a supported molecules to the ends of the consensus and calculating candidate. label positions beyond the initial maps. After merging of We also utilized fosmid-based de novo assembly data overlapping maps, a final set of consensus maps was to search for signals supporting candidate SVs. We used generated and used for subsequent analysis. Further- contigs and scaffolds assembled from short reads to more, we applied a “stitching” procedure to join neigh- check for linearity between a given assembly and hg19 boring genome maps. Two adjacent genome maps using LASTZ [54]. WGS-based and fosmid-based SV would be joined together if the junction a) was within validation showed inconsistency and/or lack of satur- 50 kb apart, b) contained at most 5 labels, c) contained, ation as each supported unique variants (Additional file or was within 50 kb from, a fragile site, and d) also 1: Figure S2) [24]. Cao et al. GigaScience 2014, 3:34 Page 9 of 11 http://www.gigasciencejournal.com/content/3/1/34

EBV integration detection Acknowledgements Single-molecule maps were aligned with a map generated We wish to recognize BGI-Shenzhen’s sequencing platform for generating the data in this study. We thank the faculty and staff at BGI-Shenzhen who in silico based on the EBV reference sequence (strain B95- contributed to this project. We also thank everyone at BioNano Genomics 8; GenBank: V01555.2). Portions of the aligned molecules who contributed to the development of the Irys system and the team’s extending beyond the EBV map were extracted and helpful discussion and analysis. This work was supported by the State Key Development Program for Basic Research of China—973 Program aligned with hg19 to determine potential integration sites. (NO.2011CB809202); the National High Technology Research and Development Program of China - 863 Program (NO.2012AA02A201); the Shenzhen Municipal Government of China (NO.JC201005260191A); Shenzhen Key Laboratory of Availability of supporting data Transomics Biotechnologies (NO.CXB201108250096A). The data sets supporting the results of this article is Author details available in the GigaScience GigaDB, repository [55]. See 1BGI-Shenzhen, Shenzhen 518083, China. 2BioNano Genomics, San Diego, the individual GigaDB entries for the YH Bionano data California 92121, USA. 3Shenzhen Key Laboratory of Transomics [35] and YH fosmid validation data [36], which is also Biotechnologies, Shenzhen 518083, China. 4Department of Biology, University of Copenhagen, Copenhagen 2200, Denmark. 5School of Bioscience and available in the SRA [PRJEB7886]. Biotechnology, South China University of Technology, Guangzhou 511400, China.

Additional files Received: 21 May 2014 Accepted: 28 November 2014 Published: 30 December 2014 Additional file 1: Figure S1. Comparison of consensus genome maps and hg19 reference across gap regions. Figure S2. Consensus genome map coverage of human reference assembly (hg19). Figure S3. Examples References of repetitive sequence detected in intact single molecules by genome 1. Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, mapping. Figure S4. Consensus genome map compared to hg19 in a Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, long tandem repeat region. Figure S5. Consensus genome maps Lee C: Copy number variation: new insights in genome diversity. compared to hg19 in the MHC region. Figure S6. Consensus genome Genome Res 2006, 16:949–961. maps compared to hg19 in the KIR region. Figure S7. Consensus 2. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, genome maps compared to hg19 in the IGH and IGL regions. Figure S8. Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Fine-scale structural Consensus genome maps compared to hg19 in the TRA and TRB regions. variation of the human genome. Nat Genet 2005, 37:727–732. Figure S9. Single-molecule alignment to EBV in silico motif map (strain 3. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. B95-8) showing evidence of strain variation and heterogeneous integration. Nat Rev Genet 2006, 7:85–97. Figure S10. Distribution of integrated portions of the EBV genome. 4. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen Figure S11. GO annotations of genes within called SVs. Table S1. Summary N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, of consensus genome map assembly. Newman TL, Tüzün E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Additional file 2: Spreadsheet 1. Summary of insertions and deletions Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, detected in BioNano genome maps. Spreadsheet 2: Summary of Chen L, Malig M, et al: Mapping and sequencing of structural variation inversions detected in BioNano genome maps. Spreadsheet 3: Validation from eight human genomes. Nature 2008, 453:56–64. of insertions and deletions detected in BioNano genome maps. 5. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale Abbreviations copy number polymorphism in the human genome. Science 2004, Array-CGH: Array-based comparative genomic hybridization; AS: De novo 305:525–528. sequence assembly; ASD: Autism spectrum disorder; BCR: B cell receptor; 6. Itsara A, Cooper GM, Baker C, Girirajan S, Li J, Absher D, Krauss RM, Myers CNV: Copy number variant; DGV: Database of genomic variants; EBV: RM, Ridker PM, Chasman DI, Mefford H, Ying P, Nickerson DA, Eichler EE: Epstein-Barr virus; FISH: Fluorescence in situ hybridization; GO: Gene Population analysis of large copy number variants and hotspots of ontology; HLA: Human leukocyte antigen; HMW: High-molecular weight; human genetic disease. Am J Hum Genet 2009, 84:148–161. IGH: Immunoglobulin heavy locus; IGL: Immunoglobulin light locus; KIR: Killer 7. Cheng Z, Ventura M, She X, Khaitovich P, Graves T, Osoegawa K, Church D, cell immunoglobulin-like receptor; LRC: Leukocyte Receptor Complex; DeJong P, Wilson RK, Pääbo S, Rocchi M, Eichler EE: A genome-wide MHC: Major histocompatibility complex; NGS: Next-generation sequencing; comparison of recent chimpanzee and human segmental duplications. PCR: Polymerase chain reaction; PEM: Pair-end mapping; RD: Read depth; Nature 2005, 437:88–93. SNP: Single nucleotide polymorphism; SR: Split read; SV: Structural variation; 8. Lupski JR: Genomic rearrangements and sporadic disease. Nat Genet 2007, TCR: T cell receptor; TRA: T cell receptor alpha locus; TRB: T cell receptor beta 39:S43–S47. locus; WGS: Whole-genome sequencing; YH: YanHuang. 9. Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA: A map of human genome variation from population-scale sequencing. Nature 2010, 467:1061–1073. Competing interests 10. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, The authors have the following interests: Alex R. Hastie, Ernest T. Lam, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Warren Andrews, Saki Chan, Michael Requa, Thomas Anantharaman and Han Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Cao were employees of BioNano Genomics at the time of the study, and Konkel MK, Korn J, Khurana E, Kural D, Lam HY, Leng J, Li R, Li Y, Lin CY, they owned company stock options. Luo R, et al: Mapping copy number variation by population-scale genome sequencing. Nature 2011, 470:59–65. Authors’ contributions 11. Stankiewicz P, Lupski JR: Structural variation in the human genome and XX, HC and HZC managed the project. HC, HZC, ARH, DC, ETL and SC its role in disease. Annu Rev Med 2010, 61:437–455. designed the analyses. HC, ARH, ETL, W A, MR and TA produced genome 12. Girirajan S, Campbell CD, Eichler EE: Human copy number variation and mapping data. HZC, ARH, DC, ETL, YS, HH, XL, LY, WA, SC, SH, XT, MR, TA, complex genetic disease. Annu Rev Genet 2011, 45:203–226. AK and HY performed the data analyses. HZC, ARH and DC did most of the 13. Weischenfeldt J, Symmons O, Spitz F, Korbel JO: Phenotypic impact of writing with contributions from all authors. All authors read and approved genomic structural variation: insights from and for human disease. the final manuscript. Nat Rev Genet 2013, 14:125–138. Cao et al. GigaScience 2014, 3:34 Page 10 of 11 http://www.gigasciencejournal.com/content/3/1/34

14. Ledbetter DHRV, Airhart SD, Strobel RJ, Keenan BS, Crawford JD: Deletions 31. Lam ET, Hastie A, Lin C, Ehrlich D, Das SK, Austin MD, Deshpande P, Cao H, of chromosome 15 as a cause of the Prader-Willi syndrome. N Engl J Med Nagarajan N, Xiao M, Kwok PY: Genome mapping on nanochannel arrays 1981, 304:325–329. for structural variation analysis and sequence assembly. Nat Biotechnol 15. Vissers LE, Stankiewicz P: Microdeletion and microduplication syndromes. 2012, 30:771–776. Methods Mol Biol 2012, 838:29–75. 32. Hastie AR, Dong L, Smith A, Finklestein J, Lam ET, Huo N, Cao H, Kwok PY, 16. Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Deal KR, Dvorak J, Luo MC, Gu Y, Xiao M: Rapid genome mapping in Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J, nanochannel arrays for highly complete and accurate de novo sequence Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, assembly of the complex Aegilops tauschii genome. PLoS One 2013, 8: Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC, e55864. Skuse D, Geschwind DH, Gilliam TC, et al: Strong association of de novo 33. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang copy number mutations with autism. Science 2007, 316:445–449. J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang 17. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González J, et al: The diploid genome sequence of an Asian individual. Nature 2008, JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, 456:60–65. Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, 34. Sneddon TP, Li P, Edmunds SC: GigaDB: announcing the GigaScience Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, et al: Global database. biotechnology 2012, 1:11. variation in copy number in the human genome. Nature 2006, 35. Cao HZ, Hastie AR, Cao D, Lam ET, Sun Y, Huang H, Liu X, Lin L, Andrews W, 444:444–454. Chan S, Huang S, Tong X, Requa M, Anantharaman T, Krogh A, Yang H, Cao 18. McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, H, Xu X: Supporting material for: "Rapid detection of structural variation Shapero MH, de Bakker PI, Maller JB, Kirby A, Elliott AL, Parkin M, Hubbell E, in a Human genome using nanochannel-based genome mapping Webster T, Mei R, Veitch J, Collins PJ, Handsaker R, Lincoln S, Nizzari M, technology". GigaScience Database 2014, http://dx.doi.org/10.5524/100097. Blume J, Jones KW, Rava R, Daly MJ, Gabriel SB, Altshuler D: Integrated 36. Cao HZ, Wu H, Luo R, Huang S, Sun Y, Tong X, Xie Y, Liu B, Yang H, Zheng detection and population-genetic analysis of SNPs and copy number H, Li J, Li B, Wang Y, Yang F, Sun P, Liu S, Gao P, Huang H, Sun J, Chan D, variation. Nat Genet 2008, 40:1166–1174. Ha G, Huang W, Huang Z, Li Y, Tellier LCAM, Liu X, Feng Q, Xu X, Zhang X, 19. Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, et al: Supporting material for: De novo assembly of a haplotype-resolved Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler human genome. GigaScience Database 2014, http://dx.doi.org/10.5524/ EE: Segmental duplications and copy-number variation in the human 100096. genome. Am J Hum Genet 2005, 77:78–88. 37. Valouev A, Schwartz DC, Zhou S, Waterman MS: An algorithm for assembly 20. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, of ordered restriction maps from single DNA molecules. Proc Nat Acad Sci Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, USA 2006, 103:15770–15775. Mardis ER: BreakDancer: an algorithm for high-resolution mapping of 38. Macdonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW: The Database of genomic structural variation. Nat Methods 2009, 6:677–681. Genomic Variants: a curated collection of structural variation in the 21. Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC, human genome. Nucleic Acids Res 2014, 42:D986–D992. Chen K, Harris CC, Ding L, Holmfeldt L, Payne-Turner D, Fan X, Wei L, Zhao 39. McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. D, Obenauer JC, Naeve C, Mardis ER, Wilson RK, Downing JR, Zhang J: Am J Hum Genet 2007, 80:588–604. CREST maps somatic structural variation in cancer genomes with 40. Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ, Almeida J, Forbes S, base-pair resolution. Nat Methods 2011, 8:652–654. Gilbert JG, Halls K, Harrow JL, Hart E, Howe K, Jackson DK, Palmer S, Roberts 22. Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to AN, Sims S, Stewart CA, Traherne JA, Trevanion S, Wilming L, Rogers J, de discover, genotype, and characterize typical and atypical CNVs Jong PJ, Elliott JF, Sawcer S, Todd JA, Trowsdale J, Beck S: Variation analysis from family and population genome sequencing. Genome Res 2011, and gene annotation of eight MHC haplotypes: the MHC Haplotype 21:974–984. Project. Immunogenetics 2008, 60:1–18. 23. Li Y, Zheng H, Luo R, Wu H, Zhu H, Li R, Cao H, Wu B, Huang S, Shao H, 41. Tesch HMP, Krueger GR, Fischer R, Diehl V: Analysis of immunoglobulin, T Ma H, Zhang F, Feng S, Zhang W, Du H, Tian G, Li J, Zhang X, Li S, Bolund L, cell receptor and bcr rearrangements in human malignant lymphoma Kristiansen K, de Smith AJ, Blakemore AI, Coin LJ, Yang H, Wang J, Wang J: and Hodgkin's disease. Oncology 1990, 47:215–223. Structural variation in two human genomes mapped at single- 42. Rajagopalan S, Long EO: Understanding how combinations of HLA and nucleotide resolution by whole genome de novo assembly. Nature KIR genes influence disease. J Exp Med 2005, 201:1025–1029. biotechnology 2011, 29:723–730. 43. Gonzalez D, van der Burg M, Garcia-Sanz R, Fenton JA, Langerak AW, 24. Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and Gonzalez M, van Dongen JJ, San Miguel JF, Morgan GJ: Immunoglobulin genotyping. Nat Rev Genet 2011, 12:363–376. gene rearrangements and the pathogenesis of multiple myeloma. Blood 25. Zhao M, Wang Q, Jia P, Zhao Z: Computational tools for copy number 2007, 110:3112–3121. variation (CNV) detection using next-generation sequencing data: 44. Beck S, Geraghty D, Inoko H, Rowen L, Aguado B, Bahram S, Campbell RD, features and perspectives. BMC Bioinformatics 2013, 14(11):S1. Forbes SA, Guillaudeux T, Hood L: Complete sequence and gene map of a 26. Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genome human major histocompatibility complex. Nature 1999, 401:921–923. sequence assembly. Nat Methods 2011, 8:61–65. 45. Vilches C, Parham P: KIR: diverse, rapidly evolving receptors of innate and 27. Teague B, Waterman MS, Goldstein S, Potamousis K, Zhou S, Reslewic S, adaptive immunity. Annu Rev Immunol 2002, 20:217–251. Sarkar D, Valouev A, Churas C, Kidd JM, Kohn S, Runnheim R, Lamers C, 46. Katzmann JA, Clark RJ, Abraham RS, Bryant S, Lymp JF, Bradwell AR, Kyle RA: Forrest D, Newton MA, Eichler EE, Kent-First M, Surti U, Livny M, Schwartz Serum reference intervals and diagnostic ranges for free kappa and free DC: High-resolution human genome structure by single-molecule lambda immunoglobulin light chains: relative sensitivity for detection of analysis. Proc Natl Acad Sci U S A 2010, 107:10848–10853. monoclonal light chains. Clin Chem 2002, 48:1437–1444. 28. Dong Y, Xie M, Jiang Y, Xiao N, Du X, Zhang W, Tosser-Klopp G, Wang J, 47. Tomlinson IM, Cook GP, Walter G, Carter NP, Riethman H, Buluwela L, Yang S, Liang J, Chen W, Chen J, Zeng P, Hou Y, Bian C, Pan S, Li Y, Liu X, Rabbitts TH, Winter G: A complete map of the human immunoglobulin Wang W, Servin B, Sayre B, Zhu B, Sweeney D, Moore R, Nie W, Shen Y, VH locus. Ann N Y Acad Sci 1995, 764:43–46. Zhao R, Zhang G, Li J, Faraut T, et al: Sequencing and automated 48. Haynes MR, Wu GE: Gene discovery at the human T-cell receptor alpha/ whole-genome optical mapping of the genome of a domestic goat delta locus. Immunogenetics 2007, 59:109–121. (Capra hircus). Nat Biotechnol 2013, 31:135–141. 49. Li WZX, Lee NP, Liu X, Chen S, Guo B, Yi S, Zhuang X, Chen F, Wang G, 29. Levy-Sakin M, Ebenstein Y: Beyond sequencing: optical mapping of DNA Poon RT, Fan ST, Mao M, Li Y, Li S, Wang J, Jianwang, Xu X, Jiang H, Zhang in the age of nanotechnology and nanoscopy. Curr Opin Biotechnol 2013, X: HIVID: an efficient method to detect HBV integration using low 24:690–698. coverage sequencing. Genomics 2013, 102:338–344. 30. Das SK, Austin MD, Akana MC, Deshpande P, Cao H, Xiao M: Single 50. Kan Z, Zheng H, Liu X, Li S, Barber TD, Gong Z, Gao H, Hao K, Willard MD, molecule linear analysis of DNA in nano-channel labeled with sequence Xu J, Hauptschein R, Rejto PA, Fernandez J, Wang G, Zhang Q, Wang B, specific fluorescent probes. Nucleic Acids Res 2010, 38:e177. Chen R, Wang J, Lee NP, Zhou W, Lin Z, Peng Z, Yi K, Chen S, Li L, Fan X, Cao et al. GigaScience 2014, 3:34 Page 11 of 11 http://www.gigasciencejournal.com/content/3/1/34

Yang J, Ye R, Ju J, Wang K, et al: Whole-genome sequencing identifies recurrent mutations in hepatocellular carcinoma. Genome Res 2013, 23:1422–1433. 51. Zhao G, Krishnamurthy S, Cai Z, Popov VL, da Rosa Travassos AP, Guzman H, Cao S, Virgin HW, Tesh RB, Wang D: Identification of Novel Viruses Using VirusHunter – an Automated Data Analysis Pipeline. PLoS One 2013, 8:e78470. 52. Reisinger J, Rumpler S, Lion T, Ambros PF: Visualization of episomal and integrated Epstein-Barr virus DNA by fiber fluorescence in situ hybridization. Int J Cancer 2006, 118:1603–1608. 53. Anantharaman T, Mishra B: A probabilistic analysis of false positives in optical map alignment and validation.InProc. of WABI; 2001:27–40. 54. Harris RS: Improved pairwise alignment of genomic DNA; 2007. 55. Sneddon TP, Zhe XS, Edmunds SC, Li P, Goodman L, Hunter CI: GigaDB: promoting data dissemination and reproducibility. Database 2014, 2014:bau018.

doi:10.1186/2047-217X-3-34 Cite this article as: Cao et al.: Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology. GigaScience 2014 3:34.

Submit your next manuscript to BioMed Central and take full advantage of:

• Convenient online submission • Thorough peer review • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit