Informatics and Clinical Genome Sequencing: Opening the Black Box
©American College of Medical Genetics and Genomics | REVIEW

Sowmiya Moorthie, PhD¹, Alison Hall, MA¹ and Caroline F. Wright, PhD¹,²

Adoption of whole-genome sequencing as a routine biomedical tool is dependent not only on the availability of new high-throughput sequencing technologies, but also on the concomitant development of methods and tools for data collection, analysis, and interpretation. It would also be enormously facilitated by the development of decision support systems for clinicians and consideration of how such information can best be incorporated into care pathways. Here we present an overview of the data analysis and interpretation pipeline, the wider informatics needs, and some of the relevant ethical and legal issues.

Genet Med 2013:15(3):165–171

Key Words: bioinformatics; data analysis; massively parallel; next-generation sequencing

INTRODUCTION

Technological advances have resulted in a dramatic fall in the cost of human genome sequencing. However, the sequencing assay is only the beginning of the process of converting a sample of DNA into meaningful genetic information. The next step of data collection and analysis involves extensive use of various computational methods for converting raw data into sequence information, and the application of bioinformatics techniques for the interpretation of that sequence. The enormous amount of data generated by massively parallel next-generation sequencing (NGS) technologies has shifted the workload away from upstream sample preparation and toward downstream analysis processes. This means that, along with the development of sequencing technologies, concurrent development of appropriate informatics solutions is urgently needed to make clinical interpretation of individual genomic variation a realistic goal.1,2

THE DATA ANALYSIS PIPELINE

The data analysis process can be broadly divided into the following three stages (Figure 1).

Primary analysis: base calling
That is, converting raw data, based on changes in light intensity or electrical current, into short sequences of nucleotides.3

Each of these steps requires purpose-built databases, algorithms, software, and expertise to perform. By and large, issues related to primary analysis have been solved and are becoming increasingly automated, and are therefore not discussed further here. Secondary analysis is also becoming increasingly automated for human genome resequencing, and methods of mapping reads to the most recent human genome reference sequence (GRCh37), and calling variants from it, are becoming standardized. The major bottleneck for wider clinical application of NGS is the interpretation of sequence data, which is still a nascent field in terms of developing algorithms, appropriate analytical tools, and effective evidence bases of human genotype–phenotype associations. Although the steps involved in interpretation and application of results will vary depending on the specific clinical setting, the purpose of the testing, and the clinical question, it is likely that there will be commonalities in both the basic analysis pipeline and tools used. Similarly, although some of the details may change with the introduction of third-generation sequencing technologies, such as those that involve real-time detection of single molecules, the data analysis challenge posed by massively parallel sequencing technologies will remain essentially the same. Below, we provide an overview of the secondary and tertiary steps involved in analyzing raw sequence reads from NGS technologies.
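Base calling attaches a quality score to each nucleotide, and these per-base scores are what the secondary-analysis steps that follow consume when judging the reliability of a call. Quality is conventionally expressed on the Phred scale, Q = −10·log10(P), where P is the estimated probability that the base call is wrong, and in FASTQ output each score is encoded as a single ASCII character (commonly with an offset of 33). The following minimal, platform-agnostic sketch decodes such a quality string; the function name is illustrative rather than taken from any particular toolkit:

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) into
    per-base error probabilities.

    Each character encodes Q = ord(char) - 33, and the estimated
    probability that the base call is wrong is P = 10 ** (-Q / 10).
    """
    probabilities = []
    for char in quality_string:
        q = ord(char) - 33          # Phred-scaled quality score
        probabilities.append(10 ** (-q / 10))
    return probabilities

# 'I' encodes Q = 40 (error probability 1e-4);
# '!' encodes Q = 0 (the call is no better than a guess).
probs = decode_phred33("II!")
```

A score of Q30 or above (error probability ≤ 0.001) is a commonly used benchmark for a confidently called base.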
Secondary analysis: alignment and variant calling
That is, mapping individual short sequences of nucleotides, or reads, to a reference sequence and determining variation from that reference.4

Tertiary analysis: interpretation
That is, analyzing variants to assess their origin, uniqueness, and functional impact.5

¹PHG Foundation, Cambridge, UK; ²Current address: Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK. Correspondence: Caroline F. Wright ([email protected]). Submitted 6 June 2011; accepted 6 August 2012; advance publication online 13 September 2012. doi:10.1038/gim.2012.116

[Figure 1. Outline of the informatics pipeline for processing and analyzing data from massively parallel sequencing platforms. Primary analysis: signal analysis, base calling, quality scoring (raw). Secondary analysis: mapping reads to the reference, quality scoring (consensus), and variant calling via read-depth analysis, paired-end analysis, and haplotype inference. Tertiary analysis: visualization, variant analysis against databases of variation with genetic and functional filtering, validation of candidates, and clinical interpretation and decision-making (diagnostic, therapeutic).]

SECONDARY ANALYSIS: VARIANT CALLING AND ANNOTATION

Following initial base calling, the next step toward generating useful genetic information from sequencing reads (i.e., short sequences of nucleotides) involves assembly of the reads into a complete genome sequence by comparison of multiple overlapping reads and the reference sequence. Longer read lengths would make this process much simpler, as each read would be much more likely to map uniquely to a single position in the genome. The process is complicated by the extent of variation in the sample sequence from the reference sequence, and generating a list of true variants is made challenging because there are several possible explanations for differences between the reference and sample genome:

• Inaccuracies in the reference genome, which does not represent any single individual and is still incomplete due to highly repetitive regions that have yet to be sequenced. Furthermore, to date, the reference sequence has been regularly updated, which has caused specific regions to map to different locations between versions.
• Incorrect base calls in the sample sequence due to sequencing or amplification errors.6–8 However, this can be mitigated through read-depth analysis (i.e., evaluating the number of times an individual base is sequenced in independent reads) to assess the reliability of each base call.
• Incorrect alignment between the reference and sample, which may arise due to the highly repetitive nature of substantial portions of the genome and (in the case of NGS platforms) short read lengths. This can result in individual reads mapping to multiple locations. The likelihood that a read is mapped correctly can be indicated by a mapping quality score.
• True genetic variation in the sample. As each human genome is estimated to differ from the reference sequence at ~3–4 million sites,9,10 including single-nucleotide changes and structural variants, mapping these variants is a challenge. Determining which variants are derived from the same physical chromosome can also be difficult from short reads, and haplotypes must either be assembled11 or imputed.12 Using paired-end reads for mapping is particularly important for identifying structural variation, and is critical for cancer genome sequencing due to the presence of extensive large structural rearrangements relative to the matched germline genome, including both intra- and interchromosomal rearrangements.13

The addition of biological information to sequence data is an important step toward making it possible to interpret the potential effects of variants, and involves a mixture of automatic annotation by computational prediction and manual annotation (curation) by expert interpretation of experimental data. The main steps relate to structural information (e.g., gene location, structure, coding, splice sites, and regulatory motifs) and functional information (e.g., protein structure, function, interactions, and expression).14 A fully annotated sequence can include an enormous amount of information, including common and rare variants, comparisons with other species,15 known genetic and epigenetic variants, regulatory features, transcript and expression data, as well as links to protein databases. However, annotation is currently both incomplete and imperfect in the human genome. Projects such as ENCODE, the Encyclopedia of DNA Elements,16,17 and the related GENCODE, encyclopedia of genes and gene variants,18,19 aim to identify all functional elements in the human genome sequence and will ultimately be used to annotate all evidence-based gene features in the entire human genome at high accuracy.

Numerous programs have been developed specifically for genome assembly, alignment, and variant calling based on DNA sequence reads from high-throughput next-generation sequencing platforms2,4,20,21 (Table 1). These include software developed for use with a particular sequencing platform, open-access academic software with a variety of functionalities and platform compatibilities, and proprietary software designed for specific purposes such as diagnostics (Table 2). A major issue for standard alignment programs is the detection of small insertions and deletions, which has been partly addressed by the development of new programs specifically for this purpose22; however, current technology does not yet allow for confident analysis of insertions and deletions as long as several hundred base pairs, especially in repeat sequences. Various dedicated software packages have also been developed
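The read-depth idea behind variant calling can be illustrated with a deliberately naive sketch. Production callers (e.g., SAMtools/BCFtools or GATK) combine base qualities, mapping qualities, and probabilistic genotype models; the function below, whose name and thresholds are purely illustrative, ignores all of that, assumes gapless alignments, and simply piles reads onto the reference and reports sites where coverage is adequate and enough reads disagree with the reference base:

```python
from collections import Counter

def call_variants(reference, aligned_reads, min_depth=10, min_fraction=0.2):
    """Naive variant calling from gapless aligned reads.

    aligned_reads is a list of (start_position, read_sequence) pairs,
    each assumed to align to the reference without indels. A site is
    reported when read depth reaches min_depth and at least
    min_fraction of the covering reads carry the same non-reference base.
    """
    # Build a pileup: count the bases observed at each reference position.
    pileup = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())      # read depth at this site
        if depth < min_depth:
            continue                      # too few independent reads
        # Most frequent base that disagrees with the reference, if any.
        alt, alt_count = max(
            ((b, n) for b, n in counts.items() if b != reference[pos]),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if alt and alt_count / depth >= min_fraction:
            variants.append((pos, reference[pos], alt, alt_count, depth))
    return variants

# Twelve reads cover positions 0-3; half carry T at position 2 (reference G).
calls = call_variants("ACGTACGT",
                      [(0, "ACTT")] * 6 + [(0, "ACGT")] * 6)
# → [(2, 'G', 'T', 6, 12)]
```

Even this toy version shows why depth matters: with fewer than min_depth reads no call is attempted, mirroring the read-depth analysis described above for assessing the reliability of each base call.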