Next Generation Sequencing Data and Proteogenomics 2
Total Page:16
File Type:pdf, Size:1020Kb
Next Generation Sequencing Data and Proteogenomics 2 Kelly V. Ruggles and David Fenyö Abstract The fi eld of proteogenomics has been driven by combined advances in next-generation sequencing (NGS) and proteomic methods. NGS tech- nologies are now both rapid and affordable, making it feasible to include sequencing in the clinic and academic research setting. Alongside the improvements in sequencing technologies, methods in high throughput proteomics have increased the depth of coverage and the speed of analy- sis. The integration of these data types using continuously evolving bioin- formatics methods allows for improvements in gene and protein annotation, and a more comprehensive understanding of biological systems. Keywords Next generation sequencing • Proteogenomic integration • Bioinformatics • Peptide identifi cation • Gene annotation 2.1 NGS Overview some cases adding up to 1 TB per run. With this level of data volume and faster data generation, NGS itself refers to a number of techniques, all of bioinformatics has emerged as the true challenge which perform massively parallel sequencing, in in NGS data analysis and integration. which millions of DNA fragments from a sample The most frequently used NGS methods at the are sequenced at the same time (Muzzey et al. DNA level are whole exome sequencing and 2015 ). This produces a vast amount of data, in whole genome sequencing (WGS). In whole K. V. Ruggles D. Fenyö (*) Department of Medicine , New York University Institute for Systems Genetics , New York University Medical Center , 550 First Avenue , New York , Medical Center , 430 East 29th Street , New York , NY 10016 , USA NY 10016 , USA Center for Health Informatics and Bioinformatics , Department of Biochemistry and Molecular New York University Medical Center , 227 East 30th Pharmacology , New York University Medical Center , Street , New York , NY 10016 , USA 550 First Avenue , New York , NY 10016 , USA e-mail: [email protected] e-mail: [email protected] © Springer International Publishing Switzerland 2016 11 Á. Végvári (ed.), Proteogenomics, Advances in Experimental Medicine and Biology 926, DOI 10.1007/978-3-319-42316-6_2 12 K.V. Ruggles and D. Fenyö exome sequencing, only protein-coding regions NGS is also performed on RNA using RNA- of the genome are sequenced, removing the Seq , a technique which is now frequently used in remaining ~99 % of the DNA and thereby signifi - lieu of micro-arrays to assess gene expression. cantly lowering the required time and cost. This RNA-Seq enables researchers to investigate method has been most often employed in studies alternative splicing events, gene fusion events, of gene discovery and the identifi cation of dis- SNPs and gene expression. The experimental ease causing mutations . For WGS however, the procedure is similar to that of DNA sequencing, entire genome is sequenced, which is useful for with an additional step of fi rst deriving cDNA novel gene identifi cation and for the analysis of sequences from all RNA present in the sample. non-coding regions including promoters and Although different sequencing methods can enhancers. produce different raw data types, these data are DNA NGS technologies have enabled most often combined to create a FASTQ fi le, con- researchers to detect differences between an taining information on both sequence and quality. experimental and a reference genome. These typ- This data is fi rst aligned to the reference genome ically fall into two categories: and stored in a sequence alignment map (SAM) or binary alignment map (BAM) fi le using a 1. Large deletions/duplications (copy number sequence alignment algorithm (Li et al. 2009 ) variation (CNV)) (Fig. 2.1 ). A number of algorithms have been 2. Changes to the DNA sequence, also known as developed for this purpose, using Burrows- “variants”, either as single nucleotide poly- Wheeler Transformation (BWT) techniques (e.g. , morphisms (SNPs) or short insertion/dele- Bowtie/Bowtie 2 (Langmead and Salzberg 2012 ), tions (indels) BWA/BWA-SW (Li and Durbin 2010 )) and/or Smith-Waterman (SW) dynamic programing Both require alignment of NGS reads to a refer- ( e.g. , SHRiMP/SHRiMP2 (David et al. 2011 ; ence genome (Fig. 2.1 ). Rumble et al. 2009 )). Novel splice sites Splice Custom Junction Protein Analysis DB BED FASTA RNA Sequence Variant SNPs NGS Alignment Calling indels DNA FASTQ VCF Compare, score, Protein LC-MS/MS test signficance m/z Novel Peptide Identification (SAAPs,NJP) Fig. 2.1 Proteogenomic overview 2 Next Generation Sequencing Data and Proteogenomics 13 2.2 Variant Identifi cation Using Pattern growth approach software ( e.g. , Pindel Proteogenomics (Ye et al. 2009 )), baysian-based algorithms (e.g. , Dindel (Albers et al. 2011 )) and the variant call- 2.2.1 Single Nucleotide ing algorithm GATK (McKenna et al. 2010 ) have Polymorphisms (SNPs) all been refi ned for accurate indel identifi cation (Neuman et al. 2013 ). Following alignment of the DNA or RNA Following variant calling, fi ltering and anno- sequence, subsequent variant calling, fi ltering tation are common steps for isolating variants and annotation can be completed. Variants are most likely to contribute to the pathology of found through the identifi cation of small devia- interest. Although quality cutoffs for variant tions between the experimental and the reference identifi cation should always be employed, addi- genome (Figs. 2.1 and 2.2 ). These variants may tional fi ltering becomes less important in prote- be disease drivers, or mutations having little to ogenomic analysis because proteomic data can no functional impact. Several programs have be leveraged for variant validation. been developed specifi cally for the purpose of variant calling and each produces a list of variant positions stored in a Variant Call Format (VCF) 2.2.2 Single Amino Acid fi le (Danecek et al. 2011 ). Polymorphisms (SAAP) A primary challenge with SNP variant calling is in identifying “true” variants and fi ltering out Identifying variants that are expressed at the pro- those due to errors in sequencing or alignment tein level presents a non-trivial informatics chal- (Nielsen et al. 2011 ). Informatics packages have lenge in that mass spectrometric identifi cation of been developed for variant calling, including peptide sequences is dependent upon the inclusion the popular Genome Analysis Toolkit (GATK) of that sequence in the protein database . Protein (McKenna et al. 2010 ) and VarScan (Koboldt sequence database searching algorithms such as et al. 2012 ). Indel mutation identifi cation pres- X!Tandem (Craig et al. 2005 ), Mascot (Perkins ents an additional set of complications, because it et al. 1999 ) and MSGF+ (Granholm et al. 2014 ) requires a more sophisticated approach to gapped match the MS/ MS spectra against a list of candi- alignment and paired-end sequence inference. date peptide sequences and score the similarity of A.Next Generation Sequencing Reference Genome ...ATGAGTGCTGATAGAAGTATTCTGGTGTACTGTTTAAGCAGG... AAGTAGTCTGGTGTACTGTT GAGTGCTGATAGAAGTAGTC Aligned GATAGAAGTAGTCTGGTGTA NGS ATGAGTGCTGATAGAAGTAG TGCTGATAGAAGTAGTCTGG Reads AGTGCTGATAGAAGTAGTCT GATAGAAGTAGTCTGGTGTAC SNP B. Proteogenomics Reference Peptide MSADRSILVYCLSR Variant Peptide MSADRSS LVYCLSR SAAP Fig. 2.2 Single nucleotide polymorphism and single amino acid polymorphism identifi cation 14 K.V. Ruggles and D. Fenyö a theoretical or library spectrum to the acquired 2.3 Alternative Splicing spectrum based on mass. Databases with missing and Gene Annotation sequences will fail to identify these peptides in the MS/MS data and ideally, the protein database Coding of novel gene regions and alternative would contain all proteins present in the sample splicing provides additional biological complex- with minimal irrelevant sequences (Fig. 2.1 ). ity. The advent of RNA-Seq has shown alterna- Therefore, in order to identify single amino tive splicing to occur in over 90 % of human acid polymorphisms (SAAPs) occurring from genes (Pal et al. 2012 ), emphasizing the role of non-synonymous genomic SNPs, one must cre- diverse protein isoforms in cellular function. ate a protein sequence database that incorporates RNA-Seq analysis provides information on the sequencing data to contain corresponding splice junctions (intron / exon boundaries) pres- SAAPs. These changes are integrated into the ent in a given sample, providing insight into both protein sequence data by fi rst modifying the normal gene annotation and novel expression. genomic reference sequence to include SNPs in Splice sites are identifi ed following sequence the genome and/or transcriptome (Fig. 2.2a ) and alignment using splice-alignment software such then completing an in silico protein translation of as TopHat (Kim et al. 2013 ), BLAT (Fonseca the modifi ed sequences to attain a list of peptides et al. 2012 ) and MapSplice (Wang et al. 2010 ) containing SAAPs (Fig. 2.2b ). (Fig. 2.1 ). Comparing intron / exon boundaries identifi ed through NGS to known junction boundaries can 2.2.3 Bioinformatics Tools identify novel splice sites, including unannotated for Creating SAAP Protein alternative splicing (two known exons) (Fig. Sequence Databases 2.3a ), partially novel splicing (one known exon) (Fig. 2.3b ) and completely novel splicing (no Several tools have been developed to create known exons) (Fig. 2.3c ) (Ruggles et al. 2015 ; NSG-integrated databases containing potential Mertins et al. 2016 ). Hundreds of thousands of variant SAAPs. With the inclusions of these novel splice sites can be identifi ed by one RNA- novel peptide sequences