
NGS and Cancer: Exome Analysis, Followed by Variant Calling

29 January 2015, Training "NGS & Cancer": Exome Analyses

Galaxy workflow: overview of exome analysis

Inputs: reads (FASTQ), reference genome (FASTA), target regions (BED)

1. Conversion to Galaxy format (FASTQ Groomer)
2. Quality control (FastQC)
3. Mapping (Bowtie2)
4. PCR duplicates marking (MarkDup)
5. Target intersection (Intersect BAM)
6. Preprocess GATK part 1: local realignment around indels
7. Preprocess GATK part 2: Base Quality Score Recalibration

Output: aligned and preprocessed reads (BAM)
- Marked PCR duplicates
- Intersected on target regions
- Realigned around indels
- Recalibrated

Training "NGS & Cancer: Analysis of Genomic Variants", 7 - 9 April 2014

Public dataset

• Accessible online on SRA (Sequence Read Archive): ERA148528
→ Exome sequencing of 2 samples: tumor (lung cancer) and blood (normal sample)
→ Publication: Ys et al., Genome Res. 2012 Mar;22(3):436-45
• 100 bp paired-end reads, Illumina HiSeq 2000
• Mean depth is higher for the tumor sample (~100X) than for the normal sample (~30X), in order to detect somatic variants with a low allelic frequency
• Aligned exome size: ~15 GB tumor; ~7 GB blood
• Complete analysis processing time: ~20 h
→ The analysis needs to be restricted to a few regions (~112 kb) in order to limit the processing time

Select libraries on Galaxy

1. Open your web browser and go to "https://galaxy.gustaveroussy.fr/galaxyprod" (or « http://galaxy.sb-roscoff.fr »)
2. In the top menu, click on « Shared Data » then « Data libraries »
3. Click on « [FORMATION] Input Data » then « EXOME » (or on « canceropole-tp-input »)
4. Select « tumor_R1.fastq »; « tumor_R2.fastq »; « exome_regions.bed »; « known_sites_regions.vcf », then click on « Go »


FASTQ format conversion

1. Rename your history to « Tumor » by clicking on « Unnamed history »
2. In the left panel, find « FASTQ Groomer » (under the « NGS: QC and manipulation » section, or by entering its name in the « search tools » textbox) and click on it to convert both of your FASTQ files into FASTQ Sanger format
3. Click on « Execute » to launch the conversion
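As a rough illustration of what a conversion to FASTQ Sanger format involves, the sketch below re-encodes a quality string from a Phred+64 offset to the Sanger Phred+33 offset. The slides do not state the encoding of the input files, so the Phred+64 input is purely an assumption for the example, and `illumina64_to_sanger` is a hypothetical helper, not the tool's code.

```python
def illumina64_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for char in quality:
        phred = ord(char) - 64  # decode the Phred score (assumed Phred+64 input)
        if phred < 0:
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
        converted.append(chr(phred + 33))  # re-encode with the Sanger offset
    return "".join(converted)


# Hypothetical quality string: Phred scores 40, 40, 40, 39, 2
print(illumina64_to_sanger("hhhgB"))  # IIIH#
```

The bases themselves are unchanged; only the ASCII offset of the quality line differs between the two encodings.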

GENERAL TIP: RENAME YOUR HISTORY ITEMS FREQUENTLY TO BE MORE EXPLICIT THAN « on data xxx »!

FASTQC: FASTQ Quality Control

1. In the left panel, find « FASTQC: Read QC » (under the « NGS: QC and manipulation » section, or by entering its name in the « search tools » textbox)
2. Select the FASTQ Groomer dataset and click on « Execute »; repeat for both reads

The result of FASTQC is an HTML page that you can view by clicking on the eye icon

FastQC Metrics

• Look at the different metrics for both reads
• Problem: the per-base sequence quality of Read 2 is quite low towards the end
→ Solution: trim 25 bp from the 3' end of the reads
→ Higher confidence in the sequenced information

FASTQ Trimmer

1. Use « FASTQ Trimmer » to cut off 25 bp from the 3' end of the groomed reads (use the « search tools » textbox to find the tool)
2. Run « FASTQC » on the trimmed reads
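The fixed-length trimming step can be sketched as follows. This is a minimal illustration on a single in-memory read, not the Galaxy tool itself; `trim_3prime` and the example read are hypothetical.

```python
def trim_3prime(sequence: str, quality: str, n_bases: int = 25):
    """Remove n_bases from the 3' end of a read and of its quality string."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    keep = max(len(sequence) - n_bases, 0)  # never go below an empty read
    return sequence[:keep], quality[:keep]


# Hypothetical 100 bp read whose last 25 bases have a low quality ('#')
seq = "ACGT" * 25
qual = "I" * 75 + "#" * 25
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(len(trimmed_seq), trimmed_qual == "I" * 75)  # 75 True
```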

If the distribution of bad-quality sequences/bases is more complex:

→ Use a more elaborate trimming step
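One possible form of a more elaborate trimming step is quality-aware trimming, where bases are removed from the 3' end only while their quality stays below a threshold. The sketch below assumes Sanger (Phred+33) qualities; the threshold of 20 and the function name are illustrative choices, not values from the course.

```python
def quality_trim_3prime(sequence: str, quality: str,
                        min_phred: int = 20, offset: int = 33):
    """Trim bases from the 3' end while their Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1  # drop one more low-quality base from the 3' end
    return sequence[:end], quality[:end]


# Hypothetical read: 6 good bases ('I' = Phred 40), then 4 poor ones ('#' = Phred 2)
print(quality_trim_3prime("ACGTACGTAC", "IIIIII####"))  # ('ACGTAC', 'IIIIII')
```

Unlike a fixed 25 bp cut, this keeps the full length of reads whose 3' end is still of good quality.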


Mapping with Bowtie2

1. Use « Bowtie2 » from the « Mapping » section to align the reads on the hg19 genome

Preset option: combination of parameters designed to have a good tradeoff between speed, sensitivity, and accuracy

SAM/BAM aligned format

• SAM format: aligned format, human readable

Header lines:
@SQ SN:chr12 LN:133851895
@RG ID:Sample_ID LB:Sample_Library PL:ILLUMINA SM:Sample_Name PU:Platform_Unit

Fields of an alignment line:
Read name: ERR166338.1
Flag: 99
Chr: chr12
5' pos: 82670685
MAPQ: 23
CIGAR: 101M
Paired (chr of the mate): =
5' pos of the mate: 82670850
Insert size: 266
Sequence: GCCCCTGGGGATGTTTTGCACCAAGCCACTGTCTCCAGCTGG
Base quality: BBC@GIIHGCFCIEHEAIEIFFGEONDNJFINIONHNGJNNNNKNJN
Tags: RG:Z:Sample_ID (group affiliation) XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 XA:Z

• BAM format: binary SAM format (not human readable, but compressed = smaller)
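To make the SAM fields more concrete, here is a small sketch that splits an alignment line into its mandatory columns and decodes the flag. The line is rebuilt from the values shown on the slide (sequence and quality are truncated there and kept as such); `parse_sam_line` and `decode_flag` are hypothetical helpers, and real analyses would rather rely on a dedicated library.

```python
# Meaning of each bit of the SAM flag, as defined by the SAM specification
FLAG_BITS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM flag."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def parse_sam_line(line: str) -> dict:
    """Split a SAM alignment line into its 11 mandatory fields plus tags."""
    fields = line.rstrip("\n").split("\t")
    return {
        "name": fields[0], "flag": int(fields[1]), "chrom": fields[2],
        "pos": int(fields[3]), "mapq": int(fields[4]), "cigar": fields[5],
        "mate_chrom": fields[6], "mate_pos": int(fields[7]),
        "insert_size": int(fields[8]), "seq": fields[9], "qual": fields[10],
        "tags": fields[11:],
    }


# Values copied from the slide (SEQ and QUAL are truncated on the slide)
line = "\t".join([
    "ERR166338.1", "99", "chr12", "82670685", "23", "101M", "=", "82670850",
    "266", "GCCCCTGGGGATGTTTTGCACCAAGCCACTGTCTCCAGCTGG",
    "BBC@GIIHGCFCIEHEAIEIFFGEONDNJFINIONHNGJNNNNKNJN", "RG:Z:Sample_ID",
])
record = parse_sam_line(line)
print(record["chrom"], record["pos"], decode_flag(record["flag"]))
```

Flag 99 is 1 + 2 + 32 + 64, so the read is paired, properly paired, its mate is on the reverse strand, and it is the first read of the pair.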

Mapping statistics (directly from the mapper)

Mapping statistics

• Use « Flagstat » from « Samtools » to see some mapping statistics

% of mapped reads

Properly paired reads:
- 0 <= insert size <= max size
- Reads on same chromosome
- Reads facing each other
- Both reads are mapped
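The statistics above can be derived from the SAM flags alone. The sketch below is a simplified, assumed version of such a summary (it ignores the special handling of secondary and supplementary alignments that a real tool applies); `mapping_summary` and the example flags are hypothetical.

```python
def mapping_summary(flags):
    """Summarize a list of SAM flags: total, mapped, % mapped, properly paired."""
    total = len(flags)
    mapped = sum(1 for flag in flags if not flag & 0x4)  # bit 0x4 = unmapped
    properly_paired = sum(1 for flag in flags if flag & 0x1 and flag & 0x2)
    return {
        "total": total,
        "mapped": mapped,
        "mapped_pct": 100.0 * mapped / total if total else 0.0,
        "properly_paired": properly_paired,
    }


# Hypothetical flags: one properly paired pair (99, 147), one unmapped pair (77, 141)
summary = mapping_summary([99, 147, 77, 141])
print(summary)
```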

Removing duplicates (not for targeted sequencing)

• Duplicate reads: different reads having the same sequence, caused by PCR amplification during sequencing library preparation
• The removal of duplicates depends on the application (not suitable for sequencing on a small target)

[Figure: PCR duplicate removal]

• Galaxy: use "Mark Duplicates reads" from "NGS:Picard" to mark duplicates
• Galaxy: run "Flagstat" on the output BAM to see the number of PCR duplicates
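A simplified sketch of duplicate marking is given below. It assumes that reads sharing the same chromosome, position, strand and mate position are duplicates of one another, and that the read with the highest sum of base qualities is kept as the representative. This is an illustration of the idea only, not the exact algorithm of the Picard tool; the dictionary keys and read names are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with the keys: name, chrom, pos, strand,
    mate_pos and base_quality_sum (hypothetical representation).
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["mate_pos"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the best qualities, mark the others
        best = max(group, key=lambda read: read["base_quality_sum"])
        duplicates.update(read["name"] for read in group if read is not best)
    return duplicates


reads = [
    {"name": "r1", "chrom": "chr12", "pos": 100, "strand": "+", "mate_pos": 300, "base_quality_sum": 3500},
    {"name": "r2", "chrom": "chr12", "pos": 100, "strand": "+", "mate_pos": 300, "base_quality_sum": 3900},
    {"name": "r3", "chrom": "chr12", "pos": 150, "strand": "+", "mate_pos": 350, "base_quality_sum": 3000},
]
print(mark_duplicates(reads))  # {'r1'}
```

Marked reads stay in the BAM file with the duplicate bit set in their flag, which is why « Flagstat » can count them afterwards.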

Target intersection

• Use « Intersect BAM alignments with intervals » from « NGS:Bedtools » to keep only the reads mapped on the targeted regions
→ Smaller BAM size
→ The targeted regions must be in BED format (4 columns: chr; start; end; name)
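The intersection logic can be sketched as a simple overlap test between a read and 4-column BED regions. BED coordinates are 0-based and half-open, whereas SAM positions are 1-based, so the example converts the read position first. The region below is hypothetical and the helpers are illustrative, not the Bedtools implementation.

```python
def parse_bed(lines):
    """Parse 4-column BED lines into (chrom, start, end, name) tuples."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions


def overlaps_target(chrom, read_start, read_end, regions):
    """True if the half-open interval [read_start, read_end) hits a region."""
    return any(
        chrom == region_chrom and read_start < region_end and region_start < read_end
        for region_chrom, region_start, region_end, _ in regions
    )


# Hypothetical target region
regions = parse_bed(["chr12\t82670000\t82671000\ttarget_1\n"])

# Read at 1-based position 82670685 with a 101M CIGAR: 0-based start is pos - 1
read_start = 82670685 - 1
read_end = read_start + 101
print(overlaps_target("chr12", read_start, read_end, regions))  # True
print(overlaps_target("chr1", read_start, read_end, regions))   # False
```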

Group association

• Use « Add or Replace Groups » from « NGS:Picard » to associate a sample ID and a sequencing technology to the reads
→ Mandatory for some tools (GATK) or in multi-sample analysis
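In the SAM format, this association is stored as an `@RG` header line plus an `RG:Z:` tag on each read. The sketch below builds both as plain strings, reusing the placeholder values shown on the SAM format slide; the two helper functions are hypothetical and only illustrate the layout.

```python
def read_group_header(rg_id, library, platform, sample, platform_unit):
    """Build an @RG header line from the read group attributes."""
    fields = [("ID", rg_id), ("LB", library), ("PL", platform),
              ("SM", sample), ("PU", platform_unit)]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


def tag_read(sam_line, rg_id):
    """Append the read group tag to an alignment line."""
    return f"{sam_line}\tRG:Z:{rg_id}"


header = read_group_header("Sample_ID", "Sample_Library", "ILLUMINA",
                           "Sample_Name", "Platform_Unit")
print(header)
print(tag_read("ERR166338.1\t99\tchr12\t82670685", "Sample_ID"))
```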

Next session: overview of exome analysis

[Workflow diagram repeated: overview of exome analysis]

Variant calling pre-processing: GATK part 1

[Workflow diagram repeated: overview of exome analysis, step « Preprocess GATK part 1: local realignment around indels »]

Why realign around indels?

• Small insertions/deletions (indels) in reads (especially near the ends) can trick the mappers into mis-aligning with mismatches
→ Alignment scoring: it is cheaper to introduce multiple Single Nucleotide Variants (SNVs) than an indel, which induces a lot of false positive SNVs

• These artifactual mismatches can harm base quality recalibration and variant detection

• Realignment around indels helps improve the accuracy of several of the downstream processing steps

Local realignment identifies the most parsimonious alignment along all reads at a problematic locus

1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile

[Figure: the same reads are either consistent with the reference plus 3 adjacent SNPs, or consistent with a 3 bp insertion; realigning determines which is better]

2. The score for an alternate consensus is the total sum of the quality scores of mismatching bases
3. If the score of the best alternate consensus is sufficiently better than the original alignments, then we accept the proposed realignment of the reads
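Steps 2 and 3 can be sketched as follows: the score of a set of alignments is the sum of the base qualities at mismatching positions, and the realignment is accepted when the alternate consensus lowers that score by a sufficient margin. The margin of 5 and the toy alignments are assumptions made for this illustration; the actual tool uses its own acceptance criterion.

```python
def mismatch_quality_sum(alignments):
    """Sum the qualities of read bases that mismatch their aligned target.

    Each alignment is (read_bases, target_bases, qualities), compared
    position by position without gaps (a simplification).
    """
    total = 0
    for read_bases, target_bases, qualities in alignments:
        for read_base, target_base, quality in zip(read_bases, target_bases, qualities):
            if read_base != target_base:
                total += quality
    return total


def accept_realignment(original, alternate, min_improvement=5):
    """Accept the alternate consensus if it is sufficiently better."""
    improvement = mismatch_quality_sum(original) - mismatch_quality_sum(alternate)
    return improvement >= min_improvement


# Hypothetical read: 3 mismatches against the reference, none against the
# alternate consensus that contains the indel
qualities = [30] * 7
original = [("ACGTAGG", "ACGTCCT", qualities)]
alternate = [("ACGTAGG", "ACGTAGG", qualities)]
print(mismatch_quality_sum(original), mismatch_quality_sum(alternate))  # 90 0
print(accept_realignment(original, alternate))  # True
```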

Three types of realignment targets

• Known sites:
→ Common polymorphisms: dbSNP, 1000 Genomes

• Indels seen in the original alignments (in the CIGAR string, indicated by I for insertion or D for deletion)

• Sites where evidence suggests a hidden indel (abundance of SNVs)

Known sites
https://www.broadinstitute.org/gatk/guide/tagged?tag=knownsites

Why are they important?

Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

2. Recommended sets of known sites per tool

Local realignment around indels

[Figures: alignments before and after local realignment; labels: SNVs, Deletion]

Indel realignment steps/tools

1. Identify what regions need to be realigned
→ RealignerTargetCreator + known sites
→ Output: intervals
2. Perform the actual realignment (BAM output)
→ IndelRealigner

Galaxy: Realigner Target Creator

• Use « Realigner Target Creator » from « GATK Tools » to detect intervals in need of local realignment

Options shown on the form:
- Choose advanced GATK options
- Add new binding for reference-ordered data
- Add new operate on Genomic Intervals

Galaxy: Indel Realigner

• Use « Indel Realigner » from « GATK Tools » to apply local realignment

Options shown on the form:
- Choose advanced GATK options
- Add new binding for reference-ordered data
- Add new operate on Genomic Intervals

Preprocess GATK: part 2

[Workflow diagram repeated: overview of exome analysis, step « Preprocess GATK part 2: Base Quality Score Recalibration »]

Why recalibrate base qualities?

Real data is messy, so properly estimating the evidence is critical

The quality scores issued by sequencers are inaccurate and biased

• Quality scores are critical for all downstream analysis
• Systematic biases are a major contributor to bad calls
• Example of sequence context bias in the reported qualities:

[Figure: reported qualities before and after recalibration]

Evidence of error covariates

• Analyze covariation among several features of a base, e.g.:
- Reported quality score
- Position within the read (machine cycle)
- Preceding and current nucleotides (chemistry effect)
- Sequencing technology...

• Adjust the quality score associated with each sequenced base to be more accurate
→ Remove systematic biases

How are the covariates analyzed?

• Keep track of the number of observations and the number of times it was an error as a function of various covariates:
• Typically stratify the data by lane, by original quality score, by machine cycle and by sequencing context
• Databases of known variants are used to discount most of the real genetic variation present in the sample
• All other differences are assumed to be errors
• Having done indel realignment first reduces noise

Phred-scaled quality score = -10 × log10((#mismatches + 1) / (#bases + 2))

https://www.broadinstitute.org/gatk/guide/article?id=44
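The counting scheme described above can be sketched as follows: observations are binned by covariates, positions found in the known-variant set are skipped, and each bin receives an empirical Phred-scaled quality computed from (#mismatches + 1) / (#bases + 2). Only two covariates are used here, and the data layout and function names are assumptions made for the illustration, not the GATK implementation.

```python
import math
from collections import defaultdict


def empirical_quality(mismatches: int, bases: int) -> float:
    """Phred-scaled quality from smoothed mismatch counts."""
    return -10 * math.log10((mismatches + 1) / (bases + 2))


def recalibration_table(observations, known_sites):
    """Build {(reported_quality, cycle): empirical quality}.

    observations: iterable of (position, reported_quality, cycle, is_mismatch)
    known_sites: set of positions of known variants, excluded from the counts
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, bases]
    for position, reported_quality, cycle, is_mismatch in observations:
        if position in known_sites:
            continue  # real variation, not a sequencing error
        bin_counts = counts[(reported_quality, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    return {key: empirical_quality(m, n) for key, (m, n) in counts.items()}


# Hypothetical observations, all reported as Q30 at machine cycle 1
observations = [(100, 30, 1, True), (101, 30, 1, False),
                (102, 30, 1, False), (200, 30, 1, True)]
table = recalibration_table(observations, known_sites={200})
print(table)
```

In this toy example the bin reported as Q30 has 1 mismatch in 3 counted bases, so its empirical quality is far lower than 30; with realistic numbers of observations the estimate becomes meaningful.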

BQSR steps/tools

1. Model the error modes and recalibrate qualities
→ Base Recalibrator (Count Covariates)
→ Output: covariates table
2. Perform the actual recalibration (BAM output)
→ Print Reads (Table Recalibration)

Galaxy: Base Recalibrator

• Use « Base Recalibrator » from « GATK Tools » to recalibrate base quality scores

Options shown on the form:
- Add new binding for reference-ordered data
- Choose advanced GATK options
- Add new operate on Genomic Intervals

Galaxy: Print Reads

Final results

[Workflow diagram repeated: overview of exome analysis, ending with the aligned and preprocessed reads (BAM)]

Next step: somatic variant calling

Tumor: aligned and preprocessed reads (BAM) → Mpileup (Samtools Mpileup)
Normal: aligned and preprocessed reads (BAM) → Mpileup (Samtools Mpileup)
Tumor and normal pileups → Variant calling (VarScan Somatic) → Variant annotation (Annovar) → Variant selection (Select)

Using workflows in Galaxy: repeat the same steps for the blood sample

[Somatic variant calling diagram repeated, with « Repeat process » on the normal (blood) sample branch]

Repeat all these steps in a few clicks…

[Workflow diagram repeated: overview of exome analysis]

Select libraries on Galaxy

1. In the top menu, click on « Shared Data » then « Data libraries »
2. Click on « [FORMATION] Input Data » then « EXOME » (or on « canceropole-tp-input »)
3. Select « normal_R1.fastq »; « normal_R2.fastq »; « exome_regions.bed »; « known_sites_regions.vcf »
4. Select « Import to Histories » then click on « Go »
5. Write a new history name and click on « Import library datasets »

Extract workflow from history

1. In the top menu, click on « Analyze Data » to return to the main frame
2. In the « History » panel, click on the wheel at the top, then on « Extract Workflow »
3. Write a name for your workflow, then click on « Create Workflow »

Edit your workflow

1. In the top menu, click on « Workflow » to return to the main frame
2. Click on your new workflow, then select « Edit »
3. Identify and rename the « Input dataset » boxes corresponding to R1/R2/BED/VCF
4. Each box represents a tool set with parameters that you can modify by clicking on it
→ Click on the « Add or Replace Groups » box and change the read group ID and name (you can also choose to set them at runtime)
→ Check the input of the « Intersect BAM » box: it should be the BAM output from « Mark Duplicates »
5. Save your edited workflow by clicking on the wheel, then « Save »

Run your workflow on the blood sample

1. A workflow can only be run on data present in the current history
→ Click on the wheel at the top of your history panel and select « Saved histories »
→ Click on the « Normal » history and click on « Switch »
→ Go back to the « Workflow » page (top panel), then click on your workflow, then on « Run »
2. Check that all your input files are correct (step 1: BED, step 2: VCF, step 3: R1, step 4: R2)
3. Click on « Run workflow » at the bottom of the page

Let Galaxy work for you!

Next step

Somatic Variant Calling with VarScan2

[Somatic variant calling diagram repeated: Samtools Mpileup on the tumor and normal BAM files, then VarScan Somatic, Annovar and Select]

Variant Calling

• Factors to consider when calling SNVs:
– Base call qualities of each supporting base (base quality)
– Proximity to small indels or homopolymer runs
– Mapping qualities of the reads supporting the SNV
– Sequencing depth: >=30x for constitutional samples; >=100x for tumor samples
– Position of the SNV within the reads: higher error rate at the read ends
– Strand bias (SNVs supported by only one strand are more likely to be artifactual)
– Allelic frequency: tumor cellularity will reduce the % of a heterozygous variant
• Higher stringency when calling indels (Sanger validation is often needed)
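The effect of tumor cellularity on allelic frequency can be illustrated with a short calculation. The sketch below assumes the simplest possible model (a clonal heterozygous somatic variant in a diploid, copy-neutral region), which is an assumption of this example and not a statement from the slides; the function names are hypothetical.

```python
def expected_het_vaf(tumor_cellularity: float) -> float:
    """Expected variant allele frequency of a heterozygous somatic variant.

    Assumes a diploid, copy-neutral locus and a variant present in all
    tumor cells; normal cells contribute only reference alleles.
    """
    if not 0.0 <= tumor_cellularity <= 1.0:
        raise ValueError("tumor_cellularity must be between 0 and 1")
    return 0.5 * tumor_cellularity


def expected_variant_reads(depth: int, tumor_cellularity: float) -> float:
    """Expected number of reads supporting the variant at a given depth."""
    return depth * expected_het_vaf(tumor_cellularity)


for cellularity in (1.0, 0.6, 0.2):
    print(cellularity, expected_het_vaf(cellularity),
          expected_variant_reads(100, cellularity))
```

Under this model, a sample with 20% tumor cells shows a heterozygous variant at about 10%, that is roughly 10 supporting reads at 100x but only about 3 at 30x, which is one way to read the depth recommendation above.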

Depth of Coverage

Depth of coverage = number of reads supporting one position, e.g. 1X, 5X, 10X, 100X... >1000X

[Figure: reads aligned to the reference genome, with reference bases, SNVs and sequencing errors highlighted, at depths of 7X, 2X and 17X. Annotated examples: 5X on the + strand, 100% = homozygous SNV; 9X on the + strand and 8X on the - strand, 50% = heterozygous SNV; 2X on the - strand, 50% SNV = sequence context (errors). Calling confidence increases from --- to +++.]
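The definition of depth of coverage translates directly into a per-position count. The sketch below uses hypothetical reads given as 0-based, half-open intervals on a single chromosome; it is meant for small examples only, since real tools compute depth far more efficiently.

```python
from collections import Counter


def depth_of_coverage(read_intervals):
    """Count, for each position, how many reads cover it.

    read_intervals: iterable of (start, end), 0-based and half-open.
    """
    depth = Counter()
    for start, end in read_intervals:
        for position in range(start, end):
            depth[position] += 1
    return depth


# Three hypothetical overlapping reads of 5 bp each
depth = depth_of_coverage([(0, 5), (2, 7), (4, 9)])
print(depth[0], depth[4], depth[8], depth[9])  # 1 3 1 0
```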

VarScan2

• Mutation caller written in Java (no installation required), working with pileup files of targeted, exome, and whole-genome sequencing data (DNA-seq or RNA-seq)

• Multi-platform: Illumina, SOLiD, Life/PGM, Roche/454

• Detection of different kinds of germline SNVs/indels (classical mode):
→ Variants in individual samples
→ Multi-sample variants, shared or private, in multi-sample datasets

• The specificity of VarScan is its ability to work with tumor/normal pairs (somatic mode):
→ Somatic and germline mutations, LOH events in tumor-normal pairs
→ Somatic copy number alterations (CNAs) in tumor-normal exome data

VarScan2
• Most published variant callers (e.g. GATK) use Bayesian statistics (a probabilistic framework) to detect variants and assess confidence in them

• VarScan uses a robust heuristic/statistical approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency and statistical significance

• Stead et al. (2013) compared three somatic callers: MuTect, Strelka and VarScan2
 VarScan2 performed best overall; sequencing depths of 100X, 250X, 500X and 1000X were required to accurately identify variants present at 10%, 5%, 2.5% and 1% respectively
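The threshold-based idea can be sketched as follows. This is an illustration only, not VarScan's code: the function names, the threshold values (`min_coverage`, `min_var_freq`, `max_p_value`) and the read counts are assumptions, and the significance test shown is a one-sided Fisher's exact test on tumor versus normal read counts.

```python
from math import comb


def fisher_one_sided(tumor_ref, tumor_alt, normal_ref, normal_alt):
    """One-sided Fisher's exact test: are variant reads enriched in the tumor?

    Computed as the upper tail of the hypergeometric distribution.
    """
    total = tumor_ref + tumor_alt + normal_ref + normal_alt
    alt_total = tumor_alt + normal_alt
    tumor_total = tumor_ref + tumor_alt
    upper = min(alt_total, tumor_total)
    tail = sum(
        comb(alt_total, k) * comb(total - alt_total, tumor_total - k)
        for k in range(tumor_alt, upper + 1)
    )
    return tail / comb(total, tumor_total)


def passes_thresholds(tumor_ref, tumor_alt, normal_ref, normal_alt,
                      min_coverage=8, min_var_freq=0.10, max_p_value=0.05):
    """Heuristic call: depth, allele frequency and significance must all pass.

    The default thresholds are illustrative assumptions.
    """
    tumor_depth = tumor_ref + tumor_alt
    normal_depth = normal_ref + normal_alt
    if tumor_depth < min_coverage or normal_depth < min_coverage:
        return False
    if tumor_alt / tumor_depth < min_var_freq:
        return False
    p_value = fisher_one_sided(tumor_ref, tumor_alt, normal_ref, normal_alt)
    return p_value <= max_p_value


# Illustrative counts: 8 variant reads out of 58 in the tumor, none in the normal.
print(passes_thresholds(tumor_ref=50, tumor_alt=8, normal_ref=48, normal_alt=0))
```

Base quality filtering is left out of the sketch; it would be applied to individual bases before the read counts are made.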

Common history
1. In the wheel menu of the history panel, select « Copy datasets »
2. Select the preprocessed BAM from the Normal and Tumor histories and « exome_regions.bed »

Common history
1. Edit each BAM's attributes by clicking on the little pen
2. For clarity, rename your BAMs to « Normal.bam » and « Tumor.bam », then « Save » the changes

Somatic variant calling with VarScan

Workflow (run on both the Tumor and the Normal aligned and preprocessed reads, BAM): Mpileup (Samtools Mpileup) → Variant Calling (VarScan Somatic) → Variant Annotation (Annovar) → Variant Selection (Select)

Mpileup
1. Use « Mpileup » from « NGS:Samtools » to create pileup files (repeat for the Tumor and Normal samples)
Anomalous read pairs are due to the restriction of the exome to a region

Pileup format

• Describes the base-pair information at each position

Columns: chromosome, position, reference base, number of reads covering the site (total depth), read bases, base qualities

Example: two adjacent positions on chr12, each covered by 108 reads. At 112888238 (reference base A) the read bases shown are matches (. and ,); at 112888239 (reference base C) matches are mixed with T/t mismatches.

Read bases:
. / , = match on forward/reverse strand
ACGTN / acgtn = mismatch on forward/reverse strand
[+-][0-9]+[ACGTNacgtn]+ indicates an indel
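The read-bases column can be decoded with a few lines of code. This is a minimal sketch, assuming the standard samtools pileup encoding, in which `^` marks a read start and is followed by one mapping-quality character, and `$` marks a read end; the function name and the example string are hypothetical.

```python
import re
from collections import Counter


def count_read_bases(read_bases):
    """Tally one pileup read-bases string by match/mismatch and strand."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        char = read_bases[i]
        if char == "^":
            # Read start marker plus its mapping-quality character.
            i += 2
        elif char in "+-":
            # Indel: sign, length, then that many inserted/deleted bases.
            length = re.match(r"\d+", read_bases[i + 1:]).group()
            counts["indel"] += 1
            i += 1 + len(length) + int(length)
        else:
            if char == ".":
                counts["ref_fwd"] += 1
            elif char == ",":
                counts["ref_rev"] += 1
            elif char in "ACGTN":
                counts["alt_fwd"] += 1
            elif char in "acgtn":
                counts["alt_rev"] += 1
            # "$" (read end) and "*" (deleted base) are skipped.
            i += 1
    return counts


# Hypothetical read-bases string covering every case handled above.
print(count_read_bases(".$,^F.Tt+2AG,-1c."))
```

Counting forward and reverse mismatches separately is what later allows strand bias to be checked.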


VarScan Somatic
1. Use « VarScan somatic » from « Varscan » to detect variants
• Min-var-freq: minimum allelic frequency to call a variant (10% here)
• Min-coverage: minimum coverage to call a variant (in normal, in tumor, and combined)
• Tumor and normal purity: cellularity of your sample

 2 output files: SNVs & Indels in VCF format

VarScan VCF format

• 2 types:
 VCF (specific to VarScan)
 Tabulated (available only for VarScan in classical mode)

• VarScan VCF format: classic VCF header (#) but specific variant lines

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr12 250239 . A G 20 PASS DP=115;SOMATIC;SS=2;SSC=21;GPV=1;SPV=6.3E-3 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:52:48:0:0%:15,33,0,0 0/1:.:63:50:8:13.79%:19,31,3,5

GT = Genotype (1/1: homozygous; 0/1: heterozygous)
GQ = Genotype quality
SS = Somatic status (0 = ref; 1 = germline; 2 = somatic; 3 = LOH; 5 = unknown)
DP = Quality read depth of bases with Phred score >= BAPQ
RD = Depth of reference-supporting bases
AD = Depth of variant-supporting bases
FREQ = Variant allele frequency
DP4 = Ref/FWD, Ref/REV, Alt/FWD, Alt/REV
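The per-sample fields are read by pairing the FORMAT keys with the values in the NORMAL and TUMOR columns. The sketch below does this for the example record on this slide and recomputes FREQ from RD and AD; the helper names are hypothetical, and the record is written with tab separators as in a real VCF file.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (GT:GQ:DP:...) with one sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def variant_freq(sample):
    """Variant allele frequency recomputed as AD / (RD + AD)."""
    ref_depth, alt_depth = int(sample["RD"]), int(sample["AD"])
    depth = ref_depth + alt_depth
    return alt_depth / depth if depth else 0.0


vcf_line = (
    "chr12\t250239\t.\tA\tG\t20\tPASS\t"
    "DP=115;SOMATIC;SS=2;SSC=21;GPV=1;SPV=6.3E-3\t"
    "GT:GQ:DP:RD:AD:FREQ:DP4\t"
    "0/0:.:52:48:0:0%:15,33,0,0\t"
    "0/1:.:63:50:8:13.79%:19,31,3,5"
)
fields = vcf_line.split("\t")
normal = parse_sample(fields[8], fields[9])
tumor = parse_sample(fields[8], fields[10])

print(f"normal: {variant_freq(normal):.2%}, tumor: {variant_freq(tumor):.2%}")

# DP4 splits RD and AD by strand: Ref/FWD, Ref/REV, Alt/FWD, Alt/REV.
ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in tumor["DP4"].split(","))
print(ref_fwd + ref_rev == int(tumor["RD"]), alt_fwd + alt_rev == int(tumor["AD"]))
```

In this record FREQ is computed from RD + AD (58 reads in the tumor), not from DP (63).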

VarScan Tabulated Format

Chrom  Position   Ref  Cons  Reads1  Reads2  VarFreq  Strands1  Strands2  Qual1  Qual2  Pvalue  MapQual1  MapQual2  R1+  R1-  R2+  R2-  Alt
chr12  113348849  C    Y     31      30      49.18%   2         2         27     27     0.98    1         1         19   12   25   5    T
chr12  113354329  G    R     72      2       2.70%    2         2         31     26     0.98    1         1         48   24   1    1    A
chr12  113357193  G    A     2       72      97.30%   1         2         28     24     0.98    1         1         2    0    45   27   A
chr12  113357209  G    A     0       77      100%     0         2         0      29     0.98    0         1         0    0    51   26   A

Cons: Consensus genotype of the variant called (IUPAC code):
M -> A or C
R -> A or G
W -> A or T
S -> C or G
Y -> C or T
K -> G or T
V -> A or C or G
H -> A or C or T
D -> A or G or T
B -> C or G or T
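The Cons column can be translated back into bases with a lookup table. This is a minimal sketch covering the IUPAC codes listed above plus the four plain bases; the function name is hypothetical.

```python
# IUPAC ambiguity codes: each code stands for a set of possible bases.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "M": {"A", "C"}, "R": {"A", "G"}, "W": {"A", "T"},
    "S": {"C", "G"}, "Y": {"C", "T"}, "K": {"G", "T"},
    "V": {"A", "C", "G"}, "H": {"A", "C", "T"},
    "D": {"A", "G", "T"}, "B": {"C", "G", "T"},
}


def variant_alleles(ref, cons):
    """Bases in the consensus genotype that differ from the reference base."""
    return sorted(IUPAC[cons.upper()] - {ref.upper()})


# Ref/Cons pairs taken from the tabulated example: C/Y, G/R and G/A.
for ref, cons in (("C", "Y"), ("G", "R"), ("G", "A")):
    print(ref, cons, variant_alleles(ref, cons))
```

A two-letter code such as Y or R with one base equal to the reference corresponds to a heterozygous call; a plain base different from the reference corresponds to a homozygous variant.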

Variant Annotation

Different types of SNVs

• SNVs and short indels are the most frequent events:
 Intergenic
 Intronic
 cis-regulatory
 splice sites
 frameshift or not
 synonymous or not
 benign or damaging, etc.

• Example of an SNV one wants to pinpoint:
 non-synonymous + highly deleterious + somatically acquired

Resources dedicated to human genetic variations

• dbSNP and 1000-genomes
 Population-scale DNA polymorphisms
• COSMIC
 Catalogue Of Somatic Mutations In Cancer
• Non-synonymous SNV predictions
 SIFT, Polyphen2 (damaging impact)... PhyloP, GERP++ (conservation)
• Tools to annotate genetic variations
 ANNOVAR

Annovar
Use « Annovar » to annotate SNVs and Indels
Multi-sample VCF (contains Tumor & Normal samples)
• RefGene: Gene & Function & AminoAcid Change (HGVS format: c.A155G ; p.Lys45Arg)
• 1000g2012apr_all: Minor Allele Frequency for all populations
• ESP6500: Exome Sequencing Project
• Ljb_all: predictions (SIFT, Polyphen2, LRT, MutationTaster, PhyloP, GERP++)
 Tabulated file

Variant selection

Select variants predicted as somatic
• Use the « Select » tool from « Filter and Sort » to select only lines matching the pattern « SOMATIC »
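Outside Galaxy, the same selection amounts to keeping the lines that match a pattern. This is a minimal sketch, not the Galaxy tool's code; the function name and the three example lines are hypothetical, with abbreviated columns.

```python
import re


def select_lines(lines, pattern="SOMATIC"):
    """Keep only the lines that contain the pattern (case-sensitive)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical annotated lines, reduced to a few columns for illustration.
annotated = [
    "chr12\t250239\tA\tG\tDP=115;SOMATIC;SS=2",
    "chr12\t250301\tC\tT\tDP=98;SS=1",
    "chr12\t250410\tG\tA\tDP=120;SS=3",
]
for line in select_lines(annotated):
    print(line)
```

A plain pattern match also keeps any header or comment line that happens to contain the word SOMATIC, so the result is worth checking.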

Annex 1: Frequently mutated genes in WES

Fuentes Fajardo KV, Adams D; NISC Comparative Sequencing Program, Mason CE, Sincan M, Tifft C, Toro C, Boerkoel CF, Gahl W, Markello T. Detecting false-positive signals in exome sequencing. Hum Mutat. 2012 Apr;33(4):609-13. doi: 10.1002/humu.22033. Epub 2012 Mar 5. PubMed PMID: 22294350; PubMed Central PMCID: PMC3302978.

Potential false positives = 2157 GENES !!! (e.g. MUCxx, HLA-xxx,

Annex 2: without a « normal » sample?

Annex 3: how to visualize variants?
