NGS, Cancer and Bioinformatics Exome-sequencing followed by Variant Calling.

29 January 2015 - NGS & Cancer training - Exome analysis

Overview of exome analysis (Galaxy workflow)

Workflow diagram, in order:
1. Reads (FASTQ)
2. Conversion to Galaxy format (FASTQ Groomer)
3. Quality control (FastQC)
4. Mapping (Bowtie2) against the reference genome (FASTA)
5. PCR duplicates marking (MarkDup)
6. Target intersection (Intersect BAM) with the target regions (BED)
7. Preprocess GATK part 1: local realignment around indels
8. Preprocess GATK part 2: Base Quality Score Recalibration
9. Aligned and preprocessed reads (BAM): marked PCR duplicates, intersected on target regions, realigned around indels, recalibrated

Training course "NGS & Cancer: Genomic Variant Analysis", 7-9 April 2014

Public dataset
• Accessible online on SRA (Sequence Read Archive): ERA148528
Ø Exome sequencing of 2 samples: tumor (lung cancer) and blood (normal sample)
Ø Publication: Ys et al., Genome Res. 2012 Mar;22(3):436-45
• 100 bp paired-end reads, Illumina HiSeq 2000
• Mean depth is higher for the tumor sample (~100X) than for the normal sample (~30X), to detect somatic variants with a low allelic frequency
• Aligned exome size: ~15 GB tumor; ~7 GB blood
• Complete analysis processing time: ~20 h
Ø Need to restrict the analysis to a few regions (~112 kb) in order to limit the processing time

Select libraries on Galaxy
1. Open your web browser and go to « https://galaxy.gustaveroussy.fr/galaxyprod » (or « http://galaxy.sb-roscoff.fr », depending on the Galaxy instance used for the session)
2. In the top menu, click on « Shared Data » then « Data libraries »
3. Click on « [FORMATION] Input Data » then « EXOME » (or on « canceropole-tp-input », depending on the instance)
4. Select « tumor_R1.fastq »; « tumor_R2.fastq »; « exome_regions.bed »; « known_sites_regions.vcf », then click on « Go »

FASTQ format conversion
1. Rename your history to « Tumor » by clicking on « Unnamed history ».
2. In the left panel, click on the « search tools » textbox and enter « FASTQ Groomer » (the tool is also listed under the « NGS: QC and manipulation » section), then click on it to convert both your FASTQ files into FASTQ Sanger format.
3. Click on « Execute » to launch the conversion.

GENERAL TIP: RENAME YOUR HISTORY ITEMS FREQUENTLY TO BE MORE EXPLICIT THAN « on data xxx »!

FASTQC: FASTQ quality control
1. In the left panel, click on the « search tools » textbox and enter « FASTQC: Read QC » (the tool is also listed under the « NGS: QC and manipulation » section)
2. Select the FASTQ Groomer dataset and click on « Execute »; repeat for both reads
Ø The result of FASTQC is an HTML page that you can view by clicking on the eye icon

FastQC metrics
• Look at the different metrics for both reads
• Problem: the per-base sequence quality of Read2 is quite low towards the end
• Solution: trim 25 bp from the 3' end of the reads
Ø Higher confidence in the sequenced information

FASTQ Trimmer
1. Use « FASTQ Trimmer » to cut off 25 bp from the 3' end of the Groomed reads (use the « search tools » textbox to find the tool)
2. Run « FASTQC » on the trimmed reads

If the distribution of bad-quality sequences/bases is more complex
Ø Use a more elaborate trimming step
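To make the two trimming strategies concrete, here is a minimal Python sketch. It is not what the Galaxy « FASTQ Trimmer » tool runs internally. The function names, the `min_q=20` threshold and the example read are assumptions chosen for illustration; qualities are assumed to be in Sanger (Phred+33) encoding, as produced by FASTQ Groomer.

```python
def trim_3prime(seq, qual, n=25):
    """Remove a fixed number of bases from the 3' end of a read."""
    if n <= 0:
        return seq, qual
    return seq[:-n], qual[:-n]


def quality_trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end until one reaches min_q (hypothetical threshold)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
seq = "ACGTACGTAC"
qual = "IIIIIIII##"

print(trim_3prime(seq, qual, n=3))     # fixed-length trimming
print(quality_trim_3prime(seq, qual))  # quality-based trimming
```

Fixed-length trimming matches the slide (25 bp off the 3' end of every read). The quality-based variant is one example of a more elaborate step: it only removes what is actually of low quality.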


Mapping with Bowtie2
1. Use « Bowtie2 » from the « Mapping » section to align reads on the hg19 genome
Ø Preset option: combination of parameters designed to give a good tradeoff between speed, sensitivity and accuracy

SAM/BAM aligned format
• SAM format: aligned format, human readable

Header lines:
@SQ SN:chr12 LN:133851895
@RG ID:Sample_ID LB:Sample_Library PL:ILLUMINA SM:Sample_Name PU:Platform_Unit

Alignment record, field by field:
Read name: ERR166338.1
Flag: 99
Chr: chr12
5' pos: 82670685
MAPQ: 23
CIGAR: 101M
Mate chromosome (paired): =
5' pos of the mate: 82670850
Insert size: 266
Sequence: GCCCCTGGGGATGTTTTGCACCAAGCCACTGTCTCCAGCTGG (shown truncated)
Base quality: BBC@GIIHGCFCIEHEAIEIFFGEONDNJFINIONHNGJNNNNKNJN
Tags: RG:Z:Sample_ID (group affiliation) XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 XA:Z

• BAM format: binary SAM format (not human readable, but compressed = smaller)
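As an illustration of how the SAM fields above can be read programmatically, the following Python sketch splits one tab-separated alignment line into its 11 mandatory fields and decodes the flag bits defined by the SAM specification. The function names are illustrative, and the sequence and quality strings in the example line are shortened placeholders rather than the full read.

```python
# Flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

SAM_KEYS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
            "rnext", "pnext", "tlen", "seq", "qual"]


def decode_flag(flag):
    """Return the meaning of every bit set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_KEYS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = fields[11:]
    return record


# Example line; SEQ and QUAL are shortened placeholders.
line = "\t".join(["ERR166338.1", "99", "chr12", "82670685", "23", "101M",
                  "=", "82670850", "266", "GCCCC", "BBC@G",
                  "RG:Z:Sample_ID", "NM:i:0"])
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"])
print(decode_flag(record["flag"]))
```

Flag 99 is 1 + 2 + 32 + 64: a paired read, properly paired, whose mate is on the reverse strand, and which is the first read of the pair.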

Mapping statistics (directly from the mapper)

Mapping statistics
• Use « Flagstat » from « Samtools » to see some mapping statistics
Ø % of mapped reads
Ø Properly paired reads:
- 0 <= insert size <= max size
- Reads on the same chromosome
- Reads facing each other
- Both reads are mapped
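The four « properly paired » criteria can be written as a small predicate. This is only a sketch of the logic listed above, not the exact rule applied by Bowtie2 or Samtools. The dictionary keys and the `max_insert=500` default are hypothetical, and the mate is assumed to be 101 bp long.

```python
def is_properly_paired(read, mate, max_insert=500):
    """Apply the four criteria: both mapped, same chromosome,
    facing each other, insert size within [0, max_insert]."""
    if read["unmapped"] or mate["unmapped"]:
        return False
    if read["chrom"] != mate["chrom"]:
        return False
    # Facing each other: leftmost read on the forward strand,
    # rightmost read on the reverse strand.
    left, right = sorted((read, mate), key=lambda r: r["pos"])
    if left["reverse"] or not right["reverse"]:
        return False
    insert_size = right["end"] - left["pos"] + 1
    return 0 <= insert_size <= max_insert


# Coordinates taken from the SAM example (1-based, inclusive).
read = {"chrom": "chr12", "pos": 82670685, "end": 82670785,
        "reverse": False, "unmapped": False}
mate = {"chrom": "chr12", "pos": 82670850, "end": 82670950,
        "reverse": True, "unmapped": False}

print(is_properly_paired(read, mate))
```

With these coordinates the insert size is 82670950 - 82670685 + 1 = 266, the value reported in the SAM record.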

Removing duplicates (not for targeted sequencing)
• Duplicate reads: different reads having the same sequence, caused by PCR amplification during sequencing library preparation
• The removal of duplicates depends on the application (not suitable for sequencing on small targets)
• Figure: PCR duplicate removal
• Galaxy: use “Mark Duplicates reads” from “NGS:Picard” to mark duplicates
• Galaxy: run “Flagstat” on the output BAM to see the number of PCR duplicates
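The idea behind duplicate marking can be sketched in a few lines of Python. This is a simplified illustration, not Picard's actual algorithm: reads sharing chromosome, start position, strand and mate position are grouped, and all but the one with the highest summed base quality are flagged. The dictionary keys and the toy reads are assumptions.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Flag all but the best-quality read in each group of reads
    sharing the same alignment coordinates."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["reverse"], read["mate_pos"])
        groups[key].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: sum(r["quals"]))
        for read in group:
            read["duplicate"] = read is not best
    return reads


reads = [
    {"name": "r1", "chrom": "chr12", "pos": 100, "reverse": False,
     "mate_pos": 300, "quals": [30, 30, 30]},
    {"name": "r2", "chrom": "chr12", "pos": 100, "reverse": False,
     "mate_pos": 300, "quals": [20, 20, 20]},
    {"name": "r3", "chrom": "chr12", "pos": 150, "reverse": False,
     "mate_pos": 350, "quals": [30, 30, 30]},
]

for read in mark_duplicates(reads):
    print(read["name"], "duplicate" if read["duplicate"] else "kept")
```

Duplicates are only marked, not deleted, so that « Flagstat » can still count them on the output BAM.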

Target intersection
• Use « Intersect BAM alignments with intervals » from « NGS:Bedtools » to keep only the reads mapped on the targeted regions
Ø Smaller BAM size
Ø The targeted regions must be in BED format (4 columns: chr; start; end; name)
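The intersection step amounts to an interval overlap test, sketched below in Python. This is not the Bedtools implementation. The region name `exon1` and its coordinates are made up for the example, and BED coordinates are treated as 0-based, half-open, whereas SAM positions are 1-based.

```python
def parse_bed(text):
    """Parse 4-column BED text into (chrom, start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions


def overlaps_target(chrom, start, end, regions):
    """True if the 0-based, half-open interval [start, end) overlaps a region."""
    return any(c == chrom and start < r_end and end > r_start
               for c, r_start, r_end, _ in regions)


# Hypothetical target region.
bed_text = "chr12\t82670000\t82671000\texon1\n"
regions = parse_bed(bed_text)

# A 101 bp read at SAM position 82670685 (1-based) spans
# [82670684, 82670785) in 0-based, half-open coordinates.
sam_pos, read_len = 82670685, 101
start = sam_pos - 1
print(overlaps_target("chr12", start, start + read_len, regions))
```

A linear scan over the regions is enough for a small target (~112 kb here). Larger target sets would call for sorted intervals or an interval tree.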

Group association
• Use « Add or Replace Groups » from « NGS:Picard » to associate a sample ID and a sequencing technology to the reads
Ø Mandatory for some tools (GATK) or in multi-sample analysis

Next session: overview of exome analysis

Variant calling pre-processing: GATK part 1
Ø Preprocess GATK part 1: local realignment around indels

Why realign around indels?
• Small insertions/deletions (indels) in reads (especially near the ends) can trick the mappers into mis-aligning with mismatches
Ø Alignment scoring: it is cheaper to introduce multiple single nucleotide variants (SNVs) than an indel, which induces a lot of false positive SNVs
• These artifactual mismatches can harm base quality recalibration and variant detection
• Realignment around indels helps improve the accuracy of several of the downstream processing steps

Local realignment identifies the most parsimonious alignment along all reads at a problematic locus
1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile
Ø Figure: the same reads can be read as consistent with the reference plus 3 adjacent SNPs, or as consistent with a 3 bp indel; realigning determines which is better
2. The score for an alternate consensus is the total sum of the quality scores of mismatching bases
3. If the score of the best alternate consensus is sufficiently better than the original alignments, then we accept the proposed realignment of the reads
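The scoring rule in step 2 can be illustrated with a toy example in Python. This is a sketch of the principle only, not GATK's implementation. The sequences, the uniform quality of 30 and the read offsets are invented, and reads are placed without gaps on each candidate sequence.

```python
def mismatch_score(read_seq, read_quals, consensus, offset):
    """Sum of the quality scores of the bases that mismatch the consensus."""
    score = 0
    for i, (base, q) in enumerate(zip(read_seq, read_quals)):
        if consensus[offset + i] != base:
            score += q
    return score


def total_score(placed_reads, consensus):
    """Total score of a pile of reads; lower means a better fit."""
    return sum(mismatch_score(seq, quals, consensus, offset)
               for seq, quals, offset in placed_reads)


reference = "AAACCCGGGTTT"
alternate = "AAAGGGTTT"   # reference with the 3 bp "CCC" deleted

# (sequence, qualities, ungapped offset) for each candidate sequence.
on_reference = [("AAGGGT", [30] * 6, 1), ("AAAGGG", [30] * 6, 0)]
on_alternate = [("AAGGGT", [30] * 6, 1), ("AAAGGG", [30] * 6, 0)]

ref_score = total_score(on_reference, reference)
alt_score = total_score(on_alternate, alternate)
print(ref_score, alt_score)
```

Against the reference the two reads show seven mismatches that look like clustered SNVs. Against the alternate consensus carrying the deletion they show none, so the realignment would be accepted.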

Three types of realignment targets
• Known sites:
Ø Common polymorphisms: dbSNP, 1000 Genomes
• Indels seen in the original alignments (in the CIGAR, indicated by I for insertion or D for deletion)
• Sites where evidence suggests a hidden indel (SNV abundance)
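Indels already present in the alignments can be located by walking through the CIGAR string. The Python sketch below uses the operation letters from the SAM specification; the function names and the example CIGAR strings are illustrative.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def indel_positions(pos, cigar):
    """Return (operation, reference position, length) for each I or D.

    pos is the 1-based leftmost position of the read. For a deletion the
    position is the first deleted base; for an insertion it is the
    reference base that follows the inserted sequence.
    """
    ref_pos = pos
    events = []
    for length, op in parse_cigar(cigar):
        if op in "ID":
            events.append((op, ref_pos, length))
        if op in "MDN=X":          # operations that consume the reference
            ref_pos += length
    return events


print(indel_positions(82670685, "101M"))   # no indel
print(indel_positions(100, "50M3D48M"))    # 3 bp deletion
print(indel_positions(100, "10M2I89M"))    # 2 bp insertion
```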

Known sites
https://www.broadinstitute.org/gatk/guide/tagged?tag=knownsites

Why are they important?

Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

2. Recommended sets of known sites per tool

Local realignment around indels
Ø Figure: read alignments before and after local realignment (annotated features: SNVs, deletion)

Indel realignment steps/tools
1. Identify what regions need to be realigned
Ø RealignerTargetCreator + known sites → Intervals
2. Perform the actual realignment (BAM output)
Ø IndelRealigner

Galaxy: Realigner Target Creator
• Use « Realigner Target Creator » from « GATK Tools » to detect intervals in need of local realignment
Ø « Add new Binding for reference-ordered data »
Ø Choose advanced GATK options
Ø « Add new Operate on Genomic intervals »

Galaxy: Indel Realigner
• Use « Indel Realigner » from « GATK Tools » to apply local realignment
Ø « Add new Binding for reference-ordered data »
Ø Choose advanced GATK options
Ø « Add new Operate on Genomic intervals »

Preprocess GATK: part 2
Ø Preprocess GATK part 2: Base Quality Score Recalibration

Why recalibrate base qualities?

Real data is messy, so properly estimating the evidence is critical

The quality scores issued by sequencers are inaccurate and biased
• Quality scores are critical for all downstream analysis
• Systematic biases are a major contributor to bad calls
• Example of sequence context bias in the reported qualities (figure: before and after recalibration)

Evidence of error covariates
• Analyze covariation among several features of a base, e.g.:
- Reported quality score
- Position within the read (machine cycle)
- Preceding and current nucleotides (chemistry effect)
- Sequencing technology...
• Adjust the quality score associated with each sequenced base to be more accurate
Ø Remove systematic biases

How are the covariates analyzed?
• Keep track of the number of observations and the number of times it was an error, as a function of various covariates:
- Typically stratify the data by lane, by original quality score, by machine cycle and by sequencing context
- Databases of known variants are used to discount most of the real genetic variation present in the sample
- All other differences are assumed to be errors
- Having done indel realignment first reduces noise
• Empirical error rate = (#mismatches + 1) / (#bases + 2), which is then Phred-scaled to give the quality score: Q = -10 × log10(error rate)

https://www.broadinstitute.org/gatk/guide/article?id=44
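The counting scheme and the Phred-scaled formula can be put together in a short Python sketch. It is a simplified illustration of the principle, not GATK's Base Recalibrator. The observation dictionaries, their keys and the choice of covariates in `tally` are assumptions made for the example.

```python
import math
from collections import defaultdict


def empirical_quality(mismatches, bases):
    """Phred-scaled quality from (#mismatches + 1) / (#bases + 2)."""
    error_rate = (mismatches + 1) / (bases + 2)
    return -10 * math.log10(error_rate)


def tally(observations, known_sites):
    """Count [mismatches, bases] per covariate bin, skipping known variant sites."""
    counts = defaultdict(lambda: [0, 0])
    for obs in observations:
        if (obs["chrom"], obs["pos"]) in known_sites:
            continue  # real variation is not counted as error
        key = (obs["reported_q"], obs["cycle"], obs["context"])
        counts[key][1] += 1
        if obs["base"] != obs["ref"]:
            counts[key][0] += 1
    return counts


# Toy observations: one mismatch at a known site, one elsewhere.
observations = [
    {"chrom": "chr12", "pos": 10, "base": "A", "ref": "A",
     "reported_q": 30, "cycle": 1, "context": "CA"},
    {"chrom": "chr12", "pos": 11, "base": "T", "ref": "G",
     "reported_q": 30, "cycle": 1, "context": "CA"},
    {"chrom": "chr12", "pos": 12, "base": "C", "ref": "T",
     "reported_q": 30, "cycle": 1, "context": "CA"},
]
known_sites = {("chr12", 12)}

for key, (mismatches, bases) in tally(observations, known_sites).items():
    print(key, mismatches, bases, round(empirical_quality(mismatches, bases), 1))

# A bin reported as Q30 with 10 mismatches in 10,000 bases:
print(round(empirical_quality(10, 10000), 1))
```

The +1 and +2 terms keep the estimate finite for empty or error-free bins: a bin with no observations gets an error rate of 1/2, that is, about Q3.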

BQSR steps/tools
1. Model the error modes and recalibrate qualities
Ø Base Recalibrator (formerly Count Covariates) → Covariates
2. Perform the actual recalibration (BAM output)
Ø Print Reads (formerly Table Recalibration)

Galaxy: Base Recalibrator
• Use « Base Recalibrator » from « GATK Tools » to recalibrate base quality scores
Ø « Add new Binding for reference-ordered data »
Ø Choose advanced GATK options
Ø « Add new Operate on Genomic intervals »

Galaxy: Print Reads
Ø « Add new Operate on Genomic intervals »

Final results
Ø Aligned and preprocessed reads (BAM): marked PCR duplicates, intersected on target regions, realigned around indels, recalibrated

Next step: somatic variant calling

Workflow diagram:
• Tumor: aligned and preprocessed reads (BAM) → Tumor Mpileup (Samtools Mpileup)
• Normal: aligned and preprocessed reads (BAM) → Normal Mpileup (Samtools Mpileup)
• Both Mpileups → Variant calling (VarScan Somatic) → Variant annotation (Annovar) → Variant selection (Select)

Using workflows in Galaxy: repeat the same steps for the blood sample

Repeat all these steps in a few clicks…
Ø Reads (FASTQ) → FASTQ Groomer → FastQC → Bowtie2 → MarkDup → Intersect BAM (target regions, BED) → GATK part 1 (local realignment around indels) → GATK part 2 (Base Quality Score Recalibration) → aligned and preprocessed reads (BAM)

Select libraries on Galaxy
1. In the top menu, click on « Shared Data » then « Data libraries »
2. Click on « [FORMATION] Input Data » then « EXOME » (or on « canceropole-tp-input », depending on the instance)
3. Select « normal_R1.fastq »; « normal_R2.fastq »; « exome_regions.bed »; « known_sites_regions.vcf »
4. Select « Import to Histories » then click on « Go »
5. Write a new history name and click on « Import library datasets »

Extract workflow from history
1. In the top menu, click on « Analyze Data » to return to the main frame
2. In the « History » panel, click on the wheel at the top, then on « Extract Workflow »
3. Write a name for your workflow, then click on « Create Workflow »

Edit your workflow
1. In the top menu, click on « Workflow » to return to the main frame
2. Click on your new workflow, then select « Edit »
3. Identify and rename the « Input dataset » boxes corresponding to R1/R2/BED/VCF
4. Each box represents a tool with parameters that you can modify by clicking on it
Ø Click on the « Add or Replace Groups » box and change the read group ID and name (you can also choose to set them at runtime)
Ø Check the input of the « Intersect BAM » box (and of the « flagstat » box): it should be the BAM output from « Mark Duplicates »
5. Save your edited workflow by clicking on the wheel, then « Save »

Run your workflow on the blood sample
1. A workflow can only be run on data present in the current history
Ø Click on the wheel at the top of your history panel and select « Saved Histories »
Ø Click on the « Normal » history and click on « Switch »
Ø Go back to the « Workflow » page (top panel), then click on your workflow, then on « Run »
2. Check that all your input files are correct (step 1: bed, step 2: vcf, step 3: R1, step 4: R2)
3. Click on « Run workflow » at the bottom of the page

Let Galaxy work for you!

Next step: somatic variant calling with VarScan2
Ø Tumor BAM and Normal BAM → Samtools Mpileup (one per sample) → VarScan Somatic → Annovar → Select

Variant calling
• Factors to consider when calling an SNV:
– Base call qualities of each supporting base (base quality)
– Proximity to small indels or homopolymer runs
– Mapping qualities of the reads supporting the SNV
– Sequencing depth: >=30x for constitutional (germline) samples; >=100x for tumor
– SNV position within the reads: higher error rate at the read ends
– Strand bias (SNVs supported by only one strand are more likely to be artifactual)
– Allelic frequency: tumor cellularity will reduce the % of a heterozygous variant
• Higher stringency when calling indels (Sanger validation is often needed)
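Two of these factors lend themselves to a quick calculation, sketched below in Python. The cellularity formula assumes a clonal heterozygous variant in a diploid region with no copy number change, which is a simplification. The function names are illustrative, and the depth thresholds are the ones quoted on the slide.

```python
def expected_het_vaf(cellularity):
    """Expected allele fraction of a clonal heterozygous somatic variant,
    assuming a diploid region without copy number change."""
    return 0.5 * cellularity


def single_strand_only(alt_forward, alt_reverse):
    """True when the variant is supported by reads from one strand only."""
    return (alt_forward == 0) != (alt_reverse == 0)


def depth_is_sufficient(depth, is_tumor):
    """Depth thresholds from the slide: >=100x tumor, >=30x constitutional."""
    return depth >= (100 if is_tumor else 30)


for cellularity in (1.0, 0.6, 0.4):
    print(cellularity, expected_het_vaf(cellularity))

print(single_strand_only(alt_forward=7, alt_reverse=0))
print(depth_is_sufficient(45, is_tumor=True))
```

With 40% tumor cells, a heterozygous somatic variant is expected at about 20% of reads rather than 50%, which is why a higher depth is needed for the tumor sample.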

Depth of coverage
• Depth of coverage = number of reads supporting one position, e.g. 1X, 5X, 100X… >1000X
Ø Figure: reads aligned to the reference genome, with reference bases, SNVs and sequencing errors marked. Example positions at 2X, 7X (5X on the + strand, 2X on the - strand) and 17X (9X on the + strand, 8X on the - strand) illustrate homozygous SNVs (100% of reads), heterozygous SNVs (50% of reads) and sequence-context errors; calling confidence increases from low (---) to high (+++) with depth
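Depth, allele fraction and per-strand support at one position can be computed as in the Python sketch below. The input is a simplified list of (base, strand) observations rather than a real pileup line, and the example counts (5 reads on the + strand, 2 on the - strand) are taken from the 7X case for illustration.

```python
from collections import Counter


def summarize_position(ref_base, observations):
    """Depth, most frequent alternate allele, its fraction and strand support.

    observations is a list of (base, strand) pairs, strand being '+' or '-'.
    """
    depth = len(observations)
    alt_counts = Counter(base for base, _ in observations if base != ref_base)
    if not alt_counts:
        return {"depth": depth, "alt": None, "alt_fraction": 0.0,
                "alt_plus": 0, "alt_minus": 0}
    alt, n_alt = alt_counts.most_common(1)[0]
    plus = sum(1 for base, strand in observations
               if base == alt and strand == "+")
    return {"depth": depth, "alt": alt, "alt_fraction": n_alt / depth,
            "alt_plus": plus, "alt_minus": n_alt - plus}


# 7X position where every read carries T instead of the reference C.
homozygous = [("T", "+")] * 5 + [("T", "-")] * 2
print(summarize_position("C", homozygous))

# 2X position: too shallow to call with any confidence.
shallow = [("T", "+"), ("C", "-")]
print(summarize_position("C", shallow))
```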

VarScan2

• Mutation caller written in Java (no installation required) that works with pileup files from targeted, exome, and whole-genome sequencing data (DNA-seq or RNA-seq)

• Multi-platform: Illumina, SOLiD, Life/PGM, Roche/454

• Detection of different kinds of germline SNVs/indels (classical mode):
Ø Variants in individual samples
Ø Multi-sample variants, shared or private, in multi-sample datasets

• VarScan's specificity is its ability to work with tumor/normal pairs (somatic mode):
Ø Somatic and germline mutations, LOH events in tumor-normal pairs
Ø Somatic copy number alterations (CNAs) in tumor-normal exome data

VarScan2

• Most published variant callers (e.g. GATK) use Bayesian statistics (a probabilistic framework) to detect variants and assess confidence in them

• VarScan uses a robust heuristic/statistical approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance

• Stead et al. (2013) compared three somatic callers: MuTect, Strelka, and VarScan2
Ø VarScan2 performed best overall, with sequencing depths of 100x, 250x, 500x and 1000x required to accurately identify variants present at 10%, 5%, 2.5% and 1% respectively
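One way to read the depth and frequency pairs quoted above is to compute the number of variant-supporting reads expected at each depth. The sketch below is only arithmetic on those four pairs (expected reads = depth x allele frequency); it is not taken from Stead et al., and the helper name `expected_variant_reads` is hypothetical.

```python
# Depth / variant allele frequency pairs quoted in the bullet above.
DEPTH_VAF_PAIRS = [(100, 0.10), (250, 0.05), (500, 0.025), (1000, 0.01)]


def expected_variant_reads(depth: int, vaf: float) -> float:
    """Average number of reads expected to carry the variant."""
    return depth * vaf


for depth, vaf in DEPTH_VAF_PAIRS:
    reads = expected_variant_reads(depth, vaf)
    print(f"{depth}x at {vaf:.1%} -> about {reads:.1f} variant reads expected")
```

In all four cases the expected number of supporting reads is roughly 10 to 12.5, which suggests that the required depth scales inversely with the allele frequency to be detected.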

Common history

1. In the wheel menu of the history panel, select « Copy datasets »
2. Select the preprocessed BAM from the Normal and Tumor histories and « exome_regions.bed »
3. Edit each BAM's attributes by clicking on the little pen
4. For clarity, rename your BAMs to « Normal.bam » and « Tumor.bam », then « Save » the changes


Mpileup

1. Use « Mpileup » from « NGS: Samtools » to create pileup files (repeat for the tumor and normal samples)

Note: anomalous read pairs are due to the restriction of the exome to a region
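Outside Galaxy, the same step corresponds to a `samtools mpileup` call. The following is a sketch that only builds the command line; the file names (`reference.fa`, `Normal.bam`, `Tumor.bam`) and the helper `mpileup_command` are assumptions for illustration. The flags used are `-f` (reference FASTA), `-l` (BED file of regions) and `-A` (count anomalous read pairs). Whether the course enabled `-A` is not stated on the slide, so it is exposed as a parameter.

```python
import shlex


def mpileup_command(bam, reference_fasta, regions_bed=None,
                    count_anomalous_pairs=True):
    """Build (but do not run) a samtools mpileup command line."""
    cmd = ["samtools", "mpileup", "-f", reference_fasta]
    if regions_bed:
        cmd += ["-l", regions_bed]   # restrict the pileup to the target regions
    if count_anomalous_pairs:
        cmd.append("-A")             # do not skip anomalous read pairs
    cmd.append(bam)
    return cmd


# One pileup per sample; samtools writes the pileup to standard output.
for sample in ("Normal", "Tumor"):
    cmd = mpileup_command(f"{sample}.bam", "reference.fa", "exome_regions.bed")
    print(shlex.join(cmd), ">", f"{sample}.pileup")
```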

Pileup format

• Describes the base-pair information at each position

Columns: chromosome, position, reference base, number of reads covering the site (total depth), read bases, base qualities

[Example: two pileup lines, chr12 112888238 (reference base A, depth 108) and chr12 112888239 (reference base C, depth 108), each followed by its read bases and base qualities]

Read bases:
. / , = match on forward/reverse strand
ACGTN / acgtn = mismatch on forward/reverse strand
[+-][0-9]+[ACGTNacgtn]+ indicates an indel (+ insertion, - deletion)
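The read-bases column can be decoded with a short parser. This is a minimal sketch, not the code used by Samtools or VarScan: the function name `summarize_read_bases` is hypothetical and the input string in the usage example is made up. Besides the symbols listed above, it skips `^` followed by one mapping-quality character (start of a read) and `$` (end of a read).

```python
import re
from collections import Counter

_DIGITS = re.compile(r"[0-9]+")


def summarize_read_bases(read_bases: str) -> Counter:
    """Count matches, mismatches and indels in a pileup read-bases string."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":                      # read start, next char = mapping quality
            i += 2
            continue
        if c in "+-":                     # indel: sign, length, inserted/deleted bases
            m = _DIGITS.match(read_bases, i + 1)
            if m is None:
                raise ValueError(f"malformed indel at offset {i}")
            counts["insertion" if c == "+" else "deletion"] += 1
            i = m.end() + int(m.group())
            continue
        if c == ".":
            counts["match_forward"] += 1
        elif c == ",":
            counts["match_reverse"] += 1
        elif c in "ACGTN":
            counts["mismatch_forward"] += 1
        elif c in "acgtn":
            counts["mismatch_reverse"] += 1
        i += 1                            # '$' and other symbols are skipped
    return counts


print(summarize_read_bases(".$,^F,Tt+2AG.-1c,"))
```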


VarScan Somatic

1. Use « VarScan somatic » from « Varscan » to detect variants
• Min-var-freq: minimal allelic frequency to call a variant (10% here)
• Min-coverage: minimum coverage to call a variant (in normal and tumor, and combined)
• Tumor and normal purity: cellularity of your sample

Ø 2 output files: SNVs & Indels in VCF format
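For readers working on the command line rather than in Galaxy, the parameters above map onto options of the `somatic` command of VarScan2. The sketch below only assembles the argument list. The jar path, pileup file names, output basename, coverage and purity values are illustrative assumptions; only the 10% minimum variant frequency comes from the slide. The helper `varscan_somatic_command` is hypothetical.

```python
import shlex


def varscan_somatic_command(normal_pileup, tumor_pileup, output_basename,
                            min_var_freq=0.10, min_coverage=8,
                            normal_purity=1.0, tumor_purity=1.0,
                            jar="VarScan.jar"):
    """Build (but do not run) a VarScan2 somatic command line."""
    return [
        "java", "-jar", jar, "somatic",
        normal_pileup, tumor_pileup, output_basename,   # normal comes first
        "--min-var-freq", str(min_var_freq),
        "--min-coverage", str(min_coverage),
        "--normal-purity", str(normal_purity),
        "--tumor-purity", str(tumor_purity),
        "--output-vcf", "1",                            # write VCF instead of tabulated output
    ]


cmd = varscan_somatic_command("Normal.pileup", "Tumor.pileup", "somatic_calls")
print(shlex.join(cmd))
```

VarScan writes one file for SNVs and one for indels, both named after the output basename, which matches the two output files mentioned above.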

VarScan VCF format

• 2 types:
Ø VCF (specific to VarScan)
Ø Tabulated (available only for VarScan in classical mode)

• VarScan VCF format: classic VCF header (#) but specific variant lines

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr12 250239 . A G 20 PASS DP=115;SOMATIC;SS=2;SSC=21;GPV=1;SPV=6.3E-3 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:52:48:0:0%:15,33,0,0 0/1:.:63:50:8:13.79%:19,31,3,5

GT = Genotype (1/1: homozygous; 0/1: heterozygous)
GQ = Genotype Quality
SS = Somatic Status (0 = ref; 1 = Germline; 2 = Somatic; 3 = LOH; 5 = Unknown)
DP = Quality read depth of bases with Phred score >= BAPQ
RD = Depth of reference-supporting bases
AD = Depth of variant-supporting bases
FREQ = Variant allele frequency
DP4 = Ref/FWD, Ref/REV, Alt/FWD, Alt/REV
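The per-sample columns can be decoded by pairing the FORMAT keys with the values of each sample. The sketch below applies this to the tumor column of the chr12:250239 example and checks that FREQ agrees with AD / (RD + AD). The function name `parse_sample` and the output keys are hypothetical.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Decode one VarScan sample column using the FORMAT column as keys."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in values["DP4"].split(","))
    return {
        "genotype": values["GT"],
        "depth": int(values["DP"]),
        "ref_reads": int(values["RD"]),
        "alt_reads": int(values["AD"]),
        "freq": float(values["FREQ"].rstrip("%")) / 100,
        "alt_forward": alt_fwd,
        "alt_reverse": alt_rev,
    }


tumor = parse_sample("GT:GQ:DP:RD:AD:FREQ:DP4", "0/1:.:63:50:8:13.79%:19,31,3,5")
recomputed = tumor["alt_reads"] / (tumor["ref_reads"] + tumor["alt_reads"])
print(tumor["genotype"], tumor["alt_reads"], f"{recomputed:.2%}")
# Both strands support the variant (3 forward, 5 reverse reads).
print(tumor["alt_forward"], tumor["alt_reverse"])
```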

VarScan Tabulated Format

Chrom  Position   Ref  Cons  Reads1  Reads2  VarFreq  Strands1  Strands2  Qual1  Qual2  Pvalue  MapQual1  MapQual2  R1+  R1-  R2+  R2-  Alt
chr12  113348849  C    Y     31      30      49.18%   2         2         27     27     0.98    1         1         19   12   25   5    T
chr12  113354329  G    R     72      2       2.70%    2         2         31     26     0.98    1         1         48   24   1    1    A
chr12  113357193  G    A     2       72      97.30%   1         2         28     24     0.98    1         1         2    0    45   27   A
chr12  113357209  G    A     0       77      100%     0         2         0      29     0.98    0         1         0    0    51   26   A

Cons: consensus genotype of the variant called (IUPAC code):

M -> A or C
R -> A or G
W -> A or T
S -> C or G
Y -> C or T
K -> G or T
V -> A or C or G
H -> A or C or T
D -> A or G or T
B -> C or G or T
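The IUPAC codes of the Cons column can be expanded programmatically to recover the variant allele. A minimal sketch, in which the names `IUPAC` and `variant_alleles` are hypothetical:

```python
# IUPAC nucleotide codes, as listed in the legend above (plus the plain bases).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}


def variant_alleles(ref: str, cons: str) -> list:
    """Alleles of the consensus genotype that differ from the reference base."""
    return sorted(set(IUPAC[cons]) - {ref})


# (Ref, Cons) pairs taken from the chr12 rows of the tabulated example.
for ref, cons in (("C", "Y"), ("G", "R"), ("G", "A")):
    print(ref, cons, variant_alleles(ref, cons))
```

For the first row (Ref C, Cons Y) this gives T, in agreement with the Alt column.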

Variant Annotation

Different types of SNVs

• SNVs and short indels are the most frequent events:
Ø Intergenic
Ø Intronic
Ø cis-regulatory
Ø Splice sites
Ø Frameshift or not
Ø Synonymous or not
Ø Benign or damaging, etc.

• Example of an SNV one wants to pinpoint:
Ø Non-synonymous + highly deleterious + somatically acquired

Resources dedicated to human genetic variations

• dbSNP and 1000 Genomes
Ø Population-scale DNA polymorphisms
• COSMIC
Ø Catalogue Of Somatic Mutations In Cancer
• Non-synonymous SNV predictions
Ø SIFT, PolyPhen2 (damaging impact)... PhyloP, GERP++ (conservation)

-> ANNOVAR
• Tool to annotate genetic variations

Annovar

Use « Annovar » to annotate SNVs and Indels

Input: multi-sample VCF (contains tumor & normal samples)

Ø RefGene: gene, function and amino acid change (HGVS format: c.A155G ; p.Lys45Arg)
Ø 1000g2012apr_all: minor allele frequency for all ethnic groups
Ø ESP6500: Exome Sequencing Project
Ø Ljb_all: predictions (SIFT, Polyphen2, LRT, MutationTaster, PhyloP, GERP++)

Output: tabulated file

Variant selection

Select variants predicted as somatic

• Use the « Select » tool from « Filter and Sort » to select only the lines matching the pattern « SOMATIC »
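The selection step amounts to keeping the lines that contain the pattern. A minimal sketch of the same filter outside Galaxy follows; the function name `select_matching` is hypothetical, and the second input line (a germline call with SS=1) is invented for the example.

```python
import re


def select_matching(lines, pattern="SOMATIC"):
    """Keep only the lines in which the regular expression is found."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


lines = [
    # Somatic call from the VarScan VCF example.
    "chr12\t250239\t.\tA\tG\t20\tPASS\tDP=115;SOMATIC;SS=2;SSC=21;GPV=1;SPV=6.3E-3",
    # Hypothetical germline call, made up for illustration.
    "chr12\t250300\t.\tC\tT\t20\tPASS\tDP=98;SS=1;SSC=0;GPV=1E-10;SPV=1",
]

for line in select_matching(lines):
    print(line)
```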

Appendix 1: Frequently mutated genes in WES

Fuentes Fajardo KV, Adams D; NISC Comparative Sequencing Program, Mason CE, Sincan M, Tifft C, Toro C, Boerkoel CF, Gahl W, Markello T. Detecting false-positive signals in exome sequencing. Hum Mutat. 2012 Apr;33(4):609-13. doi: 10.1002/humu.22033. Epub 2012 Mar 5. PubMed PMID: 22294350; PubMed Central PMCID: PMC3302978.

Potentially false positives = 2157 genes! (e.g. MUCxx, HLA-xxx,

Appendix 2: without a « normal » sample?

Appendix 3: how to visualize variants?