Zebrafish Illumina Data Analysis Protocol

Accompanying the publication:
Efficient Mapping and Cloning of Mutations in Zebrafish by Low Coverage Whole Genome Sequencing
Margot E. Bowen & Katrin Henke, Kellee Siegfried-Harris, Matthew L. Warman, Matthew P. Harris. Genetics, 2012

This protocol describes the steps required to analyze low coverage Illumina whole genome sequence data from a pool of F2 mutants. It uses whole genome sequence data from the Tü, WIK, TLF and AB laboratory strains to:
(1) facilitate the accurate identification of common polymorphisms present in a mutant pool, which can be used for homozygosity mapping;
(2) identify the likely parental origin of alleles in a mutant pool, which can be used for mapping based on strain specific signatures;
(3) within the linked interval, distinguish common polymorphisms from variants unique to the mutant pool, which will be candidates for the causative mutation.

Contents

1. Software requirements
2. Downloading datasets
3. Command line usage
4. Preparing the reference genome
5. Aligning reads to the reference genome
6. Preparing BAM files
7. Variant calling
8. Rough Mapping
9. Narrowing down the linked interval
10. Identifying candidate mutations
11. Identifying candidate causative mutations in regions covered by only one read
12. Identifying candidate causative mutations: indels
13. Appendix 1: visualizing the data used for calculating the mapping score

1. Software requirements

The following publicly available software should be installed:

1. Any suitable alignment software, such as Novoalign (http://www.novocraft.com/)
2. SAMtools (http://samtools.sourceforge.net/)
3. Picard (http://sourceforge.net/projects/picard/)
4. Genome Analysis Toolkit (GATK) (http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit)
5. Integrative Genomics Viewer (IGV) (http://www.broadinstitute.org/igv/)
6. Annovar (http://www.openbioinformatics.org/annovar/)

2. Downloading datasets

The following datasets should be downloaded from http://fishyskeleton.com/:

1. Whole genome sequence data for six reference strain pools
   Folder name: aligned_sequences
   Example file names: TLF_chr1.bam, TLF_chr1.bam.bai
   Illumina sequencing reads (100bp single end) for the Tü, WIK, AB and TLF laboratory strains, aligned to the zebrafish reference genome (Zv9/danRer7). For the Tü and WIK strains, sequencing was performed separately on fish maintained in Tübingen, Germany (referred to as TUG and WKG) and fish maintained at Children’s Hospital, Boston (referred to as TUB and WKB). Each aligned sequence file (in the BAM file format) is paired with an index file (.bai).

2. Coordinates of repetitive sequences
   Folder name: repeat_mask
   Example file name: danRer7_repeatmask_chr1.bed
   Coordinates of all interspersed repeats and low complexity sequences in the zebrafish genome. This is the output of RepeatMasker (http://www.repeatmasker.org/), converted to the BED file format.

3. Physical coordinates of cM intervals
   Folder name: MGH_map
   Example file name: chr1_cM
   Estimated coordinates for each 0.25 cM interval in the genome. These estimates are based on the genetic distance of markers in the MGH meiotic map (http://zfin.org/zf_info/downloads.html#map) that have been mapped to the Zv9 reference genome.

4. Zebrafish genes and dbSNPs
   Folder name: zebradb
   Example file name: danRer7_ensGene.txt
   Coordinates and sequences of Ensembl and RefSeq zebrafish genes (http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/database/) and dbSNPs (http://useast.ensembl.org/info/data/ftp/index.html), formatted for use in the Annovar software.

5. Coding exon bases
   Folder name: coding_exon_bases
   Example file name: coding_bases_chr1
   The genomic coordinate of each coding exon base, as well as bases up to 15 bp upstream and downstream of each coding exon. This list was constructed from the merged dataset of all RefSeq and Ensembl coding exons.

6. Perl scripts
   Folder name: perl
   Example file name: mapping_score.pl
   Perl scripts used for mapping and candidate mutation identification.

3. Command line usage

Because the same command often has to be run on multiple files, this protocol uses wild cards and for loops. The wild card symbol “*” represents any character or characters in a file name, while the wild card symbol “?” represents any single character in a file name. For loops can, for example, be used to run the same command on files from all 25 chromosomes:

    for f in {1..25} ; do echo $f ; command chr"$f" ; done

Users unfamiliar with these concepts are encouraged to consult an introductory Unix text.

4. Preparing the reference genome

Some programs, such as the GATK, require the reference genome to have the .fasta extension, and require a .dict dictionary file and a .fai fasta index file to be present in the same folder as the reference genome. See the GATK website for more details:
http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_the_essential_GATK_input_files:_the_reference_genome

1. Download the zebrafish genome sequence
   The zebrafish genome sequence, danRer7.fa, can be downloaded from the UCSC Genome Bioinformatics site (http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/bigZips/).

2. Rename the reference genome to .fasta

    mv danRer7.fa danRer7.fasta

3. Create a fasta sequence dictionary file, using the CreateSequenceDictionary.jar file from Picard

    java -jar CreateSequenceDictionary.jar R=danRer7.fasta O=danRer7.dict

4. Create a fasta index file (.fai) using SAMtools

    samtools faidx danRer7.fasta

5. Aligning reads to the reference genome

Alignment can be performed using any suitable alignment software. As an example, the commands for Novoalign are provided:

1. Prepare the reference genome for use in Novoalign

    novoindex danRer7_novoindex danRer7.fasta

2. To speed up the alignment, the input file can first be split into multiple smaller files

    split -l 5000000 sequence.txt splitsequence_

   This splits the input fastq file (sequence.txt) into files of 5 million lines each (1.25 million reads), named splitsequence_aa, splitsequence_ab, splitsequence_ac, etc.

3. Align the reads to the reference genome using Novoalign

    for f in splitsequence_* ; do echo $f ; novoalign -f $f -d danRer7_novoindex -o SAM -F ILMFQ -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG > "$f".sam ; done

   3’ adaptor trimming (-a) is necessary if a significant fraction of library fragments is expected to have an insert size shorter than the Illumina read length. For example, libraries prepared from degraded genomic DNA in our laboratory had a significant fraction of fragments with an insert size below 100bp. When Illumina 100bp sequencing is performed on these fragments, the 3’ end of some reads will correspond to the adaptor sequence, which should be removed before alignment to the reference genome. The sequence listed above corresponds to the Illumina paired end adaptor used in our laboratory (which can be used for both paired and single end sequencing).

6. Preparing BAM files

1. Convert SAM files to BAM files

    for f in splitsequence_*.sam ; do echo $f ; samtools import danRer7.fasta $f "$f".bam ; done

2. Sort reads based on their location in the genome

    for f in splitsequence_*.sam.bam ; do echo $f ; samtools sort $f "$f".sort ; done

3. Merge multiple BAM files

    samtools merge -rh readgroup.txt merge_mutant1.bam splitsequence_*.sam.bam.sort.bam

   For each read, a read group tag will be added that specifies which input file contained that read. Some programs, such as the GATK, require that all reads contain a read group tag. The readgroup.txt file must first be created as a text file. It should contain one line for each input file, with fields separated by tabs. Shown here are the first three lines:

    @RG ID:splitsequence_aa.sam.bam.sort SM:mutant1 LB:mutant1 PL:Illumina
    @RG ID:splitsequence_ab.sam.bam.sort SM:mutant1 LB:mutant1 PL:Illumina
    @RG ID:splitsequence_ac.sam.bam.sort SM:mutant1 LB:mutant1 PL:Illumina

   The ID tag must contain the input file name, excluding the final .bam extension (for example, splitsequence_aa.sam.bam.sort for the input file splitsequence_aa.sam.bam.sort.bam).
   The SM tag must contain the name of the sample (for example, mutant1).
   The LB tag lists the name of the library. If only one library was made for each mutant, the library name will be the same as the sample name.
   The PL tag contains the name of the sequencing platform.

4. Index the merged BAM file

    samtools index merge_mutant1.bam

5. Remove PCR duplicates

    java -jar picard-tools/MarkDuplicates.jar I=merge_mutant1.bam O=merge_mutant1.PCR.bam METRICS_FILE=merge_mutant1_PCRmetrics ASSUME_SORTED=true REMOVE_DUPLICATES=true

6. Index the merged, PCR duplicate removed file

    samtools index merge_mutant1.PCR.bam

7. Create a file for each chromosome

    for f in {1..25} ; do echo $f ; samtools view -b -o mutant1_chr"$f".bam merge_mutant1.PCR.bam chr"$f" ; done

   To decrease the computational load, downstream analyses are performed on one chromosome at a time.

8. Index the BAM file for each chromosome

    for f in mutant1_chr*.bam ; do echo $f ; samtools index $f ; done

9. Determine the depth of coverage for each chromosome

    for f in {1..25} ; do echo $f ; java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R danRer7.fasta -I mutant1_chr"$f".bam -o depth_mutant1_chr"$f" -L chr"$f" --omitDepthOutputAtEachBase ; done

7. Variant calling

1. Perform variant calling on all mutants and reference
