Woods Hole 2012 Zebrafish Bioinformatics Lab

All materials can be downloaded from

http://faculty.ithaca.edu/iwoods/docs/wh/ (or) http://goo.gl/1bnOF

Ian G. Woods, August 2012

Task 1: High resolution mapping, sequencing, and expression

Overview: From a rough map position, refine the critical interval via (virtual) high resolution mapping with additional markers. Query the critical interval in the zebrafish genome for potential candidate genes. Find expression patterns online for these candidates. Design primers to sequence candidate genes for the mutagenic lesion or for additional SNPs to use in mapping.

Task 2: Clone candidate enhancer/promoter sequences to create a transgenic reporter line

Overview: Identify the translational start site of a gene of interest. Obtain ~6kb of sequence upstream of this site. Design PCR primers that will amplify this region, and clone it in-frame with GFP in a tol2 expression vector. Identify BACs for use in creating reporter constructs via homologous recombination or gap repair. Identify evolutionarily conserved sequences from other organisms to uncover potential regulatory regions around your gene of interest.

Task 3: Morpholinos, rescue, and expression

Overview: Find the zebrafish ortholog of your favorite gene. Find its location in the genome, locate the ATG, and identify the exon-intron boundaries. Design two 25-mer morpholino sequences that target (1) the ATG and (2) an exon-intron boundary. Identify an orthologous gene in another species for use in rescue experiments to control for morpholino specificity. Align this sequence with your morpholinos to determine degree of potential activity. Obtain a full-length clone of the zebrafish gene (via RT-PCR or clone collections) for use in overexpression experiments or expression analyses via in situ hybridization.

Task 4: Identifying zebrafish transcripts via batch sequence retrieval and BLAST

Overview: Mine OMIM (Online Mendelian Inheritance in Man) for genes related to Sonic Hedgehog. Get amino acid sequences for these genes, and identify (via BLAST) the zebrafish orthologs for these proteins. Use a simple script to parse the BLAST results to see where the genes are located in the zebrafish genome. Finally, find out where a few of these genes are expressed (via ZFIN).

Requirements: UNIX terminal, perl (both native on MacOSX)

Detailed protocols and results

Task 1: High resolution mapping, sequencing, and expression

Overview: From a rough map position, find more SSLP markers to test for polymorphisms in your mapping cross. Query the critical interval in the zebrafish genome for potential candidate genes. Find expression patterns online for these candidates. Design primers to sequence candidate genes for the mutagenic lesion or for additional SNPs to use in mapping.

ZFIN home http://www.zfin.org

Click on “Genetic Maps” ZFIN Genetic Maps Browser

Uncheck all but “MGH” and enter marker symbol ZFIN MGH map viewer

Zoom out as far as possible ZFIN Map Viewer

Z3057 is at ~38 cM. Z13936 is at ~85 cM.

Plenty of markers to follow up on.

Find primer sequences for Z15270 Find additional SSLPs to test http://www.ensembl.org

enter Z15270 into the search box and hit go Follow the links . . . Ensembl Marker Report Viewer Ensembl View

Zoom out a bit Configuring Tracks in Ensembl Zoom out in region

what kind of gene is LOC563432? Ensembl gene view

Click the ‘Orthologues’ link Ensembl gene view

Scroll down a bit . . . looks like a phosphodiesterase Record for rras2

links to expression pattern – click on External References Link out to ZFIN ZFIN expression data

Click on ‘Directly submitted expression data’ link ZFIN expression data

rras2 expression – consistent with a role in muscle development? ZFIN home http://www.zfin.org

Follow link for Genes / Markers / Clones ZFIN gene search

Type in rras2 ZFIN gene search results

Click on “1 Gene” ZFIN gene record

Scroll down . . . . ZFIN gene record

Follow link for RNA sequence GenBank gene record

Scroll down . . . . GenBank gene record

Copy Sequence to Clipboard – note cds starts at 240 UCSC Genome Home

Click on the “BLAT” tab UCSC BLAT search

Paste in your sequence, and select “Zebrafish” from the Genome menu UCSC BLAT results

Follow “details” link for the top hit UCSC BLAT results

Light blue = exon boundaries Select about 600-800b of genomic sequence around the first exon Primer3 Home http://frodo.wi.mit.edu/primer3/

Choose size range of 500-600 Primer3 Results BLAST 2 Sequences http://blast.ncbi.nlm.nih.gov/Blast.cgi... ‘nucleotide blast’

Paste in wildtype and mutant sequences BLAST 2 Result

Scroll down . . . BLAST 2 Result

There are two SNPs . . . dCAPS Home http://helix.wustl.edu/dcaps/dcaps.html

Paste about 40b of wildtype and mutant sequence in flanking each SNP dCAPS results – SNP1

Not much luck, but can introduce differential restriction sites via primers dCAPS results – SNP2

Plenty of enzymes from which to choose Do the SNPs affect coding?

BLAST mutant gDNA vs. cDNA Do the SNPs affect coding?

Query = mutant sequence; Subject = GenBank RefSeq. SNPs are not in coding sequence.

Task 2: Clone candidate enhancer/promoter sequences to create a transgenic reporter line

Overview: Identify the translational start site of the gene of interest. Obtain 6kb of sequence upstream of this site. Design PCR primers that will amplify this region, and clone it in-frame with GFP in a tol2 expression vector. Identify BACs for use in creating reporter constructs via homologous recombination. Identify evolutionarily conserved sequences from other organisms to uncover potential regulatory regions around your gene of interest.

ZFIN home http://www.zfin.org

Follow link for Genes / Markers / Clones ZFIN Gene Search

type in ‘scube2’ ZFIN Search Result

Follow link for Gene ZFIN Gene record

Scroll down a bit . . . ZFIN Gene record

Zfin localizes this gene to Chr. 7 GenBank Gene record

Scroll down . . . GenBank Gene record

Find the heading for “CDS” = coding sequence GenBank Gene record

The “atg” (translational start) is at #106 in this mRNA sequence Ensembl Zebrafish http://www.ensembl.org/Danio_rerio

Enter scube2 into the search box Ensembl text search result

Click on “Location” Ensembl Browser – scube2

The transcript is going to the ‘left’ Genetic vs. physical distance

Z15270 ~ 28,880,000 scube2 ~ 29,900,000 Physical distance ~ 1,000,000

Genetic distance = 0.1 cM. Total genome = 3000 cM = 1.7 x 10^9 bp, so about 560,000 bp / cM

Genetic distance predicts ~ 56,000 bp away
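The comparison above is simple arithmetic. A minimal Python sketch, using only the figures quoted in the text (the two approximate positions, 3000 cM, and 1.7 x 10^9 bp), might look like the following; the function and variable names are illustrative and not part of the original lab.

```python
# Figures quoted in the text; names are illustrative, not from the original lab.
GENOME_BP = 1.7e9        # approximate zebrafish genome size in bp
GENOME_CM = 3000         # approximate total genetic map length in cM


def bp_per_cm(genome_bp=GENOME_BP, genome_cm=GENOME_CM):
    """Genome-wide average number of base pairs per centimorgan."""
    return genome_bp / genome_cm


def predicted_bp(distance_cm):
    """Physical distance expected from a genetic distance at the average rate."""
    return distance_cm * bp_per_cm()


marker_pos = 28_880_000  # Z15270, approximate position quoted above
gene_pos = 29_900_000    # scube2, approximate position quoted above

physical = abs(gene_pos - marker_pos)
predicted = predicted_bp(0.1)

print(f"average: {bp_per_cm():,.0f} bp/cM")
print(f"physical distance: {physical:,} bp")
print(f"predicted from 0.1 cM: {predicted:,.0f} bp")
print(f"physical / predicted: {physical / predicted:.0f}x")
```

The exact quotient is closer to 567,000 bp per cM, which the text rounds to about 560,000. Either way the observed physical distance is more than ten times the prediction.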

Differences could arise from recombination hotspots/coldspots, and/or errors in sequence assembly scube2 exon 1 scube2 exon 1 – 5000b

uh oh .... overlaps another gene scube2 intergenic region

Hit ‘export data’ on left part of page scube2 intergenic region – export scube2 intergenic region – export

choose ‘text’ scube2 intergenic region – export Primer3 input http://frodo.wi.mit.edu/primer3/ Primer3 output

Add enzyme (or gateway) sequences, and clone into your GFP vector Ensembl Zebrafish Home http://www.ensembl.org/Danio_rerio

Enter “scube2” and click “Go”. Ensembl Chromosome view

scube2 is split between two sequenced BACS: CU467654 and CU464087 Ensembl Chromosome view

Turn on BAC ends track Ensembl Chromosome view

No BAC has entire gene plus putative regulatory sequences, but one has 5’ regions Find a BAC by BLASTing NCBI http://blast.ncbi.nlm.nih.gov/

Click “nucleotide blast” BLAST scube2 vs. zebrafish nr

Enter accession number, click “nr”, and type in “Danio rerio” NCBI BLAST results

There are several BACs, but none contain the entire sequence BLAST2 sequences http://blast.ncbi.nlm.nih.gov/ => ‘nucleotide blast’, ‘Align two sequences’

Place accession numbers for coding sequence on top, and BAC sequence on bottom BLAST2 result

The BAC contains upstream sequence plus about 1500b of coding sequence Zebrafish BAC assembly http://www.sanger.ac.uk/cgi-bin/humpub/chromoview Zebrafish BAC assembly Aligning genomic sequences with VISTA

See exons and conserved noncoding sequences (regulatory?) Identify orthologous sequences

Grab peptide sequence of Scube2 from GenBank record BLAT search at UCSC with peptide

Pull down Tetraodon from Genome menu Which sequence is the true ortholog?

Whole genome duplication in teleosts makes orthology assignment a bit tricky Which sequence is the true ortholog?

Zebrafish, Tetraodon, Stickleback, Medaka

Zebrafish Chromosome #7 = Tetraodon #5, Stickleback #II or #VII, Medaka #3 or #18

Clue from conserved synteny scube2 in Tetraodon

ZF #7 = Tet #5 scube2 in Tetraodon

Zoom out and grab sequence via DNA tab scube2 in Stickleback

Zoom out and grab sequence via DNA tab scube2 in Medaka

Zoom out and grab sequence via DNA tab mVISTA Submission

Select 4 sequences – upload them on the following page mVISTA upload mVISTA Viewer

Probably exons: can adjust parameters to be more or less stringent mVISTA Viewer

Conserved non-coding regions => regulatory elements? mVISTA Alignment

Zebrafish vs. Tetraodon Task 3: Morpholinos, rescue, and expression

Overview: Find the zebrafish ortholog of your favorite gene. Find its location in the genome, locate the ATG, and identify the exon-intron boundaries. Design two 25-mer morpholino sequences that target (1) the ATG and (2) an exon-intron boundary. Identify an orthologous gene in another species for use in rescue experiments to control for morpholino specificity. Align this sequence with your morpholinos to determine degree of potential activity. Obtain a full-length clone of the zebrafish gene (via RT-PCR or clone collections) for use in overexpression experiments or expression analyses via in situ hybridization.

Entrez Gene Search http://www.ncbi.nlm.nih.gov

Select Gene from the Search menu Gene Search

Search for the first mouse entry Entrez Gene Entry

Scroll down . . . Entrez Gene Entry

Click on peptide (NP_XXX) link Boc GenPept record

Copy peptide sequence to clipboard BLAST home page http://blast.ncbi.nlm.nih.gov/Blast.cgi

Select tblastn Tblastn vs. nr

Paste in sequence, select nr, and type Danio rerio into the Organism box Tblastn vs. ests

Paste in sequence, select est_others, and type Danio rerio into the Organism box BLAST – access recent searches

Click on Request_ID’s to see results Boc vs. nr

Click on the “U” of the top hits to go to the UniGene page Boc vs. ests

Click on the “U” of the top hits to go to the UniGene page Boc UniGene search

Click on the “Brother of CDO” link Boc UniGene entry

Scroll down . . . Boc UniGene entry

Follow BC107996 link. “BCXXXXXX” = zebrafish gene collection BC107996 UniGene link

Click on the GenBank link BC107996 GenBank entry

1477 bp. Is this full length? Go back and check other sequences NM_001005393 GenBank entry

3373bp. Does this have entire cds? Scroll down . . . NM_001005393 GenBank entry

Looks like entire sequence is present. CDS begins at 231. Design primers for RT-PCR Primer 3 Home http://frodo.wi.mit.edu/primer3/

Choose size range that will amplify complete cds Primer 3 Results

atg highlighted, end is included. Do high fidelity PCR and sequence verify NM_001005393 GenBank entry

CDS begins at 231. Design MO spanning ATG Ensembl sequence search

http://www.ensembl.org/Multi/blastview

Ensembl BLAT, choose zebrafish, paste in boc mRNA Zebrafish boc genomic structure

Each exon has an alignment on Chr 24. Beginning of CDS is at 231 Zebrafish boc genomic structure

Extract sequence around ATG or around splice junctions Zebrafish boc genomic structure http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=7955

Paste in coding sequence, and choose the Traces-WGS Database Zebrafish boc genomic structure

Alignments show exon structure – click on third exon Zebrafish boc genomic structure

Third exon ends at about 377b in trace, follow link to trace archive Zebrafish trace archive

Third exon ends at about 426b in trace Trace vs. coding sequence Trace vs. coding sequence

exon / intron boundary

Trace is in same orientation Trace vs. coding sequence http://www.bioinformatics.org/SMS/rev_comp.html

Can find reverse complement of trace if you need to Target for splice morpholino

End of exon is highlighted in this genomic sequence Design morpholino that surrounds this sequence Morpholino design

Boc mRNA = NM_001005393 >boc_exon3_gDNA ATGACTGATAGCCAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCCAAACTGGATTCAAATCCGGCGCCAATT ATAGATGTTTGTTTTGGATAGCCGATATATTTATCCTCCATTTTTCTGTCCTGTTTTCTCTTCTAGATGACGTTCCAG TGTTCACTGAGGAGCCGTTCTCGGTGGTACAGAAGCTGGGGGGCAGCGTGACCCTGCGCTGCAGTGCCCTGCCTAACC ATGTCAACATCAGCTGGCATCTGAATGGCAAAGAGCTGCCAGCTGGGGGCGACGAGGAGCTGGGAGTGCTGGTGCGGC CCGGTTCTCTCTACATCCCCTCCCTTACAAACCTCACCGTGGGCAGATACCAGTGTGTGGCCACCACCAGCGTCGGGT CCTGCGCGAGTGTGCCGGCCAATGTTACTGCAGCGAGTGAGTACACCAGTGCTTTTGTAAACATCAGACACTTGTTGT ATATATATTTACAGGCAATGGTGTGAACTTTGAAGCTTTTTTGTACAGAAAGTGTTTGTGTGCGAAAGATGTTTTGTT GTGTGGTGTGTGTGTGGTGTGTGTGTTTGCCTCATGTTAGTCAGTGTTGGAGCACAGAAAGGGGCGTGTAATATGACC CACCATAATGCCTGTACTAGTGAACCACTGGCCCCATTAGTGAAGGGCCTCCTCCCTGTTCTTCATAACTTTTCTTTG CAAATGTAGAGATGGAAAGAAAACCAAACAGTGAGTAAATTTTGAGAGTGTTTGGNTAGTAGTTGATTGAGTTGAGTT CAGG splice MO GTGTACTCACTCGCTGCAGTAACAT ATG MO: caacgtcccagacatcttccataca
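A quick script can confirm that a morpholino is antisense to its target before turning to an aligner. The sketch below (Python; the function names are illustrative) reverse-complements the splice MO and looks for an exact match in a short excerpt copied from the boc_exon3_gDNA sequence above. It only detects perfect matches, so it complements rather than replaces an alignment.

```python
# Translation table for complementing DNA, preserving case and N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_antisense_to(morpholino, target):
    """True if the morpholino's reverse complement occurs exactly in target."""
    return reverse_complement(morpholino).upper() in target.upper()


# Splice MO and a short excerpt of the genomic sequence, both copied from above.
splice_mo = "GTGTACTCACTCGCTGCAGTAACAT"
gdna_excerpt = "GCCGGCCAATGTTACTGCAGCGAGTGAGTACACCAGTGCTTTTG"

print("target site:", reverse_complement(splice_mo))
print("antisense to genomic excerpt:", is_antisense_to(splice_mo, gdna_excerpt))
```

The same two functions can be pointed at the mRNA sequence to check the ATG MO.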

Check these with your favorite sequence aligner to see that they match and are antisense. Splice MO needs to be checked against genomic region (the trace). Search for boc in Tetraodon http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

Enter mouse sequence into the BLAT page at UCSC, and select Tetraodon Tetraodon BLAT results

Click on the browser link for the top result Tetraodon genome browser

Zoom out a bit to include entire sequence Tetraodon genome browser

Click on GSTENT track Tetraodon genome browser

Click on protein and mRNA to retrieve sequences Tetraodon Boc peptide sequence

How does this compare with the mouse sequence? Tetraodon Boc vs. Mouse Boc: BLAST2

How does this compare with the mouse sequence? Tetraodon Boc vs. Mouse Boc: BLAST2 Get UTRs for Tetraodon boc Get UTR sequences for Tetraodon boc

Grab about 100b of putative 5’ UTR Get UTR sequences for Tetraodon boc

Grab about 100b of putative 3’ UTR Primer 3 Home http://frodo.wi.mit.edu/primer3/

Enter cds plus putative UTR into primer 3 Primer 3 Output

Amplify via hi-fi PCR, clone, and inject Orthology makes sense by conserved synteny?

ZF: 24 Tet: 6

Looks OK ZF MO’s vs. Tetraodon boc

>tetraodon_boc_coding ATGACGATCCGTGTGGGGCTCCGGAGCCGACGGGGAGAGCTGCTTCGAGCATCCGGAGAGGGGAGTGTATGGAAAATGCCTGGAAAGCGCGACTGGACTCCGTGGATGAAAAAGAATAGGGC TCCAGTCTTGTGCACTCTGGGTGCAGTACTGCTATGCTGCCTGCAGAGCGGTGCCTCTGTCCCTGACGAGGTGCCGGTGTTCACCGAGGAGCCTGCGTCGGTGGTGCAGAAGCTCGGTGGTA GCGTGTCTCTGCGCTGCAGTGCCCGGCCCGCCTTGGCCAACATCAGCTGGCGCCTCAACGGCCAGGAACTGCTGGATGGAGATTTTGGAGCCGTGCTGGGGCCCAACAGCCTCTACATCCCG TCTTTGTCCAACCTGACCCTGGGCAGGTACCAGTGTGTAGCCAGCACTGGTGTGGGCGCCTTGGCCAGTGTACTGGCTAATGTGACGGCTGCCAAGCTGCGGGATTTCGAGCCGGACGACCA TCAGGAGATCGAGGTGGACGAGGGCAACACGGCCGTCATCGAGTGCCACCTCCCCGAGAGCCAGCCCAAGGCTCAGGTCCGCTACAGCGTCAAGCAGGAGTGGCTGGAAACATCCAAAGGCA ACTACCTCATCATGCCATCAGGGAACCTGCAGATCGCTAACGCCACCCAGGAGGACGAGGGCCCGTACAAGTGCGCCGCCTACAACCCCGTCACTCAGGAGGTCAAGACATCCATCTCTGCG GACCGCCTGCGCATACGCCGCTCCACCTCCGAGGCCGCACGCATCATTTACCCGCCGGCTTCTCGCTCCATCATGGCGACCAAGGGCCAGCGGCTGGTGCTGGAGTGTGTGGCCAGCGGCAT CCCCACCCCTCAGGTGACATGGGCGAAGGACGGGCAGGACCTGCGCTACGTCAACAACACCCGCTTCCTGCTCAGCAACCTGCTGATCGACGCCGTGGGTGAGAGCGACTCGGGCACCTACG CCTGCCAGGCCGACAACGGCATCCTTGCATCCGCCTCTGCGATGGTGCTCTATAACGTCCAGGTGTCCGAGCCTCCCCAGGTGACGGTGGAGCTGCAGCAGGTGTACGGTGGGACGGTGCGC TTCACCTGCCAGGCTCGCGGCAAACCGGCTCCCTCGGTGACGTGGCTCCACAACGCGCGGCCCCTGTCCCCGTCGCCCCGCCACCGGCTGACCTCCAGGATGCTCCGCGTGTCCAACGTGGG CCTCCAGGACGAGGGCCTGTACCAGTGCATGGCCGAGAACGGCGTGGGCAGCTCGCAGGCGTCGGCTCGCCTCATCATAGCCTCGGCCGTCGTCCCCCCGCGGGGAAAGCCGCCCTCCATTT TTCTGAGTCCCGACAAGGTGCTGCGGGAGCAGCCTCCGGTGAGGCCGGGGCCCGGCGGCGCCATGTTGCCCCTGGACTGCTCCGAGCTGCCGGGACAGGTCCTGCCCGCAGAAGCTCCCATC ATCCTCAGCCAGCCGCGCACGGGCAAGGCCGACTATTACGAGCTGACCTGGAGGCCCCGACACGAGCGCGGCGTTCCCGTGCTGGAGTATATGATTAAATACAGAAAGGTGGGGGACCCTCT GGCCGAGTGGACCTCCAGCAGTATCTCCGGCTCCCTGCACAAGCTGACCCTGGCCAAGCTGCAGCCAGACAGCCTGTACGAGGTGGAGATGGCTGCCAAGAACTGCGCCGGCTTGGGACAGC CGGCAATGATGACCTTCCGAACCGGCAAAGGTACATCGGAGCACCTCGGTGATGTTTCGGGGAAGGTTCTAAGAGTGGGGGCAGGTCGTAGAGGAAAAATCGATCCTCCAAAGACCCCTGCG GTCCCGTCGCCAAGCCTCTCTCGGTTTTTTTTGCCCTGTGTTTCTTGTCCTGTCCCATTTCACACTGCCCCCCCCCCCCCCGCAGCTCCCGAAGCCCCCGACAAGCCCACGGTCTCCGCGGC 
GACGGAGACATCGGCGTACGTGACCTGGATCCCGCGCGGCAACCGCGGCTTCCCCATCCAGTCTTTCCGGGTGGAGTACAAGAAAGTGAAGAAGGCCGGAGAAGACTGGGTGACGGCAGTGG AGAACATCCCCCCATCGCGCCTCTCCGTGGAGATCACAGGCCTGGAGAAAGGTACATCCTACAAGTTCCGCGTGGTGGCGGTGAATGTCATCGGTTCCAGTCCCCCCAGCGCTCCTTCCAAG GCCTACGCGGTGGTGGTTGGGAGAACCCCCGAGCGGCCCGTCGACGGCCCCTACATCACCTACAACGAAGCCATCAATGAGACCAGCATCATCCTCAAATGGACGTACACGCCTGTGAACAA CACGCCCATCTACGGCTTCTACATCTACTACCGCCCGACGGACAGCGACAACGACAGCGACTACAAGAAGGATGTGGTGGAGGGGGACAAGTACTGGCACTCCATCACCAACCTCCAGCCTG AGACCGCCTACGACATCAAGATGCAGAGCTTCAACGAGAAGGGCGAGAGCGAGTTCGGCAACGTGGTGATCCTGGAAACCAAAGGTGGGGCTGTCGTGTCGCTCGCGCCTGGGGGGGTGGAG GACAGAACCGGGTGCATTGATTGTATCCCTCCACTGCGTCTCGCTAGCCCGCCCCAATCAGCCCGTCCCGTCGGAGATCCCAGATTACAGTCCTGGAACCCCCAAGGACGGCGTGCCTCGGC CCGGCGACCTCCCCTACTTCATAGTCGTCATTGTCCTCGGGGCCTTCATCTTCATCATTGTGGCCTTCATCCCCTTCTGTCTTTGGAGGACCTGGGCCAAGCAGAAGCAAACATCAGACATG TGCTTTCCCGCCGTGCCCTCCCCCGTGCCATCCTGCCAGTACACCATGGTCCCTCTCCAGGGACTGGCCCTGGTTGGCCGCTGCCCGCTGGATGGTCACATGACCGGGCCGCACGGGGTTTA CCCTGTGAATGGCGAGTGCGGCATGAATGGCAAACCTCACCACCTGCCAGGACGGCAGCAGGTAAAGAAGCGAAGCGCTGGAACCGGCCTGTGTGGTGGGAGTGGAAAGCCTCAGCTGTGGT TTCTCCACAGGAGGAGGCGGACTGTGACATGGAGTGTGACACCCTGTTACCGCAGACGGTGCCAAATGGACATTTGCCAGTTTGCCATTACCCCACCAGAGTCGGTGTCCTTTCCTCTCCTC TCTCTGGAAGACGAAGGGGTCTTCACCACGTCCTCCTCGACGGCCACAACGCCACAATCCCAAGATACGATTCAGGAAGTGAGCATCCTCCCAAATGA >tetraodon_boc_genome TGGCGTAAAAAGTCACCGTTATGCAGGTGACGCCACAGATTGGTCGTGCGGGACTAAAGAAATGACTCTTTCCTGCAGGCAACTACCTCATCATGCCATCAGGGAACCTGCAGATCGCTAAC GCCACCCAGGAGGACGAGGGCCCGTACAAGTGCGCCGCCTACAACCCCGTCACTCAGGAGGTCAAGACATCCATCTCTGCGGACCGCCTGCGCATACGCCGTGAGTGGCCGCCGGTGCAGAC ACACGCACAGGCCGTTTTTGCATTTTGAGGTCACAAATGGTCGCACAAGCAGTTCCAAGAATTCCGAAATCACGTGCCGGCCGTCCCCCAGGCTCCACCTCCGAGGCCGCACGCATCATTTA CCCGCCGGCTTCTCGCTCCATCATGGCGACCAAGGGCCAGCGGCTGGTGCTGGAGTGTGTGGCCAGCGGCATCCCCACCCCTCAGGTGACATGGGCGAAGGACGGGCAGGACCTGCGCTACG TCAACAACACCCGCTTCCTGCTCAGCAACCTGCTGATCGACGCCGTGGGTGAGAGCGACTCGGGCACCTACGCCTGCCAGGCCGACAACGGCATCCTTGCATCCGCCTCTGCGATGGTGCTC 
TATAACGTCCAGGTGTCCGGTGAGTACACATGGAGGGCGTGAGGTGCAGGGACCTGGATGAGGTGCATGTTTGTTAAAATAGCTTTTTTCTTTCTGTAAAAACAAATGTGTTACTGTAGGTT TTTTTTGTGTGATTTTGGTAATCACACAAATTGCGTGAAATTGTGTGTGAGAGGCCTGTGAGGCTGGGCGGGGTTCTGGGGGACGGTATTTATGGAGACAGGCCTCCCTCTGTCAGAGGACT TGACTG splice MO CACTCACCGAACACCTGCACATCGT ATG MO: caacgtcccagacatcttccataca
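For a rescue construct, the question is how poorly a zebrafish morpholino matches the Tetraodon sequence. One rough way to gauge this, sketched below in Python, is to slide the morpholino's target site (its reverse complement) along the sequence and report the fewest mismatches found. This is an ungapped scan with illustrative function names, not a substitute for a BLAST alignment. The example uses the ATG MO and the first 50 bases of the tetraodon_boc_coding sequence above.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the upper-case reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def fewest_mismatches(morpholino, sequence):
    """Best ungapped placement of the MO target site in sequence.

    Returns (mismatches, offset), or None if sequence is shorter than the MO.
    """
    target = reverse_complement(morpholino)
    sequence = sequence.upper()
    best = None
    for offset in range(len(sequence) - len(target) + 1):
        window = sequence[offset:offset + len(target)]
        mismatches = sum(a != b for a, b in zip(target, window))
        if best is None or mismatches < best[0]:
            best = (mismatches, offset)
    return best


atg_mo = "caacgtcccagacatcttccataca"
tetraodon_start = "ATGACGATCCGTGTGGGGCTCCGGAGCCGACGGGGAGAGCTGCTTCGAGC"

print(fewest_mismatches(atg_mo, tetraodon_start))
```

A result of many mismatches out of 25 positions suggests the morpholino is unlikely to bind the rescue transcript, but how many mismatches are enough is an experimental question.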

Test via blast2: ATG-MO vs. cds, and splice-MO vs. genome – no alignment

Task 4: Batch sequence retrieval and BLAST (a bit advanced)

Overview: Mine OMIM (Online Mendelian Inheritance in Man) for genes related to Sonic Hedgehog. Get amino acid sequences for these genes, and identify (via BLAST) the zebrafish orthologs for these proteins. Use a simple script to parse the BLAST results to see where the genes are located in the zebrafish genome. Finally, find out where a few of these genes are expressed (via ZFIN).

Requirements: UNIX terminal, perl (both native on MacOSX)

Go to the NCBI website: http://www.ncbi.nlm.nih.gov/

Select “OMIM” and search for ‘shh’ OMIM

http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM

Select “Protein links” from the Display pulldown menu

Select “Homo sapiens” from the Top Organisms menu

Select “FASTA (text)” and show 200 from the Display Settings menu

Build up a file from the sequences

BLAST homepage

http://blast.ncbi.nlm.nih.gov/Blast.cgi Click on the “help” tab Follow link for “Download BLAST Software and Databases”

Use the “blast” download for your platform

Connect to the Ensembl ftp server

Move to the Zebrafish Directory and download sequences

cd pub/release-63/fasta/danio_rerio/cdna
mget Dan*

Make a BLASTable database from the zebrafish sequences

gunzip Danio*
cat Danio_* >> Zv9_release63_transcripts.fa
mv Zv9_release63_transcripts.fa ~/Desktop/ncbi-blast-2.2.25+
makeblastdb -in Zv9_release63_transcripts.fa -dbtype nucl

May need to type ./bin/makeblastdb ... if you haven’t updated your PATH to point to BLAST executables.

BLAST the SHH-related peptides vs. the zebrafish transcript database

tblastn -query shh_peps.fa -db Zv9_release63_transcripts.fa -num_descriptions 2 -num_alignments 2 -evalue 1e-5 -out shh_v_zv9transcripts.blast &

Type tblastn -help for BLAST options Command-line BLAST results Parse BLAST results with wh_blast.pl

perl wh_blast.pl shh_v_zv9transcripts.blast > blast_output.csv Import into excel as comma-delimited file Anything near our region? Chr 7: ~28.8 Mb
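The wh_blast.pl script itself is not reproduced here, so the following is not its implementation. As a hedged illustration of the "anything near our region?" step, this Python sketch assumes that each hit's description line follows the Ensembl cDNA FASTA convention of embedding a location such as chromosome:Zv9:7:start:end:strand. The transcript IDs, coordinates, and the 2 Mb window in the example are hypothetical.

```python
import re

# Assumed Ensembl-style location field: chromosome:<assembly>:<chr>:<start>:<end>:<strand>
LOCATION = re.compile(r"chromosome:[^:\s]+:([^:\s]+):(\d+):(\d+):(-?1)\b")


def parse_location(description):
    """Return (chromosome, start, end, strand), or None if no location is found."""
    match = LOCATION.search(description)
    if match is None:
        return None
    chrom, start, end, strand = match.groups()
    return chrom, int(start), int(end), int(strand)


def near_region(description, chrom="7", center=28_800_000, window=2_000_000):
    """True if the hit overlaps center +/- window on the given chromosome."""
    location = parse_location(description)
    if location is None:
        return False
    hit_chrom, start, end, _strand = location
    return hit_chrom == chrom and start <= center + window and end >= center - window


# Hypothetical description lines, for illustration only.
hits = [
    "ENSDART00000000001 cdna:known chromosome:Zv9:7:28912000:28934000:-1 gene:ENSDARG00000000001",
    "ENSDART00000000002 cdna:known chromosome:Zv9:12:28912000:28934000:1 gene:ENSDARG00000000002",
]
for hit in hits:
    print(hit.split()[0], parse_location(hit), near_region(hit))
```

Hits placed on unassembled scaffolds carry no chromosome field under this assumption and are simply reported as not near the region.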

What are these genes? Look at Ensembl record http://www.ensembl.org/Danio_rerio/Transcript/Transcript?t=ENSDART00000089574

what’s this? Ensembl cDNA view Orthologs

You’ve found a mutation in tubby! A fish model for weight loss. You’ll be rich! Ensembl Genome View

turn on expression pattern track: configure page => Other DNA Alignments => Expression No Data available for this gene Finding the genbank RefSeq entry http://blast.ncbi.nlm.nih.gov/Blast.cgi

Go to blast homepage and select “nucleotide blast” Finding the genbank RefSeq entry

Select the “nr” button, and type in “Danio rerio” in the Organism box Finding the genbank RefSeq entry

Take the accession number of the top hit (XM_692022) and search at ZFIN Note: XM_XXXXX are predicted transcripts and not usually on ZFIN (they are usually real, though) ZFIN homepage http://www.zfin.org

Click on “Search Genes/Markers/Clones” ZFIN gene search

Paste/type the accession number into the appropriate box ZFIN gene record

Doh! You’ll have to clone it and find the expression pattern yourself