Ensembl Tools
EBI is an Outstation of the European Molecular Biology Laboratory.

Questions?
• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat box periodically for questions
• There’s no threading so please respond with @name

Objectives
• What is Ensembl?
• What tools are available in Ensembl?
• How to use the online tools in Ensembl.
• Where to go for help and documentation.

Overview
• Introduction to Ensembl
• BLAST/BLAT
  • Sequence searching
• Assembly Converter
  • Convert files between genome assemblies
• Data Slicer
  • Pull out sections of VCF and BAM files
• File Chameleon
  • Custom download of reference files for NGS analysis
• Variant Effect Predictor (VEP)
  • Analyse your own variants

Introduction

Why do we need genome browsers?
1977: 1st genome to be sequenced (5 kb)
2004: finished human sequence (3 Gb)
(Slide shows a full page of raw, unannotated DNA sequence.)
We need to make the data mean something…
http://www.ncbi.nlm.nih.gov/mapview
http://www.ensembl.org
http://genome.ucsc.edu

Ensembl Features
• Gene builds for ~70 species
• Gene trees
• Regulatory build (ENCODE)
• Variation display and VEP
• Display of user data
• BioMart (data export)
• Programmatic access via the APIs
• Completely Open Source

Access scales
Scale: one by one, groups, whole genome
Access methods: Main browser, Mobile site, BioMart, REST API, Perl API, VEP, MySQL, FTP

Vertebrate species on Ensembl
Image obtained using Dendroscope: Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, D.H.
Huson and C. Scornavacca, Systematic Biology, 2012

Non-vertebrates on Ensembl Genomes
• Bacteria
• Protists
• Fungi
• Metazoa
• Plants
www.ensemblgenomes.org

Ensembl and Ensembl Genomes
Ensembl
• Released: 2000
• Species: Vertebrates (fly, worm and yeast as outgroups)
• Annotation: by Ensembl
• URL: www.ensembl.org
Ensembl Genomes
• Released: 2009
• Species: Non-vertebrates (protists, plants, fungi, metazoa, bacteria)
• Annotation: in collaboration with the scientific communities
• URL: www.ensemblgenomes.org

Release cycle
• A new release every 2-3 months (release 89: May 2017; release 90: July 2017)
• New/updated interfaces
• New genome assemblies
• Updated gene sets
• Compara on new genes and genomes
• Updated variation data
• Updated regulation data
• Underlying software updates

Ensembl Tools
Tools allow:
• Interpretation and processing of your own data
• Custom download of Ensembl data for further analysis

BLAST/BLAT for sequence searching
• Find Ensembl sequences that match your sequence using BLAST/BLAT
• Search:
  • Nucleotide sequences
  • Protein sequences
  • Short sequences (e.g. primers, morpholinos, siRNAs)
• Search against:
  • Genomic sequences
  • cDNA sequences
  • Protein sequences

Hands on – BLAST/BLAT
• I’ve designed a pair of primers for RT-PCR against human BRCA2
• I want to make sure they don’t have any non-specific hits that will mess up my RT-PCR results
• The sequences are:
>fwd
GAGGACTCCTTATGTCCAAATTT
>rev
GAGAATCAGCTTCTGGGGTAATAA

Assembly converter
• You have data mapped to an old genome assembly
• You want to update your data to map it to a new one

What is a genome assembly?
Sequence reads
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA
CAGCTGTCCCAGATGAC
ACTTAACTTCCCTCCCAGCTGTCC
GGGCTCCGCCTTCAGCTC
TCCCAGCTGTCCCAGATGACGCCATC
AACTTCCCTCCCAGCT
CGGCCTTTGGGCTCC
TCCGCCTTCAGCTCAAGACTTAACTTC
CAGATGACGCC

Match up overlaps
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA
CGGCCTTTGGGCTCC
        GGGCTCCGCCTTCAGCTC
            TCCGCCTTCAGCTCAAGACTTAACTTC
                             ACTTAACTTCCCTCCCAGCTGTCC
                                 AACTTCCCTCCCAGCT
                                         TCCCAGCTGTCCCAGATGACGCCATC
                                            CAGCTGTCCCAGATGAC
                                                     CAGATGACGCC

Genome assembly
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATC

Genome contigs
(Figure: the genome drawn as four contigs: BL102, AL476, CM553, IM768.)

Reference alleles
(Figure: contigs BL102, AL476, CM553, IM768, with one variant position highlighted.)
AGTCGTAGCTAGC TAGGCCATAGGCGA
Frequency T = 0.05, frequency G = 0.95
G is the allele in all primates
T causes disease susceptibility
Perhaps G should be the reference allele?
We can replace the region with a new contig

Genome Gaps
(Figure: contigs BL102, AL476, CM553, IM768, with a gap between contigs.)
Gap in the genome caused by:
• Poor sequencing at this region
• No contig was ever cloned
We can fill in the gap with a new contig

Incorrectly assembled contigs
(Figure: contigs BL102, AL476, CM553 and IM768 shown in two different orders.)

New genome assemblies
• Fixing errors in the genome produces a new genome assembly
• New genome assemblies mean re-mapping of all genome features
• Ensembl will stop updating the old assembly when a new one is brought in
• You’ve got data mapped to the old assembly and you want to compare to the up-to-date Ensembl annotation

Assembly converter
• Converts genome coordinates to a different genome assembly.
• Works with:
  • BED (simple coordinates)
  • GFF (gene, transcript and exon coordinates)
  • GTF (gene, transcript and exon coordinates)
  • WIG (values plotted against the genome)
  • VCF (variants)

Hands-on – Assembly converter
• We’re going to convert a small BED file from the human genome assembly GRCh37 to the more recent GRCh38
• BED is a simple feature format that lists the start and end coordinates of each feature.
5 36821734 37091336 P1
5 36731578 36978408 P2
5 36908654 37108773 P3

Data Slicer for variants
• Whole genome VCF files are unwieldy
  • They contain all variants in the genome
  • They contain all genotypes from all individuals studied
• Sometimes you just want to analyse a small region and one population
• The Data Slicer allows you to take a slice of a VCF and narrow down to only individuals and populations of interest
• Data Slicer currently only accesses the 1000 Genomes data
• It is only available for human and only on GRCh37

Hands on – Data Slicer
• I want to get a VCF of the region containing the MC1R gene for the British population
• MC1R is found at 16:89978527-89987385 in GRCh37
• The three-letter code for the British population in 1000 Genomes is GBR

FTP
• Files of our complete database:
  • Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
  • Annotated sequence (EMBL, GenBank)
  • Gene sets (GTF, GFF)
  • Whole-genome multiple and gene-based multiple alignments (MAF)
  • Variants (VCF, GVF)
  • Constrained elements (BED)
  • Regulatory features (BED, BigWig)
  • RNA-Seq files (BAM, BigWig)
  • MySQL database

Access FTP
Your favourite FTP client
FTP site: ftp://ftp.ensembl.org/pub/
FTP downloads page: http://www.ensembl.org/info/data/ftp/index.html

FTP files are big
• Multiple Mb/Gb
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download.

File Chameleon for NGS analysis
• Although files on the Ensembl FTP site are in a standard format, different tools define the standards differently (sigh!)
• Your NGS analysis tool might need files that are slightly different to the Ensembl formats
• File Chameleon allows you to download files with these adjustments

Hands on – File Chameleon
• I need a GFF3 file of cat for my RNA-seq analysis.
• My tool requires:
  • UCSC-style chromosome naming like chr1
  • Only genes shorter than 4 Mb
  • Transcript IDs in every line
• We will use File Chameleon to download this customised file.
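
To make the File Chameleon adjustments concrete, here is a minimal Python sketch of two of them applied to a GFF3 file you have already downloaded: UCSC-style chromosome naming and removal of genes of 4 Mb or longer. This is not File Chameleon's own code. The function names (ucsc_name, customise_gff3), the set of gene-level feature types, the MT to chrM mapping, and the assumption that parent features appear before their children in the file are all illustrative choices. The "transcript IDs in every line" adjustment is not covered.

```python
# Assumed gene-level feature types in an Ensembl-style GFF3 (illustrative).
GENE_TYPES = {"gene", "ncRNA_gene", "pseudogene"}


def ucsc_name(seqid):
    """Return a UCSC-style name: 'A1' -> 'chrA1', 'MT' -> 'chrM'."""
    if seqid == "MT":
        return "chrM"
    return seqid if seqid.startswith("chr") else "chr" + seqid


def customise_gff3(lines, max_gene_len=4_000_000):
    """Rename sequence IDs and drop long genes together with their children.

    Assumes each parent feature is listed before its children.
    """
    dropped = set()  # IDs of features that were removed
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##sequence-region"):
            parts = line.split()
            parts[1] = ucsc_name(parts[1])
            out.append(" ".join(parts))
            continue
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            out.append(line)  # other pragmas, comments, blank lines
            continue
        # Column 9 holds 'key=value' pairs separated by ';'
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        length = int(cols[4]) - int(cols[3]) + 1  # GFF3 is 1-based, inclusive
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        too_long = cols[2] in GENE_TYPES and length >= max_gene_len
        if too_long or any(p in dropped for p in parents):
            if "ID" in attrs:
                dropped.add(attrs["ID"])  # so that children are dropped too
            continue
        cols[0] = ucsc_name(cols[0])
        out.append("\t".join(cols))
    return out
```

For a real file you might call it as customise_gff3(open("cat.gff3")), where cat.gff3 is a hypothetical local file name, and write the returned lines back out joined with newlines.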

Analyse your own variants with the VEP
• Find out the effects of your own variants on Ensembl genes
• Analyse whole genome variant calls
• Filter variants to find those that might be interesting

Your own variant data
Variant coordinates
1 881907 881906 -/C +
5 140532 140532