Browsing Genes and Genomes with Ensembl

Browsing Genes and Genomes with Ensembl Ensembl Workshop University of Cambridge 3rd September 2013 www.ensembl.org Denise Carvalho-Silva Ensembl Outreach Notes: This workshop is based on Ensembl release 72 (June 2013). 1) Presentation slides The pdf file of the talks presented in this workshop is available in the link below: http://www.ebi.ac.uk/~denise/naturimmun 2) Course booklet A diGital Copy of this course booklet can be found below: http://www.ebi.ac.uk/~denise/coursebooklet.pdf 3) Answers The answers to the exercises in this course booklet can be found below: http://www.ebi.ac.uk/~denise/answers 2 TABLE OF CONTENTS OVERVIEW ...................................................................................................... 4 INTRODUCTION TO ENSEMBL .................................................................. 5 BROWSER WALKTHROUGH .................................................................... 14 EXERCISES .................................................................................................... 38 BROWSER ..................................................................................................... 38 BIOMART ...................................................................................................... 41 VARIATION ................................................................................................... 48 COMPARATIVE GENOMICS ...................................................................... 50 REGULATION ............................................................................................... 52 QUICK GUIDE TO DATABASES AND PROJECTS ................................. 55 CHECK YOUR LEARNING! 10 ADVANCED EXERCISES ..................... 57 3 OVERVIEW Getting started with Ensembl at www.ensembl.orG Ensembl provides Genes and other annotation such as regulatory reGions, Conserved base pairs aCross speCies, and sequenCe variations. The Ensembl gene set is based on protein and mRNA evidence in UniProtKB and NCBI RefSeq databases, along with manual annotation from the VEGA/Havana group. All the data are freely available and Can be aCCessed via the web browser at www.ensembl.orG. Perl proGrammers Can direCtly aCCess Ensembl databases through an Application Programming Interfaces (Perl APIs). Gene sequenCes Can be downloaded from the Ensembl browser itself, or through the use of the BioMart web interfaCe, whiCh can extract information from the Ensembl databases without the need for proGramminG knowledGe! A sister browser at www.ensemblGenomes.orG is set up to aCCess non-chordates, namely bacteria, plants, fungi, metazoa, and protists). #$%&'("=*J$+ !"#$%&'(")*&"+$,-$.'$+"/01("234#562345 !"7$1&0$8$"9%1%"/01("20*:%&1 !"5&<"*-&"G%&0%.1"EH$'1"I&$90'1*& !"4''$++"1($"E.'*9$"9%1%"0."E.+$C=? !"5&<"*-&"G%&0%.1"EH$'1"I&$90'1*& !" 2&*/+$" >?%.1+@" =%'1$&0%@" )-.A0@" >&*B+1+" !" ;(**+$" <*-&" )%8*-&01$" %.9"C$1%D*%"%1"E.+$C=?"F$.*C$+ 8$&1$=&%1$"+>$'0$+ Points covered: • Why do we need Genome browsers? • An introduCtion to the Ensembl browser Check our video tutorial! • How data can be accessed in Ensembl ? http://www.youtube.com/user/ • An overview of Ensembl tools EnsemblHelpdesk The Ensembl Genome browser Introduction to BioMart 4 INTRODUCTION TO ENSEMBL Ensembl is a joint projeCt between the EBI (European BioinformatiCs Institute) and the WellCome Trust SanGer Institute that annotates chordate genomes (i.e. vertebrates and Closely related invertebrates with a notoChord, such as sea squirt). Gene sets from model organisms (e.G. yeast, fruitfly and worm are also imported for comparative analysis by the Ensembl Comparative GenomiCs team. Most annotation is updated every two months, leadinG to inCreasinG Ensembl versions (such as version 73), however the Gene sets are determined less frequently. !"#$%$&$%' !$+&'#3'4,&'56'+0' (#$)'*+,-.*/+%'+)& !"#$%&)#-+&0#*1)& 2'+$%' Figure 1: The ReGion in detail view: Protein alignments and other traCks. Click on the cog wheel icon to add more data track to Ensembl views. Alternatively, you may want to click on the Configure this page button instead at the left hand side. The vast amount of information associated with genomiC sequences demands a way to organise and access that information. This is where Genome browsers Come in. Ensembl strives to display many layers of Genome annotation into a simplified view for the ease of the user. Figure 1 above shows the ReGion in detail page for the BRCA2 gene in human. The example shows blocks of Conserved sequence refleCting Conservation sCores of sequence identity on a base pair level aCross 36 species. Conserved regions are displayed as dark bloCks that represent loCal reGions of aliGnment. 5 Also in this fiGure proteins alignments from the UniProtKB have been added to the Genomic loCation of the human BRCA2 gene. Filled yellow blocks show where these UniProtKB proteins align to the genome, and gaps in the alignment are shown as empty yellow bloCks. Note, in this Case, the UniProtKB proteins support most of the exons shown in the Ensembl BRCA2-001 and BRCA2-201 transcripts. Both Ensembl and Vega (Havana) transcripts are portrayed as exons (boxes) and introns (connecting lines). Filled boxes show codinG sequence and empty boxes refleCt untranslated regions (UTRs). This ReGion in detail view is useful for comparing Ensembl gene models with Current proteins, mRNAs and ESTs from other databases, suCh as NCBI RefSeq, ENA, UniGene and UniProtKB. Everything in this view is aliGned to the Genome. Figure 2: The ReGion in detail view: 1000 Genomes traCks The ReGion in detail view can be configured (using the Configure this page button) to show reGulatory features, sequence variation, and more! For example, CliCk on dbSNP under the Variation menu and turn on the sequence variants (dbSNP and all other sources) at the riGht side of this paGe. Save and Close. BaCk to the ReGion in detail view, click on the sequence variation of interest. A pop-up box will show you a few variation properties such as rs number, alleles, type, and others. CliCk on the rs properties link to take you to an information paGe for the Genetic variation, inCludinG links to population frequencies, if available. You Can do the same for reGulatory features as well. An index paGe is provided for eaCh speCies with information about the source of the Genomic sequence assembly, a karyotype (if available), and a link to past sites on our archive. The picture below shows the Ensembl homepaGe for human. Links to the human 6 karyotype, to the previous human assembly and a summary of gene and genome information are found in this index paGe. News SearCh Information Links to and statistiCs example features in Ensembl Ensembl devotes separate pages and views in the browser to display a variety of information types, using a tabbed structure. The three main entry points in Ensembl are the LoCation tab, Gene tab and TransCript tab. We have also the speCies tab, variation tab and reGulation tab. SpeCies tab Gene tab Variation tab LoCation tab TransCript ReGulation tab tab You Can for example Change the speCies of interest in the speCies tab, view a chromosomal region in the Location tab (also known as ReGion in detail), visualise Gene trees in the Gene tab, browse the cDNA sequence alonGside the protein translation in the transcript pages, look for Genotype information in the variation tab and get details by cell line for regulatory features. You Can also perform a similarity search aGainst any species in Ensembl by usinG our BlastView, whiCh offers aCCess to both BLAST and BLAT proGrams. 7 !"#$%&"'()*+'),*-#"'#)'.$%* /0$0%*#$1*&.$%0"-01*"02'.$%* 3.4.5.260%*'$*/0$0*!"00%* 789:!*#$1*789!*%0#"&,0%* Let’s take a look at the Ensembl Genomes homepage at www.ensemblGenomes.orG Links to the taxa- speCifiC sites Link baCk to Ensembl News 8 Click on the different taxa to see their homepages. Each of them is colour-coded: Protists Fungi Metazoa Plants Bacteria You Can naviGate most of the taxa in the same way as you would with Ensembl, but Ensembl BaCteria has a larGe number of Genomes, so needs slightly different methods. Let’s look at it in more detail. 9 SearCh for SearCh for a gene a speCies Information on Ensembl BaCteria There’s no full speCies list for bacteria as it would be hard to navigate with the number of speCies. To find a speCies, start to type the speCies name into the speCies searCh box. A drop down list will appear with possible speCies. For example, to find a substrain of Clostridium difficile type in Clostridium d. The drop down Contains various strains of Clostridium difficile. Let’s choose Clostridium difficile 630. This will take us to another speCies homepage, where we can explore various features. 10 Retrieving Data from Ensembl BioMart is a web-interfaCe that Can extraCt information from the Ensembl databases and present the user with a table of information without the need for proGramminG. It Can be used to output sequences or tables of Genes along with gene positions (Chromosome and base pair locations), single nucleotide polymorphisms (SNPs), homologues, and other annotation in HTML, text, or Microsoft Excel format. BioMart Can also translate one type of ID to another, identify genes associated with an InterPro domains or gene ontology (GO) terms, export Gene expression data and lots more. Ensembl uses MySQL relational databases to store its information. A comprehensive set of Application Programme Interfaces (APIs) serve as a middle-layer between underlyinG database schemes and more specific application proGrammes. The API aims to encapsulate

Browsing Genes and Genomes with Ensembl

CISD2 (NM 001008388) Human Tagged ORF Clone Product Data

Ensembl Genomes: Extending Ensembl Across the Taxonomic Space P

Abstracts Genome 10K & Genome Science 29 Aug - 1 Sept 2017 Norwich Research Park, Norwich, Uk

The ELIXIR Core Data Resources: Fundamental Infrastructure for The

Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

Cytotoxic Effects and Changes in Gene Expression Profile

Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*

DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment Erik S

Assessment of Genetic Mutations in WFS1 & CISD2 in Wolfram

Wolfram Syndrome

1 Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins 2 by Correlated Mutation Analysis 3

Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I