Browsing Bacterial Genomes

www.ensemblgenomes.org
bacteria.ensembl.org

Coursebook: http://www.ebi.ac.uk/~dstaines/Workshops/2014/Marseille
Marseille – 23rd June 2014


TABLE OF CONTENTS

Introduction to Ensembl Bacteria
  What can I do with Ensembl Bacteria?

Exploring the Ensembl Bacteria genome browser
  Demo: Finding genomes
  Exercise: Finding genomes
  Demo: The Region in detail view
  Exercises: The Region in Detail view

Genes and transcripts
  Demo: The gene tab
  Demo: The transcript tab
  Demo: Searching by sequence
  Exercises: Genes and transcripts

Comparative genomics
  Demo: Gene trees and homologues
  Demo: Gene families
  Exercises: Comparative genomics

Using your own data
  Demo: Viewing your data in the browser
  Demo: The Variant Effect Predictor (VEP)
  Exercises: Using your own data

Downloading data
  Exporting individual data
  Downloading whole genomes

Programming with Ensembl Bacteria
  Getting started with the REST API
  Getting started with the Perl API

Introduction to Ensembl Bacteria

Ensembl Genomes (www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, provided by the EBI (European Bioinformatics Institute).

The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project (www.ensembl.org), and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis.

The Ensembl Genomes project includes the Ensembl Bacteria portal, which provides access to over 10,000 genomes from more than 2,000 species of bacteria and archaea.

What can I do with Ensembl Bacteria?

• Explore more than 10,000 genomes from over 2,000 bacterial and archaeal species.
• View genes with other annotation along the chromosome.
• View detailed annotation for genes.
• Download gene sequences and annotation for individual genes or genomic regions.
• Explore homologues and phylogenetic trees across the taxonomic range for selected genomes.
• Find genes in the same family for any gene.
• Upload your own data for viewing in a genome.
• Find sequences matching your own from any genome.
• Determine how your variants affect genes and transcripts using the Variant Effect Predictor.
• Find genes involved in metabolic pathways.
• Share Ensembl views with your colleagues and collaborators.
• Download data in different formats for any genome.
• Use Ensembl Bacteria data in your own programs and scripts using the REST and Perl APIs.

Need more help?

Read our documentation

View the FAQs

Read some publications

Stay in touch!

• E-mail the team with comments or questions at [email protected]

• Sign up to a mailing list

Further reading

Kersey, PJ et al. Ensembl Genomes 2013: scaling up access to genome-wide data Nucleic acids research 2014, 42 (D1): D546-D552 PMID: 24163254 doi:10.1093/nar/gkt979

Spudich, GM and Fernández-Suárez, XM Touring Ensembl: A practical guide to genome browsing BMC Genomics 2010, 11:295 (11 May 2010)


Exploring the Ensembl Bacteria genome browser

Aims

By the end of this section, you should know how to…

• …find a genome you’re interested in

• …find information about a genome

• …browse the genome

Demo: Finding genomes

The front page of Ensembl Bacteria is found at bacteria.ensembl.org. It contains lots of information and links to help you navigate:

[Figure labels: Top bar present on every page (link back to homepage, Tools, Help, Search); Search for a gene; Information on Ensembl Bacteria; Complete genome list; News]

Whilst there is a full genome list for bacteria, the large number of genomes makes it hard to navigate. To find a particular genome, the easiest way is to start typing the species name into the genome search box. A drop-down list will appear with possible genomes. You can then select your choice. Alternatively, if you hit enter at this point, you’ll be taken to a search page where you can examine the possible matches.

For example, to find a substrain of Clostridium difficile, type in Clostridium d.

The drop-down list contains various strains of Clostridium difficile. Let’s choose Clostridium difficile 630. This will take us to the genome homepage, where we can explore various features.

[Figure labels: Search this genome; Annotation examples; Comparative examples; Launch the VEP tool]

Exercise: Finding genomes

Go to Ensembl Bacteria and find the species Belliella baltica. How many coding and non-coding genes does it have?

Demo: The Region in detail view

The Region in Detail view shows the genes and other features present in an area of the genome, and allows you to navigate around the genome.

We’ll search for C. difficile as we did before. On the species homepage there’s a link to a sample region Chromosome:1889811-1890515. Click on it to jump to the Region in Detail page.

[Figure labels: Circular chromosome (plasmids may also be shown); 100kb shown in this view]

Let’s look at the chromosome in more detail:

[Figure labels: Origin of sequencing (sometimes this is also the origin of replication, but not always); Gene density; GC skew; GC content; Our region of interest]
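The GC content and GC skew tracks summarise base composition along the chromosome. As a rough illustration of what such tracks measure, the Python sketch below computes both values over sliding windows. It assumes the commonly used definitions (GC content as the fraction of G and C bases, GC skew as (G - C) / (G + C)); the function names and the example sequence are made up for illustration, and the browser's own window size and calculation may differ.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """Return (G - C) / (G + C), or 0.0 if the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def sliding_windows(seq, size, step):
    """Yield (start, window) pairs; start is a 0-based offset into seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


if __name__ == "__main__":
    example = "ATGGGCGCATTA"  # made-up sequence, not taken from any genome
    for start, window in sliding_windows(example, size=6, step=3):
        print(start, round(gc_content(window), 2), round(gc_skew(window), 2))
```

For a real chromosome you would read the sequence from a FASTA file exported from the browser and use a much larger window.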

You can jump to a different region of interest by dragging out the handles of the region slice. This gives you a red highlighted slice that you can click on to jump to a new region.

Next, the upper panel in this display shows an overview of genes, colour-coded by type in the region:

[Figure labels: Contig; Gene; Controls for configuration, sharing and export; Region to show in detail]

Finally the lower panel shows a detailed view of the region selected in the red box, including features such as transcripts:

[Figure labels: Contig; Transcript; %GC; Controls for configuration, sharing and export]

You can find out more about features in either panel by clicking on them. You can also configure the images in Ensembl Bacteria by clicking the “Configure this page” button.

This opens a new panel where you can choose new tracks to show in the display including repeats, genomic features and translated sequences:

[Figure labels: Active track; Inactive track]

Exercises: The Region in Detail view

Exercise – Exploring a genomic region in E. coli

(a) Find the genome of E. coli str. K-12 substrain MG1655. Go to the region from 2,781,249 to 2,860,361 bp on the chromosome. What is the name of the sequence read on which this genome is based?

(b) Zoom in on the proV gene.

(c) Turn on the All Repeats track in this view. Are there any repeats near the proV gene?

(d) Create a Share link for this display. Email it to yourself and open the link.

(e) Export the genomic sequence of the region you are looking at in FASTA format.

(f) Turn off all tracks you added to the Region in detail page.


Genes and transcripts

Aims

By the end of this section, you should know how to…

• …find out about a particular gene

• …search for a gene by keyword or name

• …search for a gene by sequence similarity

Demo: The gene tab

If you click on any one of the transcripts in the Region in detail image, a pop-up menu will appear, allowing you to jump directly to that gene or transcript.

[Figure label: Links]

Another way to go to a gene of interest is to search directly for it. You can do this from the gene search box on the home page, from the search box in the top bar, or from the genome-specific search box on the home page for a particular genome. We’re going to use the gene search box on the home page.

We’re going to look at the lacZ gene in E. coli. This gene encodes beta-galactosidase, an enzyme involved in the metabolism of lactose, and part of the well-studied lac operon.


From bacteria.ensembl.org, type lacZ into the gene search box and click the Go button:

[Figure label: Gene search]

You will get a list of hits from all genomes:

[Figure labels: Result type; Genome filter]

To filter for a given genome, start typing its name into the filter box and select the genome of interest:

[Figure label: Links]

Click on the gene name or gene ID. The Gene tab should open:

[Figure labels: Gene tab; Information about the gene; Gene views; lacZ transcript (click for info); Forward-stranded transcripts; Blue bar is the genome; Reverse-stranded transcripts]

Let’s walk through some of the links in the left hand navigation column. How can we view the genomic sequence? Click Sequence at the left of the page.

[Figure labels: Exon of an upstream gene; Upstream sequence; lacZ]

Exons are highlighted in the sequence, which is shown in FASTA format. Take a look at the FASTA header:

[Figure labels on the FASTA header: name of molecule; genome assembly; base pair start; base pair end; strand (-1 is reverse)]
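If you export many sequences, it can be handy to read these header fields in a script. The Python sketch below shows one way to do that. It assumes the header is a colon-separated string with six fields in the order coordinate system, assembly, molecule name, start, end and strand; that layout, the function name and the example header are assumptions made for illustration, so check them against the header you actually see on the page.

```python
from typing import NamedTuple


class RegionHeader(NamedTuple):
    coord_system: str  # e.g. "chromosome"
    assembly: str      # genome assembly name
    name: str          # name of the molecule
    start: int         # base pair start
    end: int           # base pair end
    strand: int        # 1 is forward, -1 is reverse


def parse_region_header(header):
    """Parse a header such as '>chromosome:ASSEMBLY:Chromosome:100:200:-1'.

    The six-field, colon-separated layout is an assumption; any text after
    the first whitespace in the header is ignored.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("empty FASTA header")
    fields = tokens[0].split(":")
    if len(fields) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(fields)}")
    coord_system, assembly, name, start, end, strand = fields
    return RegionHeader(coord_system, assembly, name,
                        int(start), int(end), int(strand))


if __name__ == "__main__":
    # Hypothetical header with made-up assembly name and coordinates.
    print(parse_region_header(">chromosome:EXAMPLE_ASM:Chromosome:100:200:-1"))
```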

Can information about our gene be found in other databases? Go up the left-hand menu to External references:

[Figure labels: Links to external records; Database name; Key references for transcript/]

This contains links to the gene in other projects, such as BioCyc and EcoGene.


Demo: The transcript tab

Let’s now explore the transcript encoded by this gene. Click on the Transcript tab at the top:

[Figure labels: Transcript views; Transcript information]

You are now in the Transcript tab for lacZ-1. The left hand navigation column provides several options for this transcript. Click on the cDNA link to see the transcript sequence.

[Figure labels: Codon; Translation]

Next, follow the General identifiers link at the left.

This page shows records from other databases, such as UniProtKB and PDB, that match this transcript and protein.

[Figure label: Links to external records]

Click on links within the Ontology menu to see GO terms from the Gene Ontology consortium (www.geneontology.org) which have been used to annotate this protein.

[Figure labels: Term accession; Term description; Annotation source and evidence]

You can view the annotated terms in a chart by clicking the Ancestry Chart tab:

[Figure labels: Parent term; Annotated term]

Now click on Protein summary to view domains from PROSITE, Superfamily and more.

[Figure labels: lacZ protein; Protein domains; Protein statistics]

Clicking on Domains & features shows a table of this information.

Demo: Searching by sequence

A common task is to find known sequences that match a new sequence you’re interested in. There are two complementary tools available for this.

The Sequence Search tool allows you to search for protein or nucleotide sequences in Ensembl Bacteria or in the whole of Ensembl Genomes.

The BLAST tool is more flexible, allowing you to select specific genomes and to control the exact search parameters.

This demo will concentrate on the Sequence Search tool, and uses this protein sequence:

MKRASLLTLTLIGAFSAIQAAWAVDYPLPPTGSRLVGQNQTYTVQEGDKNLQAIARRFDT
AAMLILEANNTIAPVPKPGTTITIPSQLLLPDAPRQGIIVNLAELRLYYYPPGENIVQVY
PIGIGLQGLETPVMETRVGQKIPNPTWTPTAGIRQRSLERGIKLPPVVPAGPNNPLGRYA

Open the Sequence Search via the link in the top menu:

[Figure labels: Search sequence; Submit]

Paste your sequence in and hit submit. It may take a few minutes for the search to complete and for your results to appear:

[Figure labels: Link to matching gene(s); Link to genomic region]

From here, you can click on links to genes and their genomic regions. If you find a very large number of hits and need to filter to specific branches of the taxonomy, you should consider using BLAST instead.


Exercises: Genes and transcripts

Exercise – Exploring a gene

Start in http://bacteria.ensembl.org/index.html and select the Escherichia coli strain K-12 substr. MG1655 genome.

(a) What GO: biological process terms are associated with the era gene?

(b) What domains can be found in the protein product of this gene? How many different domain prediction methods agree with each of these domains?

Exercise – Finding a gene by sequence

Search for genes that match the following DNA sequence:

GATTTAACTGGTTTACTAGTAAATGAAAGGTCATTAGGACATGTAGATTTAATCGACGTA AGTAATTATTTAGGTATATCTCCTAGTCGTTACTCTTTTAAGTGGTATGAAATTTCTAGA TATTGGGATAACGAGAAGAATAGAAGAATTATTAGAGAATATAGTATAGAAAACGCTAGA TCAATATATTTATTAGGTAACTATCTACTATCTACCTATAGTGAACTAGTTAAGATAGTT GGTCTACCTTTAGATAAATTATCAGTAGCTAGTTGGGGAAATAGAATAGAGACTTCACTA ATAAGGACTGCTACAAAATCTGGGGAATTAATTCCTATTCGAATGGATAATCCTAACCGT CCTTCAAAAATAAAAAAGAATATAATTATTCAACCAAAAGTTGGTATCTATACTGATGTT TATGTTCTTGATATATCTTCAGTTTATTCATTAGTGATAAGAAAATTTAATATAGCTCCA GATACTTTAGTCAAAGAGCAGTGCGATGATTGTTATAGTTCCCCAATTTCTAATTACAAA TTTAAGAGAGAACCGTCTGGCCTTTACAAGACATTTTTAGATGAGTTAAGTAATGTTCGA GATTCTAACAAAATAAAGGTTATTGAAGAGTTAATATCGTCATTTAATGATTATGTTCA

What do you think the sequence encodes?


Comparative genomics

Aims

By the end of this section, you should know how to…

• …view and explore phylogenetic trees

• …find which genes are in the same family as a gene of interest

Demo: Gene trees and homologues

For some genomes, like E. coli K-12 MG1655, Ensembl Bacteria provides gene trees and lists of homologues to a wide range of other species. Let’s look at the homologues of E. coli pfkA. Search for the gene and go to the Gene tab.

Click on Gene tree (image), which will display the current gene in the context of a phylogenetic tree used to determine orthologues and paralogues.

[Figure labels: Gene of interest; Protein alignments; Collapsed nodes; Legend]

Funnels indicate collapsed nodes. We can expand them by clicking on the node and selecting Expand this sub-tree from the pop-up menu.

We can look at homologues in the Orthologues page, which can be accessed from the left-hand menu. The number of orthologues is indicated in brackets alongside the name. If there are none, then the name will be greyed out. Click on Orthologues to see the 228 orthologues available.

[Figure labels: Choose a taxon of interest; Orthologue types; Information on orthologues]

Choose to see only Vertebrate orthologues by selecting the box. The table below will now only show details of vertebrate orthologues. Let’s look at human.


Links from the orthologue allow you to go to alignments of the orthologous proteins and cDNAs. Click on Alignment (protein) for the first human orthologue.

[Figure labels: Protein IDs; Information on orthologue pair; Alignment in Clustal W format]

Demo: Gene families

Genes for all genomes in Ensembl Bacteria are classified into gene families using HAMAP and PANTHER.

To see the gene families for the pfkA gene, go to the Gene tab and click on Gene families (loading the family can take a while):

[Figure labels: Family name and description; Details of family members; Taxonomic filter; Family export]

Many families have a very large number of members. To filter by a particular branch of the taxonomy, click “Filter gene family”. You can then restrict the family to individual genomes or entire branches. We’ll select Escherichia coli and click the tick symbol:

[Figure labels: Find a particular genome; Select nodes; Click to apply filter]

This filters the family to just genes from Escherichia coli genomes:

From here, you can export the data for use elsewhere, or browse the genes in the family.


Exercises: Comparative genomics

(a) Find the gene fdh in Staphylococcus aureus subsp. aureus N315. How many orthologues does it have?

(b) Find the human orthologue. What is the target %id and the query %id? Why are these different?

(c) Click on the link to go to this human gene in Ensembl. (You may wish to open this in a new tab). What GO terms describe the first transcript of this gene?

(d) What GO terms describe the S. aureus gene? Do they overlap? How would you explain the difference in numbers of GO terms between the two orthologues?

(e) Find the fbp gene in this genome. What family does it belong to and how many members in this family come from strains of S. aureus?


Using your own data

Aims

By the end of this section, you should know how to…

• …view genomic data in a reference genome

• …find out the consequences of sequence variation

Demo: Viewing your data in the browser

If you have data that has been aligned to a genome, it can be uploaded to Ensembl Bacteria to view alongside existing annotations.

We’re going to upload a file in GFF3 format containing locations of promoters and terminators for E. coli from RegulonDB, which can be downloaded here: http://www.ebi.ac.uk/~dstaines/Workshops/2014/Marseille/regulon.gff3
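Before uploading a GFF3 file, it can help to check that it is well formed. GFF3 feature lines have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), and the attributes column holds semicolon-separated key=value pairs. The Python sketch below parses one such line; the function name and the example line are hypothetical and are not taken from the RegulonDB file.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 line into a dict; return None for blanks and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(parts)}")
    record = dict(zip(GFF3_COLUMNS, parts))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Split 'key=value;key=value' into a dictionary of attributes.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    record["attributes"] = attributes
    return record


if __name__ == "__main__":
    # Made-up feature line for illustration only.
    example = "Chromosome\texample\tpromoter\t100\t150\t.\t+\t.\tID=p1;Name=demoP"
    print(parse_gff3_line(example))
```

A line that raises ValueError here (for example because it uses spaces instead of tabs) is a likely cause of an upload that shows no features.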

To attach your data, go to the Region in Detail view and click on the “Add your data” button:

[Figure label: Click to add data]

This opens a new window where you can select local or remote data to upload. We’re going to upload a local file in Generic Feature Format (GFF):

[Figure labels: Give your data a name; Choose the format; Select your file]

After the file has been selected, hit Upload. You’ll be taken to a page where you can choose to view the data in the genome:

[Figure labels: Uploaded features; Annotated features]

Demo: The Variant Effect Predictor (VEP)

We have analysed a sample from a culture of Acinetobacter baumannii AYE taken from a patient being treated with antibiotics and found a mutation of interest: an A->T mutation on the chromosome at position 3,677,386 on the reverse strand.

We will use the Ensembl VEP to determine:

• What genes are affected by my variant?

• Does my variant result in a protein change?

Go to the home page for the genome of interest and click on the VEP button.

[Figure label: Select the VEP tool]

This will open up a dialogue box. This allows us to input data on our variant.


[Figure labels: Give your data a name; Put your data in here; You can also upload a file]

Data can be supplied in the following format, with the values separated by tabs, not spaces:

Chromosome  Start  End  Alleles (reference/mutation)  Strand

Type the following into the Paste data box, using tabs between the values:

Chromosome	3677386	3677386	A/T	-
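Because the values must be separated by tabs, it is easy to make a mistake when typing them by hand. As a small illustration, the Python sketch below builds a line in the five-column layout described above (name, start, end, alleles, strand) and joins the values with tab characters. The function name is hypothetical, and the sketch only checks the strand symbol; it does not validate the coordinates or alleles.

```python
def vep_input_line(seq_name, start, end, ref, alt, strand="+"):
    """Return one tab-separated variant line: name, start, end, ref/alt, strand."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return "\t".join([seq_name, str(start), str(end), f"{ref}/{alt}", strand])


if __name__ == "__main__":
    # The A->T variant from this demo, on the reverse strand.
    line = vep_input_line("Chromosome", 3677386, 3677386, "A", "T", "-")
    print(line)
```

Writing several such lines to a text file gives you something you can upload instead of pasting.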

Click Next.

Click HTML to view your results with clickable links.

[Figure labels: Our mutation affects one gene directly but neighbouring genes may be affected too; Our mutation causes an amino acid change]

Exercises: Using your own data

Exercise - Transcriptome data in E. coli

The following directory contains BigWig format files (extension .bw) with RNAseq data aligned to the E. coli genome: ftp://ftp.ensemblgenomes.org/pub/misc_data/areba/Escherichia_coli_K-12_MG1665/

Pick one or two files and add them as tracks to the browser for Escherichia coli K12 MG1655.

You can find out more about this experiment at http://www.ebi.ac.uk/ena/data/view/SRP011371

Exercise - VEP in bacteria

Find the genome for Bacteroides fragilis 638R and launch the VEP tool. Use the VEP to predict the effects of a 7 bp deletion of TCTACAA on the supercontig FQ312004 at the position 258140-258146.


Downloading data

Aims

By the end of this section, you should know:

• …how to export data for genes and regions of interest

• …how to download data for entire genomes

Exporting individual data

Data can be downloaded for individual genomic regions or features such as genes or proteins. Downloading is possible on any page where you see this button:

For instance, you can download a genomic region as FASTA (containing the genomic sequence) or GFF3 (containing the coordinates of the genomic features in that region) from the Region in Detail View.

In addition, on any image, you can click on the button to export the image in various formats (e.g. PDF, PNG), or the features shown in the region as GFF.


Downloading whole genomes

Contents from entire genomes can also be downloaded in a variety of formats (GFF3, FASTA), by following the link from the species home page:

[Figure labels: Download sequences; Download features]

Programming with Ensembl Bacteria

Aims

By the end of this section, you should know how to…

• …find out more about the REST API

• …find out more about the Perl API

Getting started with the REST API

The Ensembl REST API makes it easy to access Ensembl data programmatically in a language-independent way, allowing you to retrieve sequences, genomic features, cross-references and more using your favourite language. For more information, please visit: http://ensemblgenomes.org/info/access/rest

The REST server is fully documented with comprehensive examples in Perl, Python and Java.
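To give a flavour of what a REST request looks like, here is a minimal Python sketch that looks up a gene by its symbol. It is an illustration under stated assumptions rather than the documented examples: the server address (https://rest.ensembl.org), the use of the lookup/symbol endpoint, and the species name shown in the usage comment are all assumptions that you should check against the REST documentation linked above. The helper function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed server address; the server named in the REST documentation may differ.
REST_SERVER = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol, server=REST_SERVER):
    """Build the URL for a lookup/symbol request that returns JSON."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species, safe=""),
        urllib.parse.quote(symbol, safe=""),
    )
    return server.rstrip("/") + path + "?content-type=application/json"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON response (needs network access)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (not run here; the species name is an assumed spelling):
#   url = lookup_symbol_url("escherichia_coli_str_k_12_substr_mg1655", "lacZ")
#   gene = fetch_json(url)
#   print(gene)
```

Keeping the URL construction separate from the network call makes it easy to print and inspect the request before sending it.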

Getting started with the Perl API

The Ensembl software platform is written in Perl, and as such has an extremely rich Perl API which provides comprehensive access to all the data that Ensembl can offer.

For more about getting started with the Ensembl Perl API, please see: http://www.ensembl.org/info/docs/api/index.html

The number of genomes available from Ensembl Bacteria means that an additional API is useful to make discovery of genomes much easier. You can find out more about this API here: http://ensemblgenomes.org/info/access/eg_api

In addition, example scripts are shown on each species home page.
