Browsing Genes and Genomes with Ensembl

The Bioinformatics Roadshow
Copenhagen, Denmark
16 & 17 June 2011

BROWSING GENES AND GENOMES
WITH ENSEMBL

EXERCISES AND ANSWERS

Note: These exercises are based on Ensembl version 62 (13 April 2011). After in future a new version has gone live, version 62 will still be available at http://e62.ensembl.org. If your answer doesn’t correspond with the given answer, please consult the instructor.

______________________________________________________________ BROWSER ______________________________________________________________

Exercise 1 – Exploring a gene

(a) Find the human F9 (Coagulation factor IX) gene. On which chromosome and which strand of the genome is this gene located? How many transcripts (splice variants) have been annotated for it?

(b) What is the longest transcript? How long is the protein it encodes? How many exons does it have? Are any of the exons completely or partially untranslated?

(c) Have a look at the external references for ENST00000218099. What is the function of F9?

(d) Is it possible to monitor expression of ENST00000218099 with the CodeLink microarray? If so, can it also be used to monitor expression of the other two transcripts?

(e) In which part (i.e. the N-terminal or C-terminal half) of the protein encoded by ENST00000218099 does its peptidase activity, responsible for the cleavage of factor X to yield its active form factor Xa, reside?

(f) How many non-synonymous variants have been discovered for the protein encoded by ENST00000218099?

(g) Is there a mouse ortholog predicted for the human F9 gene? (h) On which cytogenetic band and on which contig is the F9 gene located? Is there a BAC clone that contains the complete F9 gene?

(i) If you have yourself a gene of interest, explore what information Ensembl displays about it!

______________________________________________________________

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘f9’ or ‘factor IX’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ‘F9 [ Ensembl/Havana merge gene: ENSG00000101981 ]’.

The human F9 gene is located on the X chromosome on the forward strand. There have been three transcripts annotated for this gene, ENST00000218099 (F9-001), ENST00000394090 (F9-201) and ENST00000479617 (F9-002).

(b) The longest transcript is ENST00000218099. The length of this transcript is 2780 base pairs and the length of the encoded protein is 461 amino acids.

 Click on the Ensembl Transcript ID ‘ENST00000218099’ in the list of transcripts.

It has eight exons.

 Click on ‘Sequence - Exons’ in the side menu.

The first and last exon are partially untranslated (sequence shown in purple). (c)

 Click on ‘External References - General identifiers’ in the side menu.  Explore some of the links (good places to start are usually ‘UniProtKB/Swiss-Prot’ and ‘WikiGene’).  Do the same for ‘Ontology – Ontology table’.

Factor IX is a vitamin K-dependent plasma protein that participates in the intrinsic pathway of blood coagulation by converting factor X to its active form in the presence of Ca²⁺ions, phospholipids, and factor VIIIa (this is the function description as found in UniProtKB/Swiss-Prot).

(d)

 Click on ‘External References - Oligo probes’ in the side menu.

The CodeLink microarray contains one probe, GE80895, that maps to ENST00000218099, so it is possible to monitor expression of this transcript using this array.

 Click on ‘ENST00000394090’ and ‘ENST00000479617’ in the list of transcripts.

The probe GE80895 also maps to ENST00000394090. There is no CodeLink probe that maps to ENST00000479617 though, so expression of this transcript cannot be monitored using this array.

(e)

 Click on ‘ENST00000218099’ in case you are not already on the ‘Transcript: F9-001’ tab.  Click on ‘Protein Information – Protein summary’ in the side menu.  Click on ‘Protein Information - Domains & features’ in the side menu.

The peptidase activity of the protein resides in the peptidase domain that is located in the C-terminal half of the protein.

(f)

 Click on ‘Protein Information - Variations’ in the side menu.

For the protein encoded by ENST00000218099 738 variants are shown. The vast majority of these are mutations from the Human Gene Mutation Database (HGMD). As the HGMD is a subscription database, Ensembl is only allowed to show the ID and the position of these mutations (if you want to get more information, you have to register at the HGMD). Apart from these a small number of variants from other sources are shown with more comprehensive information. Four of these are non-synonymous, i.e. rs104894807, rs6048, rs1801202 and rs4149751.

(g)

 Click on the ‘Gene: F9’ tab.  Click on ‘Comparative Genomics - Orthologues’ in the side menu.  Type ‘mouse’ in the ‘Filter’ text box at the top of the ‘Orthologues’ table.

There is one mouse ortholog predicted for human F9, ENSMUSG00000031138.

(h)

 Click on the ‘Location: X:138,612,917-138,645,617’ tab.

The F9 gene is located on cytogenetic band q27.1 and on contig AL033403.1.

 Click [Configure this page] in the side menu.  Type ‘clone’ in the ‘Find a track’ text box.  Select ‘1 Mb clone set’, ‘30k clone set’, ‘32k clone set’ and ‘Tilepath’.  Click ().

There are several BAC clones that contain the complete F9 gene, e.g. the tilepath clone RP6-88D7 and the clones RP11-1144I12 and RP13-520K15.

______________________________________________________________ Exercise 2 – Exploring a region

(a) Go to the region from bp 52,600,000 to 53,400,000 on human chromosome 4. How many contigs make up this portion of the assembly (contigs are contiguous stretches of DNA sequence that have been assembled solely based on direct sequencing information)? What does the open (i.e. non-blue) part in the ‘Contigs’ track represent?

(b) Do the tilepath clones (i.e. the BAC clones that were sequenced for the human genome assembly) correspond with the contigs?

(c) Zoom in on the SGCB transcripts, including a bit of flanking sequence on both sides.

(d) CpG islands are genomic regions that contain a high frequency of CG dinucleotides and are often located near the promoter of mammalian genes. Is there a CpG island located at the 5’ end of the SGCB transcripts? Possible transcription start sites can be predicted using the Eponine program

(http://www.sanger.ac.uk/Software/analysis/eponine/). Is there a transcription

start site predicted by Eponine annotated for the SGCB transcripts? (e) Export the genomic sequence of the region you are looking at in FASTA format.

(f) If you have yourself a genomic region of interest, explore what information Ensembl displays about it!

______________________________________________________________

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘4:52600000-53400000’ in the ‘for’ text box (or alternatively leave the ‘Search’ drop-down list like it is and type ’human 4:52600000-53400000’ in the ‘for’ text box).  Click [Go].

This genomic region is made up of seven contigs, indicated by the alternating light and dark blue colored bars in the ‘Contigs’ track. One of them, AC107402.4, located between AC093858.2 and AC093880.4, is very small and possibly not visible. The open part in the ‘Contigs’ track represents a gap in the genome assembly (although the human genome is called ‘finished’, there are still gaps!). Note that this region is very close to the centromere of the chromosome.

If you cannot see the very tiny contig AC107402.4, you may want to increase the width of your display as follows:

 Click [Configure this page] in the side menu.  Click on the ‘Configure page’ tab.  Select a larger pixels value or ‘best fit’ from the ‘Width of image’ drop-down list.  Click ().

(b) If the tilepath clones are not already shown:

 Click [Configure this page] in the side menu.  Type ‘tilepath’ in the ‘Find a track’ text box.  Select ‘Tilepath’.  Click ().

The tilepath clones do correspond to the contigs and it is easy to see from which BAC clone which contig sequence in the assembly is derived, e.g. AC027271.7 is derived from RP11-365H22, AC104784.5 is derived from RP11-61F5 etc.

(c)

 Draw with your mouse a box around the SGCB transcripts.  Click on ‘Jump to region’ in the pop-up menu.

(d)

 Click [Configure this page] in the side menu.  Type ‘cpg’ in the ‘Find a track’ text box.  Select ‘CpG islands’.  Click ()

There is indeed a CpG island located at the 5’ end of the SGCB-001, SGCB- 201 and SGCB-002 transcripts. Note that you also clearly can see this island in the ‘%GC’ track, that should be shown by default.

 Click [Configure this page] in the side menu.  Type ‘eponine’ in the ‘Find a track’ text box.  Select ‘TSS (Eponine)’.  Click ().

Eponine TSS-finder predicts a transcription start site close to the starts of the SGCB-001, SGCB-201 and SGCB-002 transcripts.

(e)

 Click [Export data] in the side menu.  Click [Next>].  Click on ‘Text’.

Note that the sequence has a header that provides information about the genome assembly (GRCh37), the chromosome, the start and end coordinates and the strand. For example:

>4 dna:chromosome chromosome:GRCh37:4:52886013:52905920:1

______________________________________________________________ ______________________________________________________________ BIOMART ______________________________________________________________

Exercise 1

Generate a list of all human genes that are located on the Y chromosome, that are protein-coding and that encode for a protein containing one or more transmembrane domains. Do a count after selection of each filter to check the number of genes remaining in your dataset.

______________________________________________________________

Answer

 Go to the Ensembl homepage (http://www.ensembl.org).  Click on the ‘BioMart’ link on the toolbar.

Start with all human Ensembl genes:

 Choose the ‘Ensembl Genes 62’ database.  Choose the ‘Homo sapiens genes (GRCh37.p3)’ dataset.

Now filter for the genes on the Y chromosome:

 Click on ‘Filters’ in the left panel.  Expand the ‘REGION’ section by clicking on the + box.  Select ‘Chromosome – Y’. Make sure the check box in front of the filter is ticked, otherwise the filter won’t work.  Click the [Count] button on the toolbar.

This should give you 511 / 53334 Genes. Now filter further for genes that are protein-coding:

 Expand the ‘GENE’ section by clicking on the + box.  Select ‘Gene type – protein_coding’.  Click the [Count] button on the toolbar.

This should give you 55 / 53334 Genes. Finally filter for genes that encode proteins containing one or more transmembrane domains:

 Expand the ‘PROTEIN DOMAINS’ section by clicking on the + box.  Select ‘Transmembrane domains – Only’.  Click the [Count] button on the toolbar.

This should give you 5 / 53334 Genes. Specify the attributes to be included in the output (note that a number of attributes will already be default selected):

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Select, in addition to the attributes ‘Ensembl Gene ID’ and ‘Ensembl Transcript ID’ that are already default selected, for instance ‘Associated Gene Name’ and ‘Description’.

Have a look at a preview of the results (only 10 rows of the results will be shown):

 Click the [Results] button on the toolbar.

If you are happy with how the results look in the preview, output all the results:

 Select ‘View All rows as HTML’ or export all results to a file.

Note: When you select ‘View All rows as HTML’, your results will be shown under a new tab or in a new window in your internet browser.

Although you have filtered for only five genes, your results will contain more than five rows. This is because several of the genes have more than one transcript and consequently the results contain a separate row for each transcript.

______________________________________________________________ Exercise 2

BioMart is a very handy tool when you want to map between different databases. The following is a list of 19 accession numbers from the UniProtKB/Swiss-Prot database (http://www.uniprot.org/) of human proteins that are supposedly involved in the sensory perception of pain:

Q99608, P34913, P28482, P28223, Q96LB1, P10997, P01210, P25929, P17481, P43681, P29460, Q9HC23, P20366, Q9Y2W7, Q99572, Q99571, Q00535, P01138, P17787

Generate a list that shows to which Ensembl Gene IDs and to which HGNC symbols these UniProtKB/Swiss-Prot accession numbers map. Also include the gene description.

Hint: Instead of doing a lot of typing (which is not only a waste of time but also introduces the risk of making typos), copy-paste the above list from the digital version of this document, which you can find at

http://www.ebi.ac.uk/training/roadshow/110616_copenhagen.html.

______________________________________________________________

Answer

 Click on ‘Filters’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Select ‘ID list limit – UniProt/Swissprot Accession(s)’.  Enter the list of IDs in the text box (either comma separated or as a list).

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Description’.  Expand the ‘EXTERNAL’ section by clicking on the + box.  Select ‘HGNC symbol’ and ‘UniProt/SwissProt Accession’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Note: BioMart is ‘transcript-centric’, which means that it will often give a separate row of output for each transcript of a gene, even if you don’t include the Ensembl Transcript ID in your output. To get rid of redundant rows, use the ‘Unique results only’ option.

Your results should show 19 genes.

______________________________________________________________ Exercise 3

The paper ‘Fine mapping of the usher syndrome type IC to chromosome 11p14 and identification of flanking markers by haplotype analysis’ (Ayyagari et al. Mol Vis. 1995 Oct 25;1:2) describes the mapping of the human Usher Syndrome type I C to the genomic region between the markers D11S1397 and D11S1310.

Confirm this finding by generating a list of the genes located in this region. Include the Ensembl Gene ID, name and description.

______________________________________________________________

Answer

 Click on ‘Filters’ in the left panel.  Expand the ‘REGION’ section by clicking on the + box.  Enter ‘Marker Start: d11s1397’ and ‘Marker End: d11s1310’.

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Associated Gene Name’ and ‘Description’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file.

Your results should show 32 genes. Among these there should be one gene (ENSG00000006611) named ‘USH1C’ with the description ‘Usher syndrome 1C (autosomal recessive, severe) [Source:HGNC Symbol;Acc:12597]’. This confirms that Ayyagari et al. correctly mapped Usher Syndrome type I C to this genomic region.

______________________________________________________________ Exercise 4

In the paper ‘Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers’ (Forrest et al. Environ Health Perspect. 2005 June; 113(6): 801–807) the effect of benzene exposure on peripheral blood mononuclear cell gene expression in a population of shoe factory workers with well-characterized occupational exposures was examined using microarrays. The microarray used was the Affymetrix U133A/B GeneChip (also called ‘U133 plus 2’). The top 25 probe sets up-regulated by benzene exposure were:

207630_s_at, 221840_at, 219228_at, 204924_at 227613_at, 223454_at, 228962_at, 214696_at, 210732_s_at, 212371_at, 225390_s_at, 227645_at, 226652_at, 221641_s_at, 202055_at, 226743_at, 228393_s_at, 225120_at, 218515_at, 202224_at, 200614_at, 212014_x_at, 223461_at, 209835_x_at, 213315_x_at (a) Generate a list of the genes to which these probesets map. Include the Ensembl Gene ID, name and description as well as the probeset name.

(b) As a first step in order to be able to analyse them for possible regulatory features they have in common, retrieve the 250 bp upstream of the transcripts of these genes. Include the Ensembl Gene and Transcript ID, name and description in the sequence header.

(c) In order to be able to study these human genes in mouse, generate a list of the human genes and their mouse orthologs. Include the Ensembl Gene ID for both the human and mouse genes in your list.

(d) Generate the same list as in (c), but now also include the name and description of both the human and mouse genes. As the name and description of the mouse genes are not available as attributes in the Ensembl human genes dataset, you have to add the Ensembl mouse genes as a second dataset.

______________________________________________________________

Answer

(a)

 Click on ‘Filters’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Select ‘ID list limit - Affy hg u133 plus 2 ID(s)’.  Enter the list of probeset IDs in the text box (either comma separated or as a list).

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Associated Gene Name’ and ‘Description’.  Expand the ‘EXTERNAL’ section by clicking on the + box.  Select ‘Affy HG U133-PLUS-2’.

Your results should show 23 genes. In most cases one probeset maps to one gene. Exceptions are 227613_at and 219228_at, that both map to ENSG00000130844, 212014_x_at and 209835_x_at, that both map to ENSG00000026508, and 213315_x_at, that maps to both ENSG00000197620 and ENSG00000197021. One probeset (226652_at) doesn’t map to an Ensembl gene at all and is therefore not shown in the list of results.

(b) You can leave the dataset and filters the same, so you can directly specify the attributes:

 Click on ‘Attributes’ in the left panel.  Select the ‘Sequences’ attributes page.  Expand the ‘SEQUENCES’ section by clicking on the + box.  Select ‘Flank (Transcript)’.  Enter ‘250’ in the ‘Upstream flank’ text box.  Expand the ‘Header Information’ section by clicking on the + box.  Select ‘Associated Gene Name’ and ‘Description’.

Note: ‘Flank (Transcript)’ will give the flanks for all the transcripts of a gene with multiple transcripts. ‘Flank (Gene)’ will only give the flank for the transcript with the outermost 5’ (or 3’) end.

 Click the [Results] button on the toolbar.  Select ‘View All rows as FASTA’ or export all results to a file.

(c) You can leave the dataset and filters the same, so you can directly specify the attributes:

 Click on ‘Attributes’ in the left panel.  Select the ‘Homologs’ attributes page.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Expand the ‘ORTHOLOGS’ section by clicking on the + box.  Select ‘Mouse Ensembl Gene ID’.

Your results should show that for 18 out of the 23 human genes one mouse ortholog has been identified, while ENSG00000123130 has two mouse orthologs and ENSG00000172716 has three mouse orthologs. For three human genes (ENSG00000186594, ENSG00000130844 and ENSG00000089335) no mouse ortholog has been identified.

(d) You can leave the Ensembl human genes dataset and filters the same, so you can directly specify the attributes:

 Click on ‘Attributes’ in the left panel.  Select the ‘Features’ attributes page.  Make sure that in the ‘GENE’ section the attributes ‘Ensembl Gene ID’, ‘Associated Gene Name’ and ‘Description’ are selected.  Deselect ‘Affy HG U133-PLUS-2’ in the ‘EXTERNAL’ section.

Add the Ensembl mouse genes as a second dataset:

 Click on ‘Dataset’ at the bottom of the left panel.  Choose the ‘[Ensembl Genes 62] Mus musculus genes (NCBIM37)’ dataset.

Specify the same attributes for mouse as for human:

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Associated Gene Name’ and ‘Description’.

Your results should show the same list as in (c), but now with the name and description added for both the human and mouse genes.