The Bioinformatics Roadshow

Tórshavn, The Faroe Islands

28-29 November 2012

BROWSING AND GENOMES WITH ENSEMBL

EXERCISES AND ANSWERS

1

BROWSER 3

BIOMART 8

VARIATION 13

COMPARATIVE GENOMICS 18

2 Note: These exercises are based on Ensembl version 69 (October 2012). After in future a new version has gone live, version 69 will still be available at http://e69.ensembl.org/. If your answer doesn’t correspond with the given answer, please consult the instructor. ______BROWSER ______

Exercise 1 – Exploring a

(a) Find the human F9 (coagulation factor IX) gene. On which and which strand of the genome is this gene located? How many transcripts (splice variants) have been annotated for it?

(b) What is the longest transcript? How long is the it encodes? Has this transcript been annotated automatically (by Ensembl) or manually (by Havana)? How many exons does it have? Are any of the exons completely or partially untranslated?

(c) Have a look at the external references for ENST00000218099. What is the function of F9?

(d) Is it possible to monitor expression of ENST00000218099 with the ILLUMINA HumanWG_6_V2 microarray? If so, can it also be used to monitor expression of the other two transcripts?

(e) In which part (i.e. the N-terminal or C-terminal half) of the protein encoded by ENST00000218099 does its peptidase activity reside?

(f) Have any missense variants been discovered for the protein encoded by ENST00000218099?

(g) Is there a mouse orthologue predicted for the human F9 gene?

(h) If you have yourself a gene of interest, explore what information Ensembl displays about it! ______

Answer

(a)

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Select ‘Search: Human’ and type ‘f9’ or ‘factor IX’ in the ‘for’ text box. 8 Click [Go]. 8 Click on ‘Gene’ on the page with search results. 8 Click on ‘Human’.

3 8 Click on ‘F9’ or ‘ENSG00000101981’.

The human F9 gene is located on the on the forward strand. There have been three transcripts annotated for this gene, ENST00000218099 (F9-001), ENST00000394090 (F9-201) and ENST00000479617 (F9-002).

(b)

The longest transcript is ENST00000218099 (F9-001). The length of this transcript is 2780 base pairs and the length of the encoded protein is 461 amino acids.

8 Click on the transcript ‘F9-001’ in the ‘Gene Summary’ display.

It is an Ensembl/Havana merge transcript that has been annotated both automatically and manually. This can also be seen from the fact that the transcript is golden coloured.

8 Click on the Ensembl Transcript ID ‘ENST00000218099’ in the list of transcripts.

It has eight exons.

8 Click on ‘Sequence - Exons’ in the side menu.

The first and last exons are partially untranslated (sequence shown in purple).

(c)

8 Click on ‘External References - General identifiers’ in the side menu. 8 Explore some of the links (a good place to start is usually ‘UniProtKB/Swiss-Prot’). 8 Do the same for ‘Ontology – Ontology table’.

Factor IX is a vitamin K-dependent plasma protein that participates in the intrinsic pathway of blood coagulation by converting factor X to its active form in the presence of Ca2+ ions, phospholipids, and factor VIIIa (this is the function description as found in UniProtKB/Swiss-Prot).

(d)

8 Click on ‘External References - Oligo probes’ in the side menu.

The ILLUMINA HumanWG_6_V2 microarray contains one probe, ILMN_1787598, that maps to ENST00000218099, so it is possible to monitor expression of this transcript using this array.

4 8 Click on ‘ENST00000394090’ and ‘ENST00000479617’ in the list of transcripts.

No ILLUMINA HumanWG_6_V2 probes map to the other two transcripts, so expression of these transcripts cannot be monitored using this array.

(e)

8 Click on ‘ENST00000218099’ in case you are not already on the ‘Transcript: F9-001’ tab. 8 Click on ‘Protein Information – Protein summary’ in the side menu. 8 Click on ‘Protein Information - Domains & features’ in the side menu.

The peptidase activity of the protein resides in the peptidase domain that is located in the C-terminal half of the protein.

(f)

8 Click on ‘Protein Information - Variations’ in the side menu. 8 Type ‘missense in the ‘Filter’ text box at the top of the ‘Variations’ table.

For the protein encoded by ENST00000218099 many missense variants have been discovered. Most are imported from dbSNP (i.e. all variants with an identifier starting with ‘rs’) while a few are from the COSMIC (Catalogue of Somatic Mutations in Cancer) database (i.e. all variants with an identifier starting with ‘COSM’) and the NHLBI GO Exome Sequencing Project (ESP) (i.e. all variants starting with ‘TMP_ESP’).

(g)

8 Click on the ‘Gene: F9’ tab. 8 Click on ‘Comparative Genomics - Orthologues’ in the side menu. 8 Type ‘mouse’ in the ‘Filter’ text box at the top of the ‘Selected orthologues’ table.

There is one mouse orthologue predicted for human F9, i.e. ENSMUSG00000031138. ______

Exercise 2 – Exploring a region

(a) Go to the region from bp 32,448,000 to 33,198,000 on human chromosome 13. On which cytogenetic band is this region located? How many contigs make up this portion of the assembly (contigs are contiguous stretches of DNA sequence that have been assembled solely based on direct sequencing information)?

5

(b) Zoom in on the BRCA2 gene.

(c) Are there any BAC clones that contain the complete BRCA2 gene?

(d) Add the track with RefSeq gene models. Has RefSeq annotated the BRCA2 gene? If so, how many transcripts have been annotated? Do they differ from the Ensembl transcripts?

(e) Export the genomic sequence of the region you are looking at in FASTA format.

(f) Delete all tracks you added to the ‘Region in detail’ page.

(g) If you have yourself a genomic region of interest, explore what information Ensembl displays about it! ______

Answer

(a)

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Select ‘Search: Human’ and type ‘13:32448000-33198000’ in the ‘for’ text box (or alternatively leave the ‘Search’ drop-down list like it is and type ’human 13:32448000-33198000’ in the ‘for’ text box). 8 Click [Go].

This genomic region is located on cytogenetic band q13.1. It is made up of seven contigs, indicated by the alternating light and dark blue coloured bars in the ‘Contigs’ track.

(b)

8 Draw with your mouse a box around the BRCA2 transcripts. 8 Click on ‘Jump to region’ in the pop-up menu.

(c)

8 Click [Configure this page] in the side menu (or the ‘Configure this page’ icon on the main panel). 8 Type ‘clones’ in the ‘Find a track’ text box. 8 Select ‘1Mb clone set’, ‘32k clone set’ and ‘Tilepath’. 8 Click (P).

6 It doesn’t look like there is a clone that contains the complete BRCA2 gene. For example clone RP11-37E23 contains most of the gene, but not its very 3’ end.

(d)

8 Click [Configure this page] in the side menu. 8 Type ‘refseq’ in the ‘Find a track’ text box. 8 Select ‘RefSeq import – Expanded with labels’. 8 Click (P). 8 Click on individual transcript models to retrieve more information about them.

There has been one transcript annotated by RefSeq for the BRCA2 gene, i.e. NM_000059.3. This transcript is almost identical to Ensembl transcript BRCA2-001 (ENST00000380152). Both encode a 3418 aa protein. The RefSeq transcript is 6 bp shorter at the 5’ end and 462 bp longer at the 3’ end.

(e)

8 Click [Export data] in the side menu. 8 Click [Next>]. 8 Click on ‘Text’.

Note that the sequence has a header line that provides information about the genome assembly (GRCh37), the chromosome, the start and end coordinates and the strand. For example:

>13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1

(f)

8 Click [Configure this page] in the side menu. 8 Click [Reset configuration]. 8 Click (P). ______

7 ______BIOMART ______

Exercise 1

The paper ‘Fine mapping of the usher syndrome type IC to chromosome 11p14 and identification of flanking markers by haplotype analysis’ (Ayyagari et al. Mol Vis. 1995 Oct 25;1:2) describes the mapping of the human Usher Syndrome type I C to the genomic region between the markers D11S1397 and D11S1310.

Confirm this finding by generating a list of the genes located in the region between D11S1397 and D11S1310. Include the Ensembl Gene ID, name and description. ______

Answer

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Click on the ‘BioMart’ link on the toolbar.

... or if you are already in BioMart:

8 Click the [New] button on the toolbar.

8 Choose the ‘Ensembl Genes 69’ database. 8 Choose the ‘Homo sapiens genes (GRCh37.p8)’ dataset.

8 Click on ‘Filters’ in the left panel. 8 Expand the ‘REGION’ section by clicking on the + box. 8 Type ‘Marker Start: d11s1397’ and ‘Marker End: d11s1310’.

8 Click on ‘Attributes’ in the left panel. 8 Expand the ‘GENE’ section by clicking on the + box. 8 Deselect ‘Ensembl Transcript ID’. 8 Select ‘Associated Gene Name’ and ‘Description’.

8 Click the [Results] button on the toolbar. 8 Select ‘View All rows as HTML’ or export all results to a file.

Your results should show 33 genes. Among these there should be one gene (ENSG00000006611) named ‘USH1C’ with the description ‘Usher syndrome 1C (autosomal recessive, severe) [Source:HGNC Symbol;Acc:12597]’. This confirms that Ayyagari et al. correctly mapped Usher Syndrome type I C to this genomic region.

8 ______

Exercise 2

In the paper ‘Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers’ (Forrest et al. Environ Health Perspect. 2005 June; 113(6): 801–807) the effect of benzene exposure on peripheral blood mononuclear cell gene expression in a population of shoe factory workers with well-characterized occupational exposures was examined using microarrays. The microarray used was the Affymetrix U133A/B GeneChip (also called ‘U133 plus 2’). The top 25 probe sets up-regulated by benzene exposure were:

207630_s_at, 221840_at, 219228_at, 204924_at 227613_at, 223454_at, 228962_at, 214696_at, 210732_s_at, 212371_at, 225390_s_at, 227645_at, 226652_at, 221641_s_at, 202055_at, 226743_at, 228393_s_at, 225120_at, 218515_at, 202224_at, 200614_at, 212014_x_at, 223461_at, 209835_x_at, 213315_x_at

(a) Generate a list of the genes to which these probe sets map. Include the Ensembl Gene ID, name and description as well as the probe set name.

(b) As a first step towards analysing them for possible regulatory features they have in common, retrieve the 250 bp upstream of the transcripts of these genes. Include the Ensembl Gene and Transcript ID, name and description in the sequence header.

(c) In order to be able to study these human genes in mouse, generate a list of the human genes and their mouse orthologues. Include the Ensembl Gene ID for both the human and mouse genes and the homology type in your list. ______

Answer

(a)

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Click on the ‘BioMart’ link on the toolbar.

... or if you are already in BioMart:

8 Click the [New] button on the toolbar.

8 Choose the ‘Ensembl Genes 69’ database. 8 Choose the ‘Homo sapiens genes (GRCh37.p8)’ dataset.

8 Click on ‘Filters’ in the left panel. 8 Expand the ‘GENE’ section by clicking on the + box.

9 8 Select ‘ID list limit - Affy hg u133 plus 2 probeset ID(s)’. 8 Enter the list of probeset IDs in the text box (either comma separated or as a list).

8 Click on ‘Attributes’ in the left panel. 8 Expand the ‘GENE’ section by clicking on the + box. 8 Deselect ‘Ensembl Transcript ID’. 8 Select ‘Associated Gene Name’ and ‘Description’. 8 Expand the ‘EXTERNAL’ section by clicking on the + box. 8 Select ‘Affy HG U133-PLUS-2 probeset’.

8 Click the [Results] button on the toolbar. 8 Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Your results should show 25 genes. In most cases, one probe set maps to one gene. Exceptions are 212014_x_at and 209835_x_at, that both map to ENSG00000026508 (CD44), 227613_at and 219228_at, that both map to ENSG00000130844 (ZNF331) and 213315_x_at, that maps to both ENSG00000197620 (CXorf40A) and ENSG00000197021 (CXorf40B).

(b)

You can leave the dataset and filters the same, so you can directly specify the attributes:

8 Click on ‘Attributes’ in the left panel. 8 Select the ‘Sequences’ attributes page. 8 Expand the ‘SEQUENCES’ section by clicking on the + box. 8 Select ‘Flank (Transcript)’. 8 Type ‘250’ in the ‘Upstream flank’ text box. 8 Expand the ‘Header Information’ section by clicking on the + box. 8 Select ‘Associated Gene Name’ and ‘Description’.

Note: ‘Flank (Transcript)’ will give the flanks for all the transcripts of a gene with multiple transcripts. ‘Flank (Gene)’ will only give the flank for the transcript with the outermost 5’ (or 3’) end.

8 Click the [Results] button on the toolbar. 8 Select ‘View All rows as FASTA’ or export all results to a file.

(c)

You can leave the dataset and filters the same, so you can directly specify the attributes:

10 8 Click on ‘Attributes’ in the left panel. 8 Select the ‘Homologs’ attributes page. 8 Expand the ‘GENE’ section by clicking on the + box. 8 Deselect ‘Ensembl Transcript ID’. 8 Expand the ‘ORTHOLOGS’ section by clicking on the + box. 8 Select ‘Mouse Ensembl Gene ID’ and ‘Homology Type’.

8 Click the [Results] button on the toolbar. 8 Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Your results should show that for most of the 25 human genes a one-to-one orthologue in mouse has been identified, while ENSG00000123130 has two mouse orthologues and ENSG00000172716 has three mouse orthologues. ENSG00000197620 and ENSG00000197021 map to the same mouse gene. For four human genes (ENSG00000186594, ENSG00000263141, ENSG00000130844 and ENSG00000089335) no mouse orthologue has been identified. ______

Exercise 3

In the paper ‘Identification of seven new prostate cancer susceptibility loci through a genome-wide association study’ (Eeles et al. Nat Genet. 2009 Oct;41(10):1116-21.) the following twelve variants are shown to be associated with prostate cancer susceptibility: rs10993994, rs2735839, rs4242384, rs6983267, rs7931342, rs7501939, rs9364554, rs6465657, rs5945619, rs2660753, rs1016343, rs1859962.

Use BioMart to generate a list of the genes to which these variants map. Include the Ensembl Gene ID, name and description.

Hint: You should start with the Ensembl Variation Mart. To be able to include the gene name and description, add the Ensembl human genes as a second dataset. ______

Answer

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Click on the ‘BioMart’ link on the toolbar.

8 Choose the ‘Ensembl Variation 69’ database. 8 Choose the ‘Homo sapiens Variation (GRCh37)’ dataset.

8 Click on ‘Filters’ in the left panel.

11 8 Expand the ‘GENERAL VARIATION FILTERS’ section by clicking on the + box. 8 Enter the list of SNP IDs in the ‘Filter by Variation ID’ text box (either comma separated or as a list).

Add the Ensembl human genes as a second dataset:

8 Click on ‘Dataset’ at the bottom of the left panel. 8 Choose the ‘[Ensembl Genes 69] Homo sapiens genes (GRCh37.p8)’ dataset.

8 Click on ‘Attributes’ in the left panel. 8 Expand the ‘GENE’ section by clicking on the + box. 8 Deselect ‘Ensembl Transcript ID’. 8 Select ‘Description’. 8 Expand the ‘EXTERNAL’ section by clicking on the + box. 8 Select ‘HGNC symbol’.

8 Click the [Results] button on the toolbar. 8 Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Your results should show that six out of the twelve variants map to one Ensembl gene, while three (rs5945619, rs10993994 and rs7501939) map to two Ensembl genes. Seven of the genes have an HGNC symbol assigned. Three of the twelve variants don’t map to an Ensembl gene. ______

12 ______VARIATION ______

Exercise 1 – Exploring a sequence variant

The MTHFR (methylenetetrahydrofolate reductase (NAD(P)H)) gene encodes an enzyme that is involved in the processing of amino acids. One of the variants in this gene, Ala222Val (A222V), can lead in human to elevated plasma homocysteine levels, which is considered to be a risk factor for cardiovascular disease (see also http://en.wikipedia.org/wiki/Methylenetetrahydrofolate_reductase and http://en.wikipedia.org/wiki/Homocysteine).

(a) Find the MTHFR gene for human. Go to the ‘Variation table’ page. What is the dbSNP accession number (rs number) for the Ala222Val variant?

(b) Is the consequence type of the variant missense for all transcripts that have been annotated for the MTHFR gene?

(c) Why are its alleles in Ensembl given as G/A and not as C/T, like in the literature (C665T or C677T) and dbSNP?

(d) Why does Ensembl put the G allele first (G/A)?

(e) Is there ethnic variability in the frequency of the A allele?

(f) In which paper is the association between the variant and homocysteine levels described?

(g) According to the data imported from dbSNP, G is the ancestral allele of this variant. Ancestral alleles in dbSNP are based on a comparison between human and chimp. Does the sequence at the position of the variant in the other primates, i.e. gorilla, orangutan, macaque and marmoset, confirm that G is indeed the ancestral allele?

(h) (optional) Were both alleles already present in Neandertal? To answer this question, have a look at the individual reads at the genomic position of the variant in the Neandertal Genome Browser (http://neandertal.ensemblgenomes.org/). ______

Answer

(a)

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Select ‘Search: Human’ and type ‘mthfr’ in the ‘for’ text box.

13 8 Click [Go]. 8 Click on ‘Gene’ on the page with search results. 8 Click on ‘Human’. 8 Click on ‘Variation Table’ on the page with search results. 8 Click on ‘Show’ for the ‘Missense variants’ in the ‘Summary of variations in ENSG00000177000 by consequence’ table. 8 Type e.g. ‘222’ and/or ‘a/v’ in the ‘Filter’ text box.

The dbSNP accession number for the Ala222Val variant is rs1801133.

Note that HGVS ( Variation Society) notations are not by default shown in the table. They can be added as follows:

8 Click on ‘Configure this page’ in the side menu. 8 Click on ‘Consequence options’. 8 Check ‘Show HGVS notations’. 8 Click (P).

(b)

8 Click on ‘rs1801133’. 8 Click on ‘Genomic context – Genes and regulation’ in the side menu or on the ‘ Genes and regulation’ icon.

No, rs1801133 is missense for four MTHFR transcripts. It is downstream for one MTHFR transcript, i.e. ENST00000418034.

Note that in total nine transcripts have been annotated for the MTHFR gene: http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00 000177000.

(c)

In Ensembl the alleles of rs1801133 are given as G/A, because these are the alleles in the forward strand of the genome. In the literature and dbSNP (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs1801133) the alleles are given as C/T, because the MTHFR gene is located on the reverse strand of the genome, thus the alleles in the actual gene and transcript sequences are C/T.

(d)

In Ensembl the allele that is present in the GRCh37 reference genome assembly is put first, i.e. G. In the literature normally the major allele (in the population of interest) is put first. In the case of rs1801133 the allele in the reference genome is the major allele in almost all populations studied, but as

14 the reference genome is an amalgamation of the genomes of just a few individuals this is by no means the case for all variants.

(e)

8 Click on ‘Population genetics’ in the side menu.

Yes, there is considerable ethnic variation in the frequency of the A allele. Among the 1000 Genomes populations studied, it ranges from 0.108 in the YRI (Yoruba in Ibadan, Nigera) population to 0.567 in the CLM (Colombians from Medellin, Colombia) population.

(f)

8 Click on ‘Phenotype Data’ in the side menu. 8 Click on ‘pubmed/20031578’ behind ‘Homocysteine levels’ in the ‘Phenotype Data’ table.

The association between rs1801133 and homocysteine levels is described in the paper ‘Novel associations of CPS1, MUT, NOX4, and DPEP1 with plasma homocysteine in a healthy population: a genome-wide evaluation of 13 974 participants in the Women's Genome Health Study’ (Paré et al. Circ Cardiovasc Genet. 2009 Apr;2(2):142-50.).

(g)

8 Click on ‘Phylogenetic Context’ in the side menu.

Gorilla, orangutan, macaque and marmoset all have a G in this position, which confirms that G is indeed the ancestral allele.

(h)

8 Go to the Neandertal Genome Browser (http://neandertal.ensemblgenomes.org/). 8 Type ‘rs1801133’ in the ‘Search Neandertal’ text box. 8 Click [Go]. 8 Click on ‘rs1801133’ on the page with search results. 8 Click on ‘Jump to region in detail’. 8 Click on ‘Configure this page’ in the side menu. 8 Click on ‘Variation features’. 8 Select ‘All variations – Normal’. 8 Click [SAVE and close]. 8 Draw a box of about 50 bp around rs1801133 (shown in yellow in the center of the display). 8 Click on ‘Jump to region’ on the pop-up menu.

15

The ‘Sequences’ track shows that there are three reads for Neandertal at the position of rs1801133, all with a G, so based on these (extremely limited!) data there is no evidence that both alleles were already present in Neandertal. ______

Exercise 2 – Variant Effect Predictor

Resequencing of the genomic region of the human CFTR (cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)) gene (ENSG00000001626) has, amongst others, revealed the following variants:

7 117171039 117171039 G/A + 7 117171092 117171092 T/C + 7 117171122 117171122 T/C +

(a) Determine if these variants result in a change in the encoded by any of the Ensembl transcripts of the CFTR gene. Are any of the variants deleterious according to the SIFT tool? Does PolyPhen agree with this? Have the variants already been annotated by Ensembl?

(b) Have a look at the uploaded variants on the ‘Region in detail’ page. ______

Answer

(a)

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Click on the ‘Tools’ link on the toolbar. 8 Click on the icon (spanner/wrench) for the ‘Variant Effect Predictor – Online tool’. 8 Type ‘New variants’ in the ‘Name for this upload (optional)’ text box. 8 Enter the list of variants in the ‘Paste file’ text box

7 117171039 117171039 G/A + 7 117171092 117171092 T/C + 7 117171122 117171122 T/C +

8 Select ‘SIFT predictions: Prediction only’. 8 Select ‘PolyPhen predictions: Prediction only’. 8 Click [Next>]. 8 Click on ‘HTML’.

Two of the variants (at positions 117171092 and 117171122) are missense in three of the encoded proteins and thus cause an amino acid change (L/P and

16 I/T, respectively). One variant (at position 117171039) is synonymous and thus doesn’t change the amino acid sequence (A).

Both missense variants are considered deleterious according to SIFT. PolyPhen does not agree in all cases.

All three variants have already been annotated by Ensembl as can be seen from the dbSNP and COSMIC accession numbers that are shown in the ‘Co- located Variation’ column.

(b)

8 Click for one of the variants on the coordinates in the ‘Location’ column on the ‘Variant Effect Predictor Results’ page.

The uploaded variants are shown on the ‘Region in detail’ page in a new track named ‘New variants’. The ‘Sequence variants (dbSNP and all other sources)’ track has also been added by default to allow comparison with the uploaded variants.

8 Draw with your mouse a box around the uploaded variants. 8 Click on ‘Jump to region’ in the pop-up menu. 8 Drag the ‘New variants’ track until it’s just above or below the ‘Sequence variants (dbSNP and all other sources)’ track, so the two tracks can be more easily compared.

It can be clearly seen that all three uploaded variants correspond to an already annotated variant shown in the ‘Sequence variants (dbSNP and all other sources)’ track, one synonymous (green) and two missense (yellow). ______

17 ______COMPARATIVE GENOMICS ______

Exercise 1 – Orthologues, paralogues and gene trees

The photoreceptor cells in the retina of the contain a number of different photoreceptors. The rod cells contain , which is responsible for monochromatic vision in the dark. The cone cells all contain one of three types of , which respond to long-wave (red), medium-wave (green) and short-wave (blue) light, respectively, and are responsible for trichromatic colour vision (see also http://en.wikipedia.org/wiki/Opsin).

(a) Find the gene encoding the long-wave-sensitive (red) for human.

(b) How many within-species paralogues have been identified for this gene? Note the ‘Target %id’ and ‘Query %id’. Which paralogues show the most sequence similarity with the red opsin?

(c) Have a look at the genomic location of the OPN1LW (red opsin), OPN1MW and OPN1MW2 (green opsin), and OPN1SW (blue opsin) genes. Does their location explain why red-green colour blindness is much more prevalent in males than in females (e.g. in the US population 7% vs 0.4%)?

(d) Have a look at the gene tree for the OPN1LW gene. Which of its paralogues are due to the most recent duplication event? Is this reflected in the sequence similarity between the red opsin and these paralogues when compared with the other paralogues (see question b)? On which taxonomic level did this duplication take place?

(e) Retrieve an alignment between the red opsin and all its paralogues in Jalview. To this end, select all aligned protein sequences using ‘Select > Select all’, then order them by using ‘Calculate > Order > by ID’, subsequently select all human protein sequences and finally use ‘Select > Invert Sequence Selection’ and ‘Edit > Delete’ to delete all non-human protein sequences. ______

Answer

(a)

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Select ‘Search: Human’ and type ‘long wave sensitive opsin’ in the ‘for’ text box. 8 Click [Go]. 8 Click on ‘Gene’ on the page with search results. 8 Click on ‘Human’.

18 8 Click on ‘OPN1LW’.

Note that ‘LW’ in the gene symbol OPN1LW stands for ‘long-wave’.

(b)

8 Click on ‘Comparative Genomics - Paralogues’ in the side menu.

Nine within-species paralogues have been identified for the human OPN1LW gene. According to the Target and Query %id, the proteins encoded by the genes ENSG00000147380 (OPN1MW) and ENSG00000166160 (OPN1MW2), i.e. the medium-wave-sensitive (green) opsins, show the highest sequence similarity to red opsin (Target %id indicates the percentage of the sequence of red opsin matching the sequence of the paralogue protein. Query %id indicates the percentage of the sequence of the paralogue protein matching the sequence of red opsin).

(c)

8 Click on the ‘Location: X:153,409,698-153,424,507’ tab.

The OPN1LW (red opsin) and OPN1MW and OPN1MW2 (green opsin) genes are located next to each other on the X chromosome, while the OPN1SW (blue opsin) gene is located on chromosome 7. As females have two X a normal gene on one chromosome can often make up for a defective one on the other, whereas males cannot make up for a defective gene. Thus, red-green colour blindness is much more prevalent in males than in females. Variation in the genes for red and green opsin can cause subtle differences in colour perception, while tandem rearrangements due to unequal crossing-over between these genes cause more serious defects in colour vision.

(d)

8 Click on the ‘Gene: OPN1LW’ tab. 8 Click on ‘Comparative Genomics - Gene tree (image)’ in the side menu. 8 Click on ‘View options: View paralogs of current gene’ below the gene tree image. 8 Click on the nodes (red squares) for the duplication events that have given rise to the various paralogues.

A duplication event on the level of the Hominines or Homininae (humans, gorillas and chimpanzees) has given rise to the OPN1LW (red opsin) and OPN1MW and OPN1MW2 (green opsin) genes. The other paralogues are due to earlier duplication events. This agrees with the fact that the green opsins show the highest sequence similarity with red opsin (see question b) and the fact that the genes for the red and green opsins are located close to each other on the genome (see question c).

19

Note: On the ‘Paralogues’ page nine paralogues are shown (see question b). Five of these are of the type ‘other paralogue’. These are paralogues that are too distant to be in the same gene tree, but can still be related as part of a broader “super-family”. Therefore, the gene tree for the OPN1LW gene only shows four of its nine paralogues. The precise taxonomic level of duplication for the ‘other paralogues’ is left as undetermined.

(e)

8 Click on the speciation node (blue square) that is at the base of the complete gene tree. 8 Click on ‘Expand for Jalview’ in the pop-up menu (that should say ‘Taxon: Chordates’). 8 Click [Start Jalview]. 8 Close the pop-up window with the gene tree. 8 Click on ‘Select > Select all’ on the menu bar of the pop-up window with the protein sequence alignment. 8 Click on ‘Calculate > Sort > by ID’ on the menu bar. 8 Select the protein sequences of the human paralogues. 8 Click on ‘Select > Invert Sequence Selection’ on the menu bar. 8 Click on ‘Edit > Delete’ on the menu bar.

As the alignment is based on the complete set of protein sequences in the gene tree, the alignment of this subset of five proteins will contain empty columns. These can be removed using the option ‘Edit > Remove Empty Columns’ on the menu bar.

8 Click on ‘Edit > Remove Empty Columns’ on the menu bar. ______

Exercise 2 – Whole genome alignments

Not only protein coding sequences are evolutionary conserved. Many conserved, and even ultraconserved, sequences are found in the non-coding parts of the genome. These are of interest because of their potential to be involved in gene regulation.

(a) Find the Ensembl BRCA2 (breast cancer type 2 susceptibility protein) gene for human and go to its ‘Region in detail’ page.

(b) Turn on some of the ‘BLASTZ alignment’ and ‘Translated BLAT alignment’ tracks for a broad taxonomic range of species (e.g. chimp, mouse, platypus, chicken, anole lizard and zebrafish). Does the degree of conservation between human and the various other species reflect their evolutionary relationship? Which parts of the BRCA2 gene show the most conservation? Would you have expected this?

20

(c) Have a look at the ‘Conservation score’ and ‘Constrained elements’ tracks for the sets of 36 mammals and 19 vertebrates. Do these tracks confirm what is shown in the tracks with pairwise alignment data?

(d) Go to the human POLA1 (polymerase (DNA directed), alpha 1, catalytic subunit) gene. How does conservation in this gene compare with that in the BRCA2 gene?

(e) In the paper ‘Ultraconserved elements in the human genome’ (Bejerano et al. Science. 2004 May 28;304(5675):1321-5) sequences are described that are for no apparent reason perfectly conserved across many mammals (and often beyond).

Below are the coordinates of the ten ultraconserved elements that are located in the POLA1 gene and the intergenic region between POLA1 and its downstream neighbour, the ARX (aristaless related homeobox) gene. chrX 24823511 24823785 uc.460 chrX 24864797 24865193 uc.461 chrX 24894826 24895604 uc.462 chrX 24915882 24916156 uc.463 chrX 24916158 24916927 uc.464 chrX 24917481 24917790 uc.465 chrX 24946458 24946806 uc.466 chrX 25008354 25009084 uc.467 chrX 25017563 25018051 uc.468 chrX 25018053 25018274 uc.469

(These coordinates were taken from http://users.soe.ucsc.edu/~jill/ultra.html and converted from NCBI35 to GRCh37).

Upload these ten elements to Ensembl.

Hint: To this end, click [Manage your data] in the side menu and subsequently on ‘Upload Data’. The rest should be self-explanatory. The above data are in BED format.

(f) Have a look at the conservation of uc.460 on the basepair level for the group of 13 eutherian mammals and the group of 19 amniota vertebrates. Is this ultraconserved element indeed almost perfectly conserved across the mammals? And beyond? ______

Answer

(a)

8 Go to the Ensembl homepage (http://www.ensembl.org/). 8 Select ‘Search: Human’ and type ‘brca2 ’ in the ‘for’ text box.

21 8 Click [Go]. 8 Click on ‘Gene’ on the page with search results. 8 Click on ‘Human’. 8 Click on ’13:32889611-32973805:1’ below ‘BRCA2’.

You may want to turn off all tracks that you added to the display in the previous exercises.

8 Click [Configure this page] in the side menu. 8 Click [Reset configuration]. 8 Click (P).

(b)

8 Click [Configure this page] in the side menu 8 Click on ‘Comparative genomics - BLASTZ/LASTz alignments’. 8 Select ‘Chicken (Gallus gallus) - BLASTZ_NET – Compact’, ‘Chimpanzee (Pan troglodytes) – LASTZ_NET – Compact’, ‘Mouse (Mus musculus) – LASTZ_NET – Compact’ and ‘Platypus (Ornithorhynchus anatinus) - BLASTZ_NET – Compact’. 8 Click on ‘Comparative genomics - Translated blat alignments’. 8 Select ‘Anole Lizard (Anolis carolinensis) – Compact’ and ‘Zebrafish (Danio rerio) – Compact’. 8 Click (P).

Yes, the degree of conservation does reflect the evolutionary relationship between human and the other species; the highest degree of conservation is found in chimp, followed by mouse, platypus, chicken, lizard and zebrafish, respectively. Especially the exonic sequences of BRCA2 seem to be highly conserved between the various species, which is what is to be expected because these are supposed to be under higher selection pressure than intronic and intergenic sequences.

(c)

8 Click [Configure this page] in the side menu 8 Click on ‘Comparative genomics – Conservation regions’. 8 Click on ‘Enable/disable all Conservation regions’ and select ‘On’. 8 Click (P).

The ‘Conservation score’ and ‘Constrained elements’ tracks largely correspond with the data seen in the pairwise alignment tracks; all exons of the BRCA2 gene show a high degree of conservation. Note that the 5’ and 3’ UTRs don’t, though.

22 (d)

8 Type ‘pola1’ in the ‘Gene’ text box. 8 Click [Go].

In contrast to the BRCA2 gene, the POLA1 gene also shows conservation in its non-coding regions, especially in its last three introns.

(e)

8 Click [Manage your data] in the side menu. 8 Click on ‘Upload Data’. 8 Type ‘Ultraconserved elements’ in the ‘Name for this upload (optional)’ text box. 8 Select ‘Data format: BED’. 8 Paste the list of ultraconserved elements in the ‘Paste data’ text box. 8 Click [Upload]. 8 Click (P).

A new track named ‘Ultraconserved elements’ has now been added to the ‘Region in detail’ page.

To display the names of the ultraconserved elements:

8 Hover over the ‘Ultraconserved elements’ track name. 8 Hover over the ‘Change track style’ icon (the cogwheel). 8 Select ‘Labels’.

8 Zoom in on the region encompassing the 3’ ends of the POLA1 and ARX genes and the intergenic region.

Because the group of conserved sequences lies at the 3’ end of the 303-kb POLA1 gene, nearer to the 3’ end of ARX than to the rest of POLA1 it has been suggested by Bejerano et al. that their function is not related to POLA1 but that they instead form a cluster of enhancers of ARX.

(f)

8 Type ‘X:24823511-24823785’ in the ‘Location’ text box. 8 Click [Go]. 8 Click on ‘Comparative Genomics – Alignments (text)’ in the side menu. 8 Select ‘Alignment: 13 eutherian mammals EPO. 8 Click [Go].

To give positions in the alignment where >50% of bases match a light-blue background colour:

23

8 Click [Configure this page]. 8 Select ‘Conservation regions: All conserved regions’. 8 Click (P).

Ultraconserved element uc.460 is indeed almost perfectly conserved across the mammals.

8 Select ‘Alignment: 19 amniota vertebrates Pecan’. 8 Click [Go].

Uc.460 is also conserved in other vertebrates like opossum, chicken, turkey, zebrafinch and anole lizard. ______

24