The Bioinformatics Roadshow

Copenhagen, Denmark

16 & 17 June 2011

BROWSING AND GENOMES WITH ENSEMBL

EXERCISES AND ANSWERS

Note: These exercises are based on Ensembl version 62 (13 April 2011). After in future a new version has gone live, version 62 will still be available at http://e62.ensembl.org. If your answer doesn’t correspond with the given answer, please consult the instructor. ______

BROWSER ______

Exercise 1 – Exploring a

(a) Find the human F9 (Coagulation factor IX) gene. On which and which strand of the genome is this gene located? How many transcripts (splice variants) have been annotated for it?

(b) What is the longest transcript? How long is the it encodes? How many exons does it have? Are any of the exons completely or partially untranslated?

(c) Have a look at the external references for ENST00000218099. What is the function of F9?

(d) Is it possible to monitor expression of ENST00000218099 with the CodeLink microarray? If so, can it also be used to monitor expression of the other two transcripts?

(e) In which part (i.e. the N-terminal or C-terminal half) of the protein encoded by ENST00000218099 does its peptidase activity, responsible for the cleavage of factor X to yield its active form factor Xa, reside?

(f) How many non-synonymous variants have been discovered for the protein encoded by ENST00000218099?

(g) Is there a mouse ortholog predicted for the human F9 gene?

(h) On which cytogenetic band and on which contig is the F9 gene located? Is there a BAC clone that contains the complete F9 gene?

(i) If you have yourself a gene of interest, explore what information Ensembl displays about it! ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘f9’ or ‘factor IX’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ‘F9 [ Ensembl/Havana merge gene: ENSG00000101981 ]’.

The human F9 gene is located on the X chromosome on the forward strand. There have been three transcripts annotated for this gene, ENST00000218099 (F9-001), ENST00000394090 (F9-201) and ENST00000479617 (F9-002).

(b)

The longest transcript is ENST00000218099. The length of this transcript is 2780 base pairs and the length of the encoded protein is 461 amino acids.

 Click on the Ensembl Transcript ID ‘ENST00000218099’ in the list of transcripts.

It has eight exons.

 Click on ‘Sequence - Exons’ in the side menu.

The first and last exon are partially untranslated (sequence shown in purple).

(c)

 Click on ‘External References - General identifiers’ in the side menu.  Explore some of the links (good places to start are usually ‘UniProtKB/Swiss-Prot’ and ‘WikiGene’).  Do the same for ‘Ontology – Ontology table’.

Factor IX is a vitamin K-dependent plasma protein that participates in the intrinsic pathway of blood coagulation by converting factor X to its active form in the presence of Ca2+ ions, phospholipids, and factor VIIIa (this is the function description as found in UniProtKB/Swiss-Prot).

(d)

 Click on ‘External References - Oligo probes’ in the side menu.

The CodeLink microarray contains one probe, GE80895, that maps to ENST00000218099, so it is possible to monitor expression of this transcript using this array.

 Click on ‘ENST00000394090’ and ‘ENST00000479617’ in the list of transcripts.

The probe GE80895 also maps to ENST00000394090. There is no CodeLink probe that maps to ENST00000479617 though, so expression of this transcript cannot be monitored using this array.

(e)

 Click on ‘ENST00000218099’ in case you are not already on the ‘Transcript: F9-001’ tab.  Click on ‘Protein Information – Protein summary’ in the side menu.  Click on ‘Protein Information - Domains & features’ in the side menu.

The peptidase activity of the protein resides in the peptidase domain that is located in the C-terminal half of the protein.

(f)

 Click on ‘Protein Information - Variations’ in the side menu.

For the protein encoded by ENST00000218099 738 variants are shown. The vast majority of these are mutations from the Human Gene Mutation Database (HGMD). As the HGMD is a subscription database, Ensembl is only allowed to show the ID and the position of these mutations (if you want to get more information, you have to register at the HGMD). Apart from these a small number of variants from other sources are shown with more comprehensive information. Four of these are non-synonymous, i.e. rs104894807, rs6048, rs1801202 and rs4149751.

(g)

 Click on the ‘Gene: F9’ tab.  Click on ‘Comparative Genomics - Orthologues’ in the side menu.  Type ‘mouse’ in the ‘Filter’ text box at the top of the ‘Orthologues’ table.

There is one mouse ortholog predicted for human F9, ENSMUSG00000031138.

(h)

 Click on the ‘Location: X:138,612,917-138,645,617’ tab.

The F9 gene is located on cytogenetic band q27.1 and on contig AL033403.1.

 Click [Configure this page] in the side menu.  Type ‘clone’ in the ‘Find a track’ text box.  Select ‘1 Mb clone set’, ‘30k clone set’, ‘32k clone set’ and ‘Tilepath’.  Click ().

There are several BAC clones that contain the complete F9 gene, e.g. the tilepath clone RP6-88D7 and the clones RP11-1144I12 and RP13-520K15. ______

Exercise 2 – Exploring a region

(a) Go to the region from bp 52,600,000 to 53,400,000 on human chromosome 4. How many contigs make up this portion of the assembly (contigs are contiguous stretches of DNA sequence that have been assembled solely based on direct sequencing information)? What does the open (i.e. non-blue) part in the ‘Contigs’ track represent?

(b) Do the tilepath clones (i.e. the BAC clones that were sequenced for the assembly) correspond with the contigs?

(c) Zoom in on the SGCB transcripts, including a bit of flanking sequence on both sides.

(d) CpG islands are genomic regions that contain a high frequency of CG dinucleotides and are often located near the promoter of mammalian genes. Is there a CpG island located at the 5’ end of the SGCB transcripts? Possible transcription start sites can be predicted using the Eponine program (http://www.sanger.ac.uk/Software/analysis/eponine/). Is there a transcription start site predicted by Eponine annotated for the SGCB transcripts?

(e) Export the genomic sequence of the region you are looking at in FASTA format.

(f) If you have yourself a genomic region of interest, explore what information Ensembl displays about it! ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘4:52600000-53400000’ in the ‘for’ text box (or alternatively leave the ‘Search’ drop-down list like it is and type ’human 4:52600000-53400000’ in the ‘for’ text box).  Click [Go].

This genomic region is made up of seven contigs, indicated by the alternating light and dark blue colored bars in the ‘Contigs’ track. One of them, AC107402.4, located between AC093858.2 and AC093880.4, is very small and possibly not visible. The open part in the ‘Contigs’ track represents a gap in the genome assembly (although the human genome is called ‘finished’, there are still gaps!). Note that this region is very close to the centromere of the chromosome.

If you cannot see the very tiny contig AC107402.4, you may want to increase the width of your display as follows:

 Click [Configure this page] in the side menu.  Click on the ‘Configure page’ tab.  Select a larger pixels value or ‘best fit’ from the ‘Width of image’ drop-down list.  Click ().

(b)

If the tilepath clones are not already shown:

 Click [Configure this page] in the side menu.  Type ‘tilepath’ in the ‘Find a track’ text box.  Select ‘Tilepath’.  Click ().

The tilepath clones do correspond to the contigs and it is easy to see from which BAC clone which contig sequence in the assembly is derived, e.g. AC027271.7 is derived from RP11-365H22, AC104784.5 is derived from RP11-61F5 etc.

(c)

 Draw with your mouse a box around the SGCB transcripts.  Click on ‘Jump to region’ in the pop-up menu.

(d)

 Click [Configure this page] in the side menu.  Type ‘cpg’ in the ‘Find a track’ text box.  Select ‘CpG islands’.  Click ()

There is indeed a CpG island located at the 5’ end of the SGCB-001, SGCB- 201 and SGCB-002 transcripts. Note that you also clearly can see this island in the ‘%GC’ track, that should be shown by default.

 Click [Configure this page] in the side menu.  Type ‘eponine’ in the ‘Find a track’ text box.  Select ‘TSS (Eponine)’.  Click ().

Eponine TSS-finder predicts a transcription start site close to the starts of the SGCB-001, SGCB-201 and SGCB-002 transcripts.

(e)

 Click [Export data] in the side menu.  Click [Next>].  Click on ‘Text’.

Note that the sequence has a header that provides information about the genome assembly (GRCh37), the chromosome, the start and end coordinates and the strand. For example:

>4 dna:chromosome chromosome:GRCh37:4:52886013:52905920:1 ______

______

BIOMART ______

Exercise 1

Generate a list of all human genes that are located on the Y chromosome, that are protein-coding and that encode for a protein containing one or more transmembrane domains. Do a count after selection of each filter to check the number of genes remaining in your dataset. ______

Answer

 Go to the Ensembl homepage (http://www.ensembl.org).  Click on the ‘BioMart’ link on the toolbar.

Start with all human Ensembl genes:

 Choose the ‘Ensembl Genes 62’ database.  Choose the ‘Homo sapiens genes (GRCh37.p3)’ dataset.

Now filter for the genes on the Y chromosome:

 Click on ‘Filters’ in the left panel.  Expand the ‘REGION’ section by clicking on the + box.  Select ‘Chromosome – Y’. Make sure the check box in front of the filter is ticked, otherwise the filter won’t work.  Click the [Count] button on the toolbar.

This should give you 511 / 53334 Genes.

Now filter further for genes that are protein-coding:

 Expand the ‘GENE’ section by clicking on the + box.  Select ‘Gene type – protein_coding’.  Click the [Count] button on the toolbar.

This should give you 55 / 53334 Genes.

Finally filter for genes that encode containing one or more transmembrane domains:

 Expand the ‘PROTEIN DOMAINS’ section by clicking on the + box.  Select ‘Transmembrane domains – Only’.  Click the [Count] button on the toolbar.

This should give you 5 / 53334 Genes.

Specify the attributes to be included in the output (note that a number of attributes will already be default selected):

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Select, in addition to the attributes ‘Ensembl Gene ID’ and ‘Ensembl Transcript ID’ that are already default selected, for instance ‘Associated Gene Name’ and ‘Description’.

Have a look at a preview of the results (only 10 rows of the results will be shown):

 Click the [Results] button on the toolbar.

If you are happy with how the results look in the preview, output all the results:

 Select ‘View All rows as HTML’ or export all results to a file.

Note: When you select ‘View All rows as HTML’, your results will be shown under a new tab or in a new window in your internet browser.

Although you have filtered for only five genes, your results will contain more than five rows. This is because several of the genes have more than one transcript and consequently the results contain a separate row for each transcript. ______

Exercise 2

BioMart is a very handy tool when you want to map between different databases. The following is a list of 19 accession numbers from the UniProtKB/Swiss-Prot database (http://www.uniprot.org/) of human proteins that are supposedly involved in the sensory perception of pain:

Q99608, P34913, P28482, P28223, Q96LB1, P10997, P01210, P25929, P17481, P43681, P29460, Q9HC23, P20366, Q9Y2W7, Q99572, Q99571, Q00535, P01138, P17787

Generate a list that shows to which Ensembl Gene IDs and to which HGNC symbols these UniProtKB/Swiss-Prot accession numbers map. Also include the gene description.

Hint: Instead of doing a lot of typing (which is not only a waste of time but also introduces the risk of making typos), copy-paste the above list from the digital version of this document, which you can find at http://www.ebi.ac.uk/training/roadshow/110616_copenhagen.html. ______

Answer

 Go to the Ensembl homepage (http://www.ensembl.org).  Click on the ‘BioMart’ link on the toolbar.

 Choose the ‘Ensembl Genes 62’ database.  Choose the ‘Homo sapiens genes (GRCh37.p3)’ dataset.

 Click on ‘Filters’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Select ‘ID list limit – UniProt/Swissprot Accession(s)’.  Enter the list of IDs in the text box (either comma separated or as a list).

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Description’.  Expand the ‘EXTERNAL’ section by clicking on the + box.  Select ‘HGNC symbol’ and ‘UniProt/SwissProt Accession’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Note: BioMart is ‘transcript-centric’, which means that it will often give a separate row of output for each transcript of a gene, even if you don’t include the Ensembl Transcript ID in your output. To get rid of redundant rows, use the ‘Unique results only’ option. Your results should show 19 genes. ______

Exercise 3

The paper ‘Fine mapping of the usher syndrome type IC to chromosome 11p14 and identification of flanking markers by haplotype analysis’ (Ayyagari et al. Mol Vis. 1995 Oct 25;1:2) describes the mapping of the human Usher Syndrome type I C to the genomic region between the markers D11S1397 and D11S1310.

Confirm this finding by generating a list of the genes located in this region. Include the Ensembl Gene ID, name and description. ______

Answer

 Go to the Ensembl homepage (http://www.ensembl.org).  Click on the ‘BioMart’ link on the toolbar.

 Choose the ‘Ensembl Genes 62’ database.  Choose the ‘Homo sapiens genes (GRCh37.p3)’ dataset.

 Click on ‘Filters’ in the left panel.  Expand the ‘REGION’ section by clicking on the + box.  Enter ‘Marker Start: d11s1397’ and ‘Marker End: d11s1310’.

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Associated Gene Name’ and ‘Description’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file.

Your results should show 32 genes. Among these there should be one gene (ENSG00000006611) named ‘USH1C’ with the description ‘Usher syndrome 1C (autosomal recessive, severe) [Source:HGNC Symbol;Acc:12597]’. This confirms that Ayyagari et al. correctly mapped Usher Syndrome type I C to this genomic region. ______

Exercise 4

In the paper ‘Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers’ (Forrest et al. Environ Health Perspect. 2005 June; 113(6): 801–807) the effect of benzene exposure on peripheral blood mononuclear cell gene expression in a population of shoe factory workers with well-characterized occupational exposures was examined using microarrays. The microarray used was the Affymetrix U133A/B GeneChip (also called ‘U133 plus 2’). The top 25 probe sets up-regulated by benzene exposure were:

207630_s_at, 221840_at, 219228_at, 204924_at 227613_at, 223454_at, 228962_at, 214696_at, 210732_s_at, 212371_at, 225390_s_at, 227645_at, 226652_at, 221641_s_at, 202055_at, 226743_at, 228393_s_at, 225120_at, 218515_at, 202224_at, 200614_at, 212014_x_at, 223461_at, 209835_x_at, 213315_x_at

(a) Generate a list of the genes to which these probesets map. Include the Ensembl Gene ID, name and description as well as the probeset name.

(b) As a first step in order to be able to analyse them for possible regulatory features they have in common, retrieve the 250 bp upstream of the transcripts of these genes. Include the Ensembl Gene and Transcript ID, name and description in the sequence header.

(c) In order to be able to study these human genes in mouse, generate a list of the human genes and their mouse orthologs. Include the Ensembl Gene ID for both the human and mouse genes in your list.

(d) Generate the same list as in (c), but now also include the name and description of both the human and mouse genes. As the name and description of the mouse genes are not available as attributes in the Ensembl human genes dataset, you have to add the Ensembl mouse genes as a second dataset. ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Click on the ‘BioMart’ link on the toolbar.

 Choose the ‘Ensembl Genes 62’ database.  Choose the ‘Homo sapiens genes (GRCh37.p3)’ dataset.

 Click on ‘Filters’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Select ‘ID list limit - Affy hg u133 plus 2 ID(s)’.  Enter the list of probeset IDs in the text box (either comma separated or as a list).

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Associated Gene Name’ and ‘Description’.  Expand the ‘EXTERNAL’ section by clicking on the + box.  Select ‘Affy HG U133-PLUS-2’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Your results should show 23 genes. In most cases one probeset maps to one gene. Exceptions are 227613_at and 219228_at, that both map to ENSG00000130844, 212014_x_at and 209835_x_at, that both map to ENSG00000026508, and 213315_x_at, that maps to both ENSG00000197620 and ENSG00000197021. One probeset (226652_at) doesn’t map to an Ensembl gene at all and is therefore not shown in the list of results.

(b)

You can leave the dataset and filters the same, so you can directly specify the attributes:

 Click on ‘Attributes’ in the left panel.  Select the ‘Sequences’ attributes page.  Expand the ‘SEQUENCES’ section by clicking on the + box.  Select ‘Flank (Transcript)’.  Enter ‘250’ in the ‘Upstream flank’ text box.  Expand the ‘Header Information’ section by clicking on the + box.  Select ‘Associated Gene Name’ and ‘Description’.

Note: ‘Flank (Transcript)’ will give the flanks for all the transcripts of a gene with multiple transcripts. ‘Flank (Gene)’ will only give the flank for the transcript with the outermost 5’ (or 3’) end.

 Click the [Results] button on the toolbar.  Select ‘View All rows as FASTA’ or export all results to a file.

(c)

You can leave the dataset and filters the same, so you can directly specify the attributes:

 Click on ‘Attributes’ in the left panel.  Select the ‘Homologs’ attributes page.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Expand the ‘ORTHOLOGS’ section by clicking on the + box.  Select ‘Mouse Ensembl Gene ID’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Your results should show that for 18 out of the 23 human genes one mouse ortholog has been identified, while ENSG00000123130 has two mouse orthologs and ENSG00000172716 has three mouse orthologs. For three human genes (ENSG00000186594, ENSG00000130844 and ENSG00000089335) no mouse ortholog has been identified.

(d)

You can leave the Ensembl human genes dataset and filters the same, so you can directly specify the attributes:

 Click on ‘Attributes’ in the left panel.  Select the ‘Features’ attributes page.  Make sure that in the ‘GENE’ section the attributes ‘Ensembl Gene ID’, ‘Associated Gene Name’ and ‘Description’ are selected.  Deselect ‘Affy HG U133-PLUS-2’ in the ‘EXTERNAL’ section.

Add the Ensembl mouse genes as a second dataset:

 Click on ‘Dataset’ at the bottom of the left panel.  Choose the ‘[Ensembl Genes 62] Mus musculus genes (NCBIM37)’ dataset.

Specify the same attributes for mouse as for human:

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Associated Gene Name’ and ‘Description’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Your results should show the same list as in (c), but now with the name and description added for both the human and mouse genes. ______

Exercise 5

Design your own query! ______

______

VARIATION ______

Exercise 1 – Exploring a SNP

The non-synonymous coding SNP rs1805081 in the human NPC1 (Niemann- Pick disease, type C1) gene has been identified as a genetic risk factor for obesity. This SNP is also referred to as ‘H215R’ or ‘c.644A>G’.

(a) Find the page with information for rs1805081.

(b) Is rs1805081 non-synonymous coding in all transcripts of the NPC1 gene?

(c) Why are the alleles on the Variation page given as T/C and not as A/G, like in the literature and dbSNP (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs1805081)?

(d) What is the major and what the minor allele of rs1805081 in Caucasians?

(e) Why does Ensembl puts the T allele first (T/C)?

(f) In which paper is the association between rs1805081 and obesity described?

(g) According to the data imported from dbSNP the ancestral allele for rs1805081 is T. Ancestral alleles in dbSNP are based on a comparison between human and chimp. Does the sequence at the position of this SNP in the other primates, i.e. gorilla, orangutan, macaque and marmoset, confirm that the ancestral allele indeed is T?

(h) (optional) Were both alleles of rs1805081 already present in Neandertal? To answer this question, have a look at the individual reads at its genomic position in the Neandertal Genome Browser (http://neandertal.ensemblgenomes.org). ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Type ‘rs1805081’ in the ‘Search for’ text box.  Click [Go].  Click on ‘Variation’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ‘dbSNP Variation: rs1805081’.

(b)

 Click on ‘Gene/Transcript’ in the side menu.

No, rs1805081 is only non-synonymous coding in transcripts ENST00000269228 and ENST00000540608.

(c)

On the Variation page the alleles of rs1805081 are given as T/C, because these are the alleles in the forward strand of the genome. In the literature and dbSNP the alleles are given as A/G, because the NPC1 gene is located on the reverse strand of the genome, thus the alleles in the actual gene and transcript sequences are A/G.

(d)

 Click on ‘Population genetics’ in the side menu.

According to the available data from the 1000GENOMES:low_coverage:CEU population and CSHL-HAPMAP:HapMap-CEU populations, the T and C alleles occur at about the same frequency in Caucasians. In Asians and Africans T is the major allele, though.

(e)

In Ensembl the allele that is present in the GRCh37 reference genome assembly is put first, i.e. T. In the literature normally the major allele (in the population of interest) is put first.

(f)

 Click on ‘Phenotype Data’ in the side menu.  Click on ‘pubmed/19151714’ behind ‘Obesity’ in the ‘Phenotype data’ table.

The association between rs1805051 and obesity is described in the paper ‘Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations’ (Meyre et al. Nat Genet. 2009 Feb;41(2):157-9.).

(g)

 Click on ‘Phylogenetic context’ in the side menu.  Select ‘Alignment: 6 primates EPO’.  Click [Go].

Gorilla, orangutan, macaque and marmoset all have a T in this position, which confirms that T is indeed the ancestral allele.

(h)

 Go to the Neandertal Genome Browser (http://neandertal.ensemblgenomes.org).  Type ‘rs1805081’ in the ‘Search Neandertal’ text box.  Click [Go].  Click on ‘rs1805081’ on the page with search results.  Click on ‘Jump to region in detail’.  Click on ‘Configure this page’ in the side menu.  Click on ‘Variation features’.  Select ‘All variations – Normal’.  Click [SAVE and close].  Draw a box of about 50 bp around rs1805081 (shown in yellow in the center of the display).  Click on ‘Jump to region’ on the pop-up menu.

The ‘Sequences’ track shows that there are three reads for Neandertal at the position of rs1805081, all with a T, so based on these (very limited) data there is no evidence that both alleles were already present in Neandertal. ______

Exercise 2 – Comparing a SNP in different individuals

(a) Find the NPC1 (Niemann-Pick disease, type C1) gene for human and go to the ‘Genetic Variation - Comparison image’ page for transcript ENST00000269228. Configure the page to only show non-synonymous coding variants.

(b) Identify the SNP rs1805081 (also referred to as ‘H215R’ or ‘c.644A>G’). Do individuals NA12878, NA12891 and NA12892 (the child, father and mother from the Caucasian trio that were sequenced at high coverage in Pilot 2 of the 1000 Genomes Project) have resequencing information in this region?

(c) What are the genotypes of these three individuals for rs1805081?

(d) Show the resequencing alignment for the 1000 bp region around rs1805081. What does the Y at the position of this SNP in the sequence of individual NA12892 mean?

(e) (optional) Have a look at the individual reads for individual NA12892 in the region of rs1805081 in the 1000 Genomes Browser (http://browser.1000genomes.org). Can you see from the reads that this individual is heterozygous for rs1805081? ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘npc1’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ‘NPC1 [ Ensembl/Havana merge gene: ENSG00000141458 ]’.  Click on ‘ENST00000269228’.  Click on ‘Genetic Variation - Comparison image’ in the side menu.  Click [Configure this page] in the side menu.  Click on ‘Consequence type’.  Select ‘Select/deselect all’.  Select ‘Non-synonymous’.  Click ().

(b)

 Click on the yellow box with ‘H/R’ in it.

Yes, all three individuals have resequence coverage of 1 at this position, as indicated by the light grey bar shown below the transcript.

(c)

 Click on the green/purple boxes for rs1805081 for NA12878, NA12891 and NA12892.

The genotype of NA12878 (child) and NA12891 (father) for rs1805081 is C|C (purple), while the genotype of NA12892 (mother) is C|T (green/purple). Note that the (haploid) GRCh37 reference assembly has the T allele (green).

(d)

 Click on the yellow box with ‘H/R’ in it or the open yellow box for rs1805081 at the bottom of the figure.  Click on ‘rs1805081 properties’ in the pop-up menu.  Click on ‘View in location tab’.  Click on ‘Genetic Variation - Resequencing’ in the side menu.  Click ().

Y is the IUPAC (http://www.bioinformatics.org/sms/iupac.html) or ambiguity code for a C or a T (both of which are pYrimidines). Note that the resequencing alignment shows the sequence of the forward strand of the genome assembly. If you want the reverse strand instead, you can reconfigure the page using [Configure this page].

(e)

 Go to the 1000 Genomes Browser (http://browser.1000genomes.org).  Type ‘rs1805081’ in the ‘Search 1000 Genomes’ text box.  Click [Go].  Click on ‘rs1805081’ on the page with search results.  Click on ‘View in location tab’.  Click [Configure this page] in the side menu.  Click on ‘1000 Genomes BAM’.  Select ‘NA12892-SLX – Unlimited’.  Click ().  Draw a box of about 50 bp around rs1805081 (shown in yellow in the center of the display).  Click on ‘Jump to region’ on the pop-up menu.

The ‘NA12892-SLX’ track shows the individual reads for NA12892 in grey, while the consensus sequence derived from the reads is shown at the top in colour. Eight of the 16 reads show a T at the position of rs1805081 and eight show a C, confirming the heterozygosity of this individual for this SNP. ______

Exercise 3 – Variant Effect Predictor

Resequencing of the genomic region of the human CFTR (cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)) gene (ENSG00000001626) has amongst others revealed the following variants (alleles defined in the forward strand):

G/A at 7:117,171,039 T/C at 7:117,171,092 T/C at 7:117,171,122

(a) Determine if these variants result in a change in the proteins encoded by any of the Ensembl transcripts of the CFTR gene. Which of these variants is the most damaging / deleterious according to the SIFT and PolyPhen tools? Have the variants already been annotated by Ensembl?

(b) Have a look at the uploaded variants in ‘Region in detail’. ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Click on the ‘Tools’ link on the toolbar.  Click on ‘Variant Effect Predictor Online version’.  Enter ‘My variants’ in the ‘Name for this upload (optional)’ text box.  Enter the following list in the ‘Paste file’ text box (be sure to separate the five values on each row by spaces):

7 117171039 117171039 G/A + 7 117171092 117171092 T/C + 7 117171122 117171122 T/C +

 Select ‘SIFT predictions: Prediction only’.  Select ‘PolyPhen predictions: Prediction only’.  Click [Next>].  Click on ‘HTML’.

Two of the variants are non-synonymous coding in three of the encoded proteins and thus cause an amino acid change. One variant is synonymous coding and thus doesn’t change the amino acid sequence. Of the two non- synonymous coding variants, the one at position 117171092 seems to be more deleterious / damaging according to SIFT and PolyPhen than the one at position 117171122. All variants have already been annotated by Ensembl as can be seen from the dbSNP accession numbers that are shown in the ‘Co- located Variation’ column.

(b)

 Click for one of the variants on the coordinates in the ‘Location’ column on the ‘Variant Effect Predictor Results’ page.

The uploaded variants should be shown on ‘Region in detail’ in a new track, named ‘My variants’.

 Draw with your mouse a box around the uploaded variants.  Click on ‘Jump to region’ in the pop-up menu.

It can be clearly seen that all three uploaded variants correspond to an already annotated variant shown in the ‘Sequence variants (all sources)’ track, one synonymous coding (green) and two non-synonymous coding (yellow). ______

Exercise 4 – Structural variation

In the paper ‘The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility’ (Gonzalez et al. Science. 2005 Mar 4;307(5714):1434-40.) it is shown that a higher copy number of the CCL3L1 (Chemokine (C-C motif) ligand 3-like 1) gene is associated with lower susceptibility to HIV infection.

(a) Find the genomic region of the human CCL3L1 gene.

(b) Have any CNVs been identified in this region? Note: CNVs are in Ensembl classified as structural variants.

(c) Have any CNVs been annotated in this region by the Database of Genomic Variants (DGV)? ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘ccl3l1’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ’17:34623842-34625731:-1’ below ‘CCL3L1 [ Ensembl/Havana merge gene: ENSG00000205021 ]’.

(b)

 Click [Configure this page] in the side menu.  Type ‘structural’ in the ‘Find a track’ text box.  Select ‘Structural variants (all sources)’.  Click ().

Yes, CNVs have been identified in this region by multiple studies, as indicated by the various blue bars in the ‘Structural variants (all sources)’ track.

(c)

 Click [Configure this page] in the side menu.  Type ‘dgv in the ‘Find a track’ text box.  Select ‘DGV loci – (force) Labels’.  Click ().

Yes, the Database of Genomic Variants has annotated many CNVs in this region.

To obtain detailed information on a CNV:

 Click on the CNV.  Click on ‘DGV’ in the pop-up menu. ______

Exercise 5 – Locus Reference Genomic

The human CTSC (Cathepsin C) gene is among the genes for which there exists a Locus Reference Genomic (LRG).

(a) Find the CTSC gene for human and find the LRG for this gene.

(b) To which of the Ensembl/Havana transcripts is the LRG gene most identical?

(c) Does the LRG sequence differ from the sequence of the GRCh37 reference assembly?

(d) Does Ensembl show LRGs for other genes than CTSC? ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘ctsc’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ‘CTSC [ Ensembl/Havana merge gene: ENSG00000109861 ]’.  Click on ‘LRG_50’ on the ‘Gene summary’ page.

(b)

The LRG gene, shown in lilac, is most identical to the transcript CTSC-001 (ENST0000227266), but it is slightly longer (1907 bp compared to 1903 bp) and differs at the 5’ and 3’UTR (the 5’UTR of the LRG is 14 bp shorter while the 3’UTR is 18 bp longer). The latter is not apparent from the figure, but can be seen when the actual sequences are compared.

(c)

 Click on ‘Reference comparison’ in the side menu.

Yes, there is one difference with the GRCh37 reference assembly, at position 30359 of the LRG (T>C).

(d)

 Click on ‘All LRGs’ in the side menu.

Yes, currently Ensembl has LRGs for close to 200 genes. Many more are pending though, as can be seen on this page on the LRG website: http://www.lrg-sequence.org/page.php?page=view_all_lrgs. ______

Exercise 6 – BioMart

In the paper ‘Identification of seven new prostate cancer susceptibility loci through a genome-wide association study’ (Eeles et al. Nat Genet. 2009 Oct;41(10):1116-21.) the following twelve SNPs are shown to be associated with prostate cancer susceptibility: rs10993994, rs2735839, rs4242384, rs6983267, rs7931342, rs7501939, rs9364554, rs6465657, rs5945619, rs2660753, rs1016343, rs1859962.

Use BioMart to generate a list of the genes to which these SNPs map. Include the Ensembl Gene ID, name and description.

Hint: You should start with the Ensembl Variation Mart. To be able to include the gene name and description, add the Ensembl human genes as a second dataset. ______

Answer

 Go to the Ensembl homepage (http://www.ensembl.org).  Click on the ‘BioMart’ link on the toolbar.

 Choose the ‘Ensembl Variation 62’ database.  Choose the ‘Homo sapiens Variation (dbSNP132;ENSEMBL)’ dataset.

 Click on ‘Filters’ in the left panel.  Expand the ‘GENERAL VARIATION FILTERS’ section by clicking on the + box.  Enter the list of SNP IDs in the ‘Filter by Variation ID’ text box (either comma separated or as a list).

Add the Ensembl human genes as a second dataset:

 Click on ‘Dataset’ at the bottom of the left panel.  Choose the ‘[Ensembl Genes 62] Homo sapiens genes (GRCh37.p3)’ dataset.

 Click on ‘Attributes’ in the left panel.  Expand the ‘GENE’ section by clicking on the + box.  Deselect ‘Ensembl Transcript ID’.  Select ‘Description’.  Expand the ‘EXTERNAL’ section by clicking on the + box.  Select ‘HGNC symbol’.

 Click the [Results] button on the toolbar.  Select ‘View All rows as HTML’ or export all results to a file. Tick the box ‘Unique results only’.

Your results should show that eight out of the twelve SNPs map to an Ensembl gene, seven of which have an HGNC symbol assigned. ______

______

COMPARATIVE GENOMICS ______

Exercise 1 – Orthologs, paralogs and genetrees

The photoreceptor cells in the retina of the human eye contain a number of different photoreceptors. The rod cells contain rhodopsin, which is responsible for monochromatic vision in the dark. The cone cells all contain one of three types of opsins, which respond to long-wave (red), medium-wave (green) and short-wave (blue) light, respectively, and are responsible for trichromatic color vision (see for instance http://en.wikipedia.org/wiki/Opsin).

(a) Find the gene encoding the long-wave-sensitive (red) opsin.

(b) How many within-species paralogs have been identified for this gene? Note the ‘Target %id’ and ‘Query %id’. Which paralog has the most sequence similarity with the long-wave-sensitive opsin?

(c) Have a look at the genomic location of the long-wave-sensitive (red), medium-wave-sensitive (green) and short-wave-sensitive (blue) opsin genes. Does this explain why red-green color blindness is much more prevalent in males than in females (e.g. in the US population 7% vs 0.4%)?

(d) Have a look at the gene tree for the long-wave-sensitive opsin gene. Which of its paralogs is due to the most recent duplication event? Is this reflected in the sequence similarity between the long-wave-sensitive opsin and this paralog when compared with the other paralogs (see question b)? On which taxonomic level did this duplication take place?

(e) Retrieve an alignment between the long-wave-sensitive opsin and all its paralogs in Jalview. To this end, select all aligned protein sequences using ‘Select > Select all’, then order them by using ‘Calculate > Order > by ID’, subsequently select all human protein sequences and finally use ‘Select > Invert Sequence Selection’ and ‘Edit > Delete’ to delete all non-human protein sequences. ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘long wave sensitive opsin’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ‘OPN1LW [ Ensembl/Havana merge gene: ENSG00000102076 ]’.

Note that the ‘LW’ in the gene symbol OPN1LW stands for ‘long-wave’.

(b)

 Click on ‘Comparative Genomics - Paralogues’ in the side menu.

There have nine within-species paralogs been identified for the human long- wave-sensitive (red) opsin gene. ENSG00000147380 (OPN1MW) and ENSG00000166160 (OPN1MW2), the genes encoding the medium-wave- sensitive (green) opsins, show the highest Target %id and Query %id. The medium-wave-sensitive opsins thus have the highest sequence similarity to long-wave-sensitive opsin (Target %id indicates the percentage of the sequence of long-wave-sensitive opsin matching the sequence of the paralog protein. Query %id indicates the percentage of the sequence of the paralog protein matching the sequence of long-wave-sensitive opsin).

(c)

 Click on ‘ENSG00000147380’, ‘ENSG00000166160’ and ‘ENSG00000128617’.

The genes for the long-wave-sensitive (red) and medium-wave-sensitive (green) opsins are located next to each other on the X chromosome, while the gene for the short-wave-sensitive (blue) opsin is located on . As females have two X a normal gene on one chromosome can often make up for a defective one on the other, whereas males cannot make up for a defective gene. Thus, red-green color blindness is much more prevalent in males than in females. Variation in the genes for long-wave- sensitive and medium-wave-sensitive opsin can cause subtle differences in color perception, while tandem rearrangements due to unequal crossing-over between these genes cause more serious defects in color vision.

(d)

 Click on ‘ENSG00000102076’ in case you are not already on the ‘Gene: OPN1LW’ tab.  Click on ‘Comparative Genomics - Gene Tree (image)’ in the side menu.  Click on ‘View options – View paralogs of current gene’ below the gene tree image.  Click on the nodes (red squares) for the duplication events that have given rise to the various paralogs.

A duplication event on the level of the Catarrhini (Old World monkeys and apes) has given rise to the long-wave-sensitive and medium-wave-sensitive opsins. The other paralogs are due to earlier duplication events. This agrees with the fact that the medium-wave-sensitive opsins show the highest sequence similarity with long-wave-sensitive opsin (see question b) and the fact that the genes for the long-wave-sensitive and medium-wave-sensitive opsins are located close to each other (see question c).

(e)

 Click on the speciation node (blue square) that is at the base of the complete gene tree.  Click on ‘Expand for Jalview’ in the pop-up menu (that should say ‘Taxon: Coelomata’).  Click [Start Jalview].  Close the pop-up window with the gene tree.  Click on ‘Select > Select all’ on the menu bar of the pop-up window with the protein sequence alignment.  Click on ‘Calculate > Sort > by ID’ on the menu bar.  Select the protein sequences of the seven human paralogs by clicking on them while holding the [Shift] key.  Click on ‘Select > Invert Sequence Selection’ on the menu bar.  Click on ‘Edit > Delete’ on the menu bar.

As the alignment is based on the complete set of protein sequences in the gene tree, the alignment of this subset of nine proteins will contain empty columns. These can be removed using the option ‘Edit > Remove Empty Columns’ on the menu bar.

 Click on ‘Edit > Remove Empty Columns’ on the menu bar. ______

Exercise 2 – Whole genome alignments

(a) Find the Ensembl BRCA2 (Breast cancer type 2 susceptibility protein) gene for human and go to the ‘Region in detail’ page.

(b) Turn on the ‘BLASTZ alignment’ tracks for chicken, chimp, mouse and platypus and the ‘Translated BLAT alignment’ tracks for anole lizard and zebrafish. Does the degree of conservation between human and the various other species reflect their evolutionary relationship? Which parts of the BRCA2 gene seem to be the most conserved? Did you expect this?

(c) Have a look at the ‘Conservation score’ and ‘Constrained elements’ tracks for the set of 35 mammals and the set of 19 vertebrates. Do these tracks confirm what you already saw in the tracks with pairwise alignment data?

(d) Retrieve the genomic alignment for a constrained element. Highlight the bases that match in >50% of the species in the alignment.

(e) Retrieve the genomic alignment for the BRCA2 gene for the primates. Highlight the bases that match in >50% of the species in the alignment. ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘brca2 ’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ’13:32889611-32973805:1’ below ‘BRCA2 [ Ensembl/Havana merge gene: ENSG00000139618 ]’.

You may want to turn off all tracks that you added to the display in the previous exercises as follows:

 Click [Configure this page] in the side menu.  Click ‘Reset configuration for Main panel to default settings’.  Click ().

(b)

 Click [Configure this page] in the side menu  Click on ‘BLASTZ/LASTz alignments’.  Select ‘Chicken (Gallus gallus) - BLASTZ_NET – Normal’, ‘Chimpanzee (Pan troglodytes) – BLASTZ_NET – Normal’, ‘Mouse (Mus musculus) – BLASTZ_NET - Normal’ and ‘Platypus (Ornithorhynchus anatinus) - BLASTZ_NET - Normal’.  Click on ‘Translated blat alignments’.  Select ‘Anole Lizard (Anolis carolinensis) - TRANSLATED_BLAT_NET - Normal’ and ‘Zebrafish (Danio rerio) - TRANSLATED_BLAT_NET – Normal’.  Click ().

Yes, the degree of conservation does reflect the evolutionary relationship between human and the other species; the highest degree of conservation is found in chimp, followed by mouse, platypus, chicken, lizard and zebrafish. Especially the exonic sequences of BRCA2 seem to be highly conserved between the various species, which is what is to be expected because these are supposed to be under higher selection pressure than intronic and intergenic sequences.

(c)

 Click [Configure this page] in the side menu  Click on ‘Multiple alignments’.  Select ‘Conservation score for 35 eutherian mammals EPO_LOW_COVERAGE’, ‘Constrained elements for 35 eutherian mammals EPO_LOW_COVERAGE’, ‘Conservation score for 19 amniota vertebrates Pecan’ and ‘Constrained elements for 19 amniota vertebrates Pecan’.  Click ().

The ‘Conservation score’ and ‘Constrained elements’ tracks largely correspond with the data in the pairwise alignment tracks; all exons of the BRCA2 gene show a high degree of conservation. Note that the UTRs for the largest part don’t, though.

(d)

 Click on a constrained element.  Click on ‘View alignments (text)’ in the pop-up menu.  Click [Configure this page] in the side menu.  Select ‘Conservation regions: All conserved regions’.  Click ().

(e)

 Click on the ‘Gene: BRCA2’ tab.  Click on ‘Comparative Genomics – Genomic alignments’ in the side menu.  Select ‘Alignment: 6 primates EPO’.  Click [Go].  Click [Configure this page] in the side menu.  Select ‘Conservation regions: All conserved regions’.  Click (). ______

Exercise 3 – Synteny

(a) Find the Ensembl CYCS (Cytochrome c) gene for human.

(b) Is there an ortholog predicted for the human CYCS gene in horse?

(c) Have for both possible orthologs a look at the ‘Multi-species view’ page. Based on synteny information, which of the two seems the best candidate to be the horse cytochrome c gene?

(d) Does the annotation (e.g. general identifiers, GO terms, protein domains) of this gene confirm that this is the horse cytochrome c gene? ______

Answer

(a)

 Go to the Ensembl homepage (http://www.ensembl.org).  Select ‘Search: Human’ and type ‘cycs’ in the ‘for’ text box.  Click [Go].  Click on ‘Gene’ on the page with search results.  Click on ‘Homo sapiens’.  Click on ‘CYCS [ Ensembl/Havana merge gene: ENSG00000172115 ]’.

(b)

 Click on ‘Comparative Genomics - Orthologues’ in the side menu.  Type ‘horse’ in the ‘Filter’ text box at the top of the ‘Orthologues’ table.

There are two orthologs predicted for the human CYCS gene in horse, ENSECAG00000014330 (an apparent 1-to-1 ortholog) and ENSECAG00000016170 (a possible ortholog).

(c)

 Click on ‘Multi-species view’ for ENSECAG00000014330.  Click on the Back button of the browser to get back to the ‘Orthologues’ page.  Click on ‘Multi-species view’ for ENSECAG00000016170.

In human the CYCS gene is located downstream of the MPP6, DFNA5 and OSBPL3 genes. In horse, ENSECAG00000014330 (CYC_HORSE) is also located downstream of these three genes, while ENSECAG00000016170 (LOC100053740) isn’t. Based on this information, ENSECAG00000014330 seems the best candidate to be the horse cytochrome c gene.

(d)

 Click on the Back button of the browser to get back to the ‘Orthologues’ page.  Click on ‘ENSECAG00000014330’.  Click on the ‘Transcript: CYC_HORSE’ tab.  Click on ‘External References - General identifiers’ in the side menu.  Click on ‘Ontology – Ontology table’ in the side menu.  Click on ‘Protein Information - Domains & features’ in the side menu.

The annotation confirms that ENSECAG00000014330 indeed is the horse cytochrome c gene.

Note: That ENSECAG00000014330 is only recognised as an apparent 1-to-1 ortholog of the human CYCS gene in the Ensembl ortholog/paralog prediction pipeline is due to the fact that the cytochrome c protein is very conserved between different species. Due to this high conservation, the amount of phylogenetic signal present in this gene family is low, which makes the tree topology determination less accurate. This makes certain orthologous/paralogous relationships difficult to infer. ______

Exercise 4 – Pan-taxonomic compara

(a) Use the Ensembl Genomes browser (http://www.ensemblgenomes.org) to find the fumarase gene of Saccharomyces cerevisiae.

(b) Is this gene conserved across all the kingdoms, i.e. animals, plants, fungi, protists and prokaryotes? Is that what you would expect? ______

Answer

(a)

 Go to the Ensembl Genomes homepage (http://www.ensemblgenomes.org).  Click on ‘Fungi’ in the menu bar.  Select ‘Search: Saccharomyces cerevisiae’ and type ‘fumarase’ in the ‘for’ text box.  Click [Go].  Click on ‘YPL262W (SGD: FUM1)’ on the page with search results.

(b)

 Click on ‘Pan-taxonomic Compara – Gene Tree (image)’ in the side menu.  Click on ‘View fully expanded tree’ below the gene tree image.

There are 52 orthologs predicted for the fumarase gene, covering all kingdoms, from bacteria to human. This is what you would expect as fumarase catalyzes a step in the citric acid cycle (TCA cycle), a process that is of central importance in all aerobic organisms. ______