Function Prediction

BCB 444/544 Fall 06 Sep 14 Lab 4 p. 1

BCB 444/544 Lab 5 (Sept 27)

Function Prediction Due Mon Oct 1 by 5 PM - email to [email protected]

Objectives

1. Predict genes in a DNA sequence (ORF Finder at NCBI, GeneMark).
2. Use web-based tools to predict protein family/function and understand how they work.
3. Refresh skills we have developed in previous labs.

Introduction

In this lab you will use sequence-based approaches to predict genes and protein function. Gene prediction works best with an ensemble of techniques including ORF scanning, intron identification, transcription factor binding site identification, and codon bias. There are multiple approaches to the function prediction problem, including sequence alignments, motif identification, weight matrices, and Hidden Markov Models. After completing this lab you should understand each of these methods and be familiar with online tools that implement them. We assume knowledge of activities from previous labs, such as sequence retrieval and sequence alignment via BLAST.

Exercises

Gene Prediction

Part I – ORF Finder

We’ll start our gene prediction journey with the simple task of scanning a genome for open reading frames. An open reading frame, or ORF, is the DNA sequence located between a start codon and a stop codon. Open reading frames represent potential genes. Since each codon consists of three nucleotides, each ORF must have a length divisible by three. By the same token, there are three possible reading frames for a strand of DNA. A fourth reading frame would be in line with the first reading frame. Since DNA is double stranded, we have a total of 6 reading frames (3 on each strand).

The three possible reading frames begin with the codons ATG, TGG, and GGT. Note: GTA is in the same reading frame as ATG.
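The six-frame scan described above can be sketched in a few lines of Python. This is only an illustration of the idea, not NCBI's implementation: it assumes ATG as the only start codon, and the function names (`find_orfs`, `orfs_in_frame`) are made up for this example.

```python
# Minimal six-frame ORF scan (illustrative sketch, not ORF Finder itself).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, offset):
    """Yield (start, end) indices of ATG...stop ORFs in one reading frame."""
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first start codon opens the ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3  # end index includes the stop codon
            start = None


def find_orfs(dna, min_length=100):
    """Scan all six frames; return (frame, start, end, length), longest first.

    Coordinates on the minus strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset):
                if end - start >= min_length:
                    results.append((f"{strand}{offset + 1}", start, end, end - start))
    return sorted(results, key=lambda r: r[3], reverse=True)


print(find_orfs("CCATGGTATAAGG", min_length=6))  # [('+3', 2, 11, 9)]
```

Because every ORF is read codon by codon from its start, each reported length is automatically divisible by three.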

Let’s use ORF Finder at NCBI http://www.ncbi.nlm.nih.gov/gorf/gorf.html to locate all the possible open reading frames. We’re going to locate all the open reading frames in the Human Immunodeficiency Virus (HIV). Its GenBank accession number is NC_001802. We can either enter the accession number or find the sequence ourselves and enter it directly. For now, enter the accession number and click the “OrfFind” button. The default settings show us all open reading frames greater than 100 bases in length.

1.) How long is the longest open reading frame detected in HIV?

Answer: 2739

2.) Assuming this is a valid ORF with no introns, how long would the protein produced by this ORF be?

Answer: 2739/3 - 1 (stop codon has no AA) = 912 amino acids
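The arithmetic in this answer can be checked with a short function. It assumes, as the answer does, that the reported ORF length includes the stop codon; the name `protein_length` is chosen just for this sketch.

```python
def protein_length(orf_nucleotides):
    """Amino acids encoded by an ORF whose length includes the stop codon."""
    if orf_nucleotides % 3 != 0:
        raise ValueError("ORF length must be divisible by three")
    # One codon per amino acid, minus the stop codon, which encodes none.
    return orf_nucleotides // 3 - 1


print(protein_length(2739))  # 912
```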

Change the minimum displayed ORF length by changing the drop down box from 100 to 50 and then clicking the “Redraw” button.

3.) Do you think most of these new ORF’s are translated into proteins? Why or why not?

Answer: No. They are short with ~120 nucleotides which would result in proteins with ~39 amino acids. Most proteins are much larger with at least several hundred amino acids.

4.) Take note of the frames that these genes are found in. What do you think the + and the – stand for?

Answer: They indicate which strand the ORF was found on.

Let’s assume we think the longest ORF is a real gene. Click either on the highlighted ORF in the 2D chromosomal picture or on the colored box next to the ORF in the ORF list. We can either accept this ORF as an actual gene or select the “Alternative Initiation Codons” button. Although we learned that open reading frames start with ATG, Biology has another rule that takes precedence over all other rules: “There is an exception to every rule except this one.”

5.) Select the “Alternative Initiation Codons” button. The gene now starts with what earlier codon?

Answer: CTG (Leucine)

Part II – GeneMark

ORF’s go a long way toward helping us identify potential genes. However, not all ORF’s have the appropriate transcription factor binding sites to drive transcription. Also, prediction gets much more challenging when we consider alternative start codons, introns, and alternative splicing. Thankfully there are plenty more tricks we can pull out of our bag.

Nearly all eukaryotic introns begin with GT and end with AG; this is known as the GT-AG rule. We can also take advantage of something known as codon bias. There are 20 amino acids and 64 possible codons (4^3). This means that many amino acids are represented by multiple codon triplets. However, different species use certain codons preferentially because of their specific cellular machinery. If we know how often a given triplet is used in known genes for that organism, we can assign a probability to each codon. We can then look at these probabilities across the entire ORF to assess whether it is likely to be a gene in this organism. We can also expand our search to scan for transcription factor binding sites upstream of the ORF. By combining all of these methods and several other techniques we can do a much better job of predicting genes than by simply looking at all ORF’s.

There are many programs out there that take advantage of different combinations of this information. One such program, GeneSeqer, was developed by Volker Brendel, an Iowa State professor, and is accessible through his site at http://www.plantgdb.org/ . Feel free to go through their tutorial after you have completed the lab. In this lab, we will be using the GeneMark program.

Go to http://www.ncbi.nlm.nih.gov and click on the “Genomic Biology” link on the left side of the screen. Here we can access whole genome sequences. On the right side of the screen under Organism-Specific select the G for Genome Resources next to Human.
There should be a figure with the human chromosomes in the upper left hand corner. Click on chromosome 1 to go to a map of that chromosome. Click on the “Download/View Sequence/Evidence” link in the upper right corner. The chromosome is very large and is broken into several files. We just want ~50,000 bases of the chromosome for this lab. Change the range values to display the sequence from 50000 to 100000, then click the “Change Region/Strand” button. A new view showing the desired region should appear below. Click display and select FASTA format. Now copy this shortened FASTA sequence, go to http://opal.biology.gatech.edu/GeneMark/genemark24.cgi and paste in the copied sequence. Under running options change the species to H. sapiens. Under output options check “Generate PDF graphics (screen)”. Click the “Start GeneMark” button, then click on “View PDF Graphical Output”.
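The codon bias idea from Part II (estimate codon frequencies from known genes, then score a candidate ORF) can be sketched as follows. This is a toy log-odds score against uniform codon usage, not the model GeneMark actually uses; the function names and the pseudocount of 1 are assumptions made for this example.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]  # 4^3 = 64


def codons(seq):
    """Split a sequence into complete codons, dropping any trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_frequencies(known_genes, pseudocount=1.0):
    """Estimate codon usage from known genes; the pseudocount avoids zeros."""
    counts = Counter()
    for gene in known_genes:
        counts.update(codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def codon_bias_score(orf, freqs):
    """Average log2 odds per codon versus uniform usage (1/64).

    Positive values mean the ORF prefers the same codons as the known genes.
    """
    usable = [c for c in codons(orf) if c in freqs]  # skip codons with N, etc.
    if not usable:
        return 0.0
    return sum(math.log2(freqs[c] * len(ALL_CODONS)) for c in usable) / len(usable)


# Hypothetical "known genes" that strongly favor the codon GCC.
training = ["ATGGCCGCCGCCTAA", "ATGGCCGCCTGA"]
freqs = codon_frequencies(training)
print(codon_bias_score("GCCGCCGCC", freqs) > codon_bias_score("TTTTTTTTT", freqs))  # True
```

A real gene finder would train on thousands of genes from the target organism and combine this signal with others, such as splice sites and upstream binding sites.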

6.) In which regions is the most probable gene?

Answer: 14,800 - 16,600

7.) In what reading frame is the most probable gene?

Answer: Direct Strand (+1)

Function Prediction

Part III - BLAST - Alignments

Last lab we used sequence alignments to find proteins with sequence composition similar to our own. It is a common assumption that proteins of similar sequence share similar function.

Protein sequence determines protein structure, and the combination of structure and sequence determines function. Your first instinct may be to infer directly that sequence determines function, and this is largely true. So why not drop “sequence infers structure” and skip directly to “sequence infers function”? It is worded this way because multiple diverse sequences may result in the same structure. In this case, proteins with vastly different sequence compositions may share similar functions. We state that sequence and structure infer function because structure alone is not enough to produce an actively interacting protein. The interacting regions must have the proper biochemical properties as well; therefore the sequence at these interacting regions is important.

We will start today’s lab with the basic assumption that sequence infers function. Let’s return to NCBI and retrieve the protein sequence NP_031939. Now BLAST the protein sequence against all Fungi.

8.) How many sequences do you find?

Answer: 243

Many of these sequences will be unnamed or hypothetical, meaning their function or actual existence as a protein is unknown. Several hits with good E-values do have a known function.

9.) Do you think this protein shares similar function with the other proteins? Why? (Hint: look at the picture of the aligned regions near the top of the page.)

Answer: No - Only one small region of the larger protein matches. It may share partial function via a related domain.

10.) Looking at the best match, what positions in the query sequence align to the target sequence?

Answer: ~334 - ~419

At the top of the screen, click on the “Show Conserved Domains” link in the upper right corner.

11.) Do these domains appear to correlate with the conserved region identified in the BLAST search?

Answer: Yes

Now repeat the BLAST search; however, this time BLAST against the entire NR database (not just Fungi). There should be many hits this time. Look at which regions match this time.

12.) Based solely on the description of the protein hits from this BLAST, what would you predict the function of this entire protein to be?

Answer: The entire protein sequence matches other proteins, which are labeled only as zinc finger proteins. One could easily interpret this as meaning that the entire protein has only this function.

13.) How does this contradict what we know to be the case? (Hint: Look at the answers to questions 10 and 11)

Answer: We know that only a small region of the protein has zinc finger function, not the entire protein, but here the entire protein sequence matches a protein that is labeled as a zinc finger protein.

14.) What potential problems could you see in trying to identify protein function strictly by sequence alignment in the following scenario: A second protein exists with sequence related to only the first 200 amino acids of this protein, and the BLAST results give this protein as the top hit?

Answer: The whole protein could be annotated with a function that actually belongs to a different domain. (For example, Protein 1 has domain A and domain AB. Protein 2 has domain AB and domain B. Proteins 1 and 2 would be identified by BLAST as related because they share domain AB. If protein 1 is labeled with the function of domain A, one could mistakenly assign that function to protein 2.)

Now imagine this incorrect label propagating to several other related proteins. A user could blast a sequence and get a number of hits as matches, all of which have incorrect labels stemming from one original error. Unfortunately this scenario does happen. We must be careful when assessing function via alignments.
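One simple guard against this kind of error is to check how much of the query an alignment actually covers before transferring an annotation. The sketch below uses the aligned range from question 10 (~334 to ~419). The query length of 500 is a hypothetical stand-in, since the length of NP_031939 is not given in this handout, and the 0.7 threshold is an arbitrary choice for illustration.

```python
def query_coverage(query_length, aligned_ranges):
    """Fraction of query residues covered by at least one aligned range.

    Ranges are 1-based and inclusive, as in BLAST output.
    """
    covered = set()
    for start, end in aligned_ranges:
        covered.update(range(start, end + 1))
    return len(covered) / query_length


def annotation_transfer_is_risky(query_length, aligned_ranges, min_coverage=0.7):
    """Flag hits that match only part of the query (threshold is an assumption)."""
    return query_coverage(query_length, aligned_ranges) < min_coverage


# Hypothetical query length of 500 residues; aligned region from question 10.
print(query_coverage(500, [(334, 419)]))                # 0.172
print(annotation_transfer_is_risky(500, [(334, 419)]))  # True
```

A hit that covers less than a fifth of the query supports a shared domain at most, not a shared overall function.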

Part IV – PROSITE – Motifs & Weight Matrices

In lecture we have already been introduced to protein motifs: recurring patterns of amino acids that denote a characteristic such as a structure or function. Identifying protein function by scanning for known protein motifs can prove more reliable than sequence alignments. Go to the PROSITE website at http://ca.expasy.org/prosite/. Read the PROSITE User Manual Introduction under Documents. Under the Tools for PROSITE section, submit the NP_031939 sequence we explored above with BLAST.

15.) How many hits were found for this sequence?

Answer: 6

16.) What conditions are met for the profile identified?

Answer: 3=C 6=C 19=H 23=H

17.) What is better evidence of function – a profile or a pattern? What is the difference between the two?

Answer: Patterns show greater confidence of function. Profiles are used to show relationships between sequences that have more evolutionary divergence. Patterns have stricter sequence requirements. They are also used to recognize sequences that have less evolutionary divergence. Both properties indicate less functional divergence.
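The strictness of a pattern can be made concrete by translating PROSITE-style pattern syntax into a regular expression: a sequence either matches or it does not. The converter below is a sketch that handles only the most common elements (`x`, single residues, `[...]`, `{...}`, and repeat counts). The C2H2-style pattern and the test peptide are written for illustration and should be checked against the PROSITE database before being relied on.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."  # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # any amino acid except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(residue + repeat)
    return "".join(parts)


# Illustrative C2H2 zinc finger style pattern (verify against PROSITE before use).
c2h2 = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(c2h2)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

peptide = "YKCPECGKSFSQKSDLVKHQRTH"  # made-up consensus-like peptide
print(bool(re.search(c2h2, peptide)))                   # True
print(bool(re.search(c2h2, peptide[:-1] + "A")))        # False: one change breaks the match
```

A single substitution at a required position abolishes the match, which is why a pattern hit is strong evidence. A profile would instead lower the score gradually, letting it detect more divergent relatives.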

Part V – SuperFam – Hidden Markov Models (HMM)

Read “A concrete Example” of an HMM at http://en.wikipedia.org/wiki/Hidden_Markov_Models . Then go to the SUPERFAMILY website at http://supfam.org/SUPERFAMILY/hmm.html . This website uses pre-trained HMM’s to determine what family a protein falls into. Again paste the NP_031939 FASTA sequence into the sequence window. Change the notification to Browser and click submit. When “YOUR RUN IS COMPLETE” appears, click “here” to see the output.

18.) What domains does SUPERFAM identify?

Answer: C2H2 zinc fingers

There are alternative programs that predict superfamily based on HMM’s. Each program may give varying results depending on how it was trained.
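To get a feel for how an HMM assigns hidden states to a sequence, here is a small Viterbi decoder for a two-state toy model. It is far simpler than the profile HMMs used by SUPERFAMILY. The states, the reduction of residues to two symbols ("m" for Cys/His, "o" for other), and every probability below are invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden state path for a non-empty observation sequence.

    Works in log space; assumes all probabilities are nonzero.
    """
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]  # back[t][s] = best previous state when in state s at time t
    for obs in observations[1:]:
        row, pointers = {}, {}
        for s in states:
            prev, best = max(
                ((p, scores[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            row[s] = best + math.log(emit_p[s][obs])
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model with made-up parameters: a "finger" state rich in Cys/His.
STATES = ("background", "finger")
START = {"background": 0.9, "finger": 0.1}
TRANS = {"background": {"background": 0.9, "finger": 0.1},
         "finger": {"background": 0.1, "finger": 0.9}}
EMIT = {"background": {"m": 0.05, "o": 0.95},
        "finger": {"m": 0.4, "o": 0.6}}


def to_symbols(protein):
    """Reduce residues to 'm' (Cys or His) or 'o' (anything else)."""
    return ["m" if aa in "CH" else "o" for aa in protein.upper()]


print(viterbi(to_symbols("CHCH"), STATES, START, TRANS, EMIT))
# ['finger', 'finger', 'finger', 'finger']
```

Changing the training data changes these probabilities, which is one reason different HMM-based programs can assign the same protein differently.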

Part VI - JAFA – Metaserver

Using one method of function prediction may be adequate. However, it is often better to use multiple methods at once and combine the results. JAFA (Joint Assembly of Function Annotations) is one such metaserver. Go to the JAFA meta server at http://jafa.burnham.org/ and use the NP_031939 FASTA sequence again.

19.) Do you receive any new information?

Answer: Not really. Three molecular functions came up: zinc ion binding, transcription factor activity, and nucleic acid binding. All of them relate to the same thing, the function of the zinc finger domain.
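The metaserver idea from Part VI, combining the output of several predictors, can be illustrated with a simple vote count. This is not how JAFA itself merges results; the tool names and their predictions below are hypothetical, and requiring support from two tools is an arbitrary choice.

```python
from collections import Counter


def consensus_annotations(predictions, min_support=2):
    """Keep function terms reported by at least `min_support` tools.

    `predictions` maps a tool name to the list of terms it reported.
    Terms are returned with the best supported first, ties in alphabetical order.
    """
    votes = Counter()
    for terms in predictions.values():
        votes.update(set(terms))  # each tool votes once per term
    kept = [term for term, n in votes.items() if n >= min_support]
    return sorted(kept, key=lambda term: (-votes[term], term))


# Hypothetical outputs from three unnamed prediction tools.
predictions = {
    "tool_a": ["zinc ion binding", "nucleic acid binding"],
    "tool_b": ["zinc ion binding"],
    "tool_c": ["nucleic acid binding", "zinc ion binding", "kinase activity"],
}
print(consensus_annotations(predictions))
# ['zinc ion binding', 'nucleic acid binding']
```

Agreement between tools raises confidence only if they rely on independent evidence. If all of them inherit the same mislabeled database entry, the vote simply repeats the original error.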

20.) Based on all the function prediction results what can you say about our mystery protein? (What do we know and what do we not know?)

Answer: Based on the functional prediction results, we are very confident that there is a domain in this protein which is actually a zinc finger domain. It is located in the regions from around amino acid 320 to 420. Its function is DNA binding and this part of the sequence is very common in many other proteins, which also have a zinc finger domain. From this exercise, we have not been able to identify the function of other regions of the protein.

Due Mon Sept 18 by 5 PM - email to [email protected]