Computational Molecular Biology

Lecture Notes

by A.P. Gultyaev

Leiden Institute of Applied Computer Science (LIACS) Leiden University February 2016

1 Contents

Introduction . . . 3

1. Sequence databases . . . 4 1.1. Nucleotide sequence databases . . . 4 1.2. sequence databases . . . 4

2. Sequence alignment . . . 5 2.1. Pairwise alignment . . . 5 2.2. Sequence database similarity search . . . 10 2.3. Multiple sequence alignment . . . 13

3. Sequence motifs, patterns and profile analysis . . . 15

4. finding and genome annotation . . . 18 4.1. Search by signal . . . 18 4.2. Search by content . . . 19 4.3. Probabilistic models . . . 20 4.4. Similarity-based search . . . 21 4.5. Genome annotation strategies . . . 21

5. Protein structure prediction . . . 22 5.1. Homology modeling . . . 23 5.2. Fold recognition . . . 23 5.3. Ab initio protein structure prediction . . . 25 5.4. Combinations of approaches . . . 26 5.5. Predictions of coiled coil domains and transmembrane segments . . . 26 5.6. Inverse folding and protein design . . . 27

6. RNA structure prediction . . . 28 6.1. Thermodynamics of RNA folding and RNA structure predictions . . . 28 6.2. Comparative RNA analysis . . . 29 6.3. Detecting conserved structures in related RNA sequences . . . 29 6.4. RNA motif search . . . 30

Recommended reading . . . 31

Exercises . . .32

2 Introduction

The course Computational Molecular Biology aims to make participants familiar with a number of widely used bioinformatic approaches. The purpose of the course is not to make a guided tour along the available resources in the Internet such as databases, programs etc. but rather to give an idea about the main principles behind efficient algorithms for genomic data analysis. Of course, the text below is just some (lecture) notes that may assist in understanding the material given in lectures and should not be considered as independent comprehensive description. This material is accompanied by exercises with tasks to be accomplished using a number of bioinformatic resources, in order to get hands-on experience.

3 1. Sequence databases

The databases of protein amino acid sequences have appeared before nucleotide databases. The Atlas of Protein Sequences and Structures was published in 1965. The development of efficient DNA sequencing techniques resulted in the explosion of the information on nucleotide sequences. In the early 1980’s three major databases have been created: European EMBL nucleotide sequence database, American GenBank and Japanese DDBJ. In 1988 an agreement of a common format has been achieved. Nowadays, the three databases exchange all sequences. However, any update of a sequence entry is performed by a database that owns the entry from the moment of direct submission, to prevent conflicts of updates.

1.1. Nucleotide sequence databases

The main elementary unit of sequence databases is a flatfile representing a single sequence entry. Although the database management systems (DBMS) of modern sequence databases may invoke relational database management systems (DBMS), flatfile format remains a convenient format of exchange between GenBank, EMBL and DDBJ databases. Some small (mainly syntax) differences between flatfiles of the three databases do exist, for instance, line- type prefixes at every line of EMBL files. Below the structure of the GenBank flatfile is considered as an example (see also “Sample GenBank Record”, http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) .

1.1.2 The GenBank flatfile

The main three parts of the GenBank entry are the header (description of the record), features (annotations on the record) and the nucletide sequence itself. These main parts contain various datafields.

The header contains the description of sequence including data such as names of organism, gene name, sequence length, keywords, literature reference data. A very important part of the header is identifiers that are used for searching the entries. Two identifiers are used: “Accession” and “GI” numbers. An accession number is unique for a sequence record. It does not change, even if information in the record is changed (e.g. sequence correction), but version number is assigned (U12345.1; U12345.2). GI-numbers (GenInfo) identifiers are unique for every sequence: if a sequence changes, a new GI-number is assigned. A separate number is used to every ORF encoded by a nucleotide sequence.

Features describe the biological information relevant to the sequence (annotation), such as encoded , RNA sequences, various signals such as protein binding sites, mRNA, exons and introns etc.

Many of the datafields in GenBank sequence entry provide links to the relevant entries of other databases such as literature (e.g. PubMed) or protein databases. These are so-called hard links (between entries from different databases that share the same datafields) of the retrieval system. ENTREZ is maintained by American National Center for Biotechnology Information (NCBI) and provides unified access to various databases and resources. In addition to the access to the primary database of nucleotide sequences (to which data is submitted by scientific community), ENTREZ provides the access to various curated databases such as RefSeq (non-redundant reference sequences), ENTREZ Gene (gene-centered information) etc.

1.2. Protein sequence databases

One of the oldest and very popular protein sequence databases is SWISS-PROT. It was created in 1987 by collaboration of University of Geneva and EMBL. The database is accessible via both Entrez (NCBI) and SRS (European Bioinformatics Institute, EBI) systems. It can be called a secondary (curated) database: it adds value to what is already present in a primary database (nucleotide sequence database). A very small proportion of amino acid sequences is directly submitted to SWISS-PROT.

The data in SWISS-PROT sequence file may be divided into the core data and annotation. The core contains a sequence, bibliography and taxonomic information. The annotation includes function(s) of the protein, domains, specific sites etc.

In addition to SWISS-PROT, the protein entries in the Entrez retrieval system have been compiled from several databases, and also include translations from annotated coding regions in GenBank and RefSeq. ENTREZ Protein database entries provide hard links to relevant entries from other databases. Furthermore, ENTREZ system provides the links to the protein sequences similar to the one contained in the accessed entry (a concept of neighboring entries in ENTREZ).

It should be noted that some protein-coding sequences (open reading frames, ORFs) may remain unnoticed (unannotated). A popular resource of ENTREZ is the ORF Finder with relatively simple algorithm searching for potential ORFs in a given nucleotide sequence in all 6 reading frames (including complementary strand).

4 2. Sequence alignment

In general, alignment of sequences can be defined as mapping between the residues of two or more sequences, that is, finding the residues that are related to each other due to phylogenetic or functional reasons. One of the main goals of sequence alignment is to determine whether sequences display sufficient similarity to conclude about their homology. However, similarity is not equivalent of homology. Similarity (of sequences) is a quantity that might be measured in e.g. percent identity. Homology means a common evolutionary history.

In principle, all alignment methods try to follow the molecular mechanisms of sequence evolution, namely substitutions, deletions and insertions of monomers:

AGTCA AGTCA AGT-CA AGACA AG-CA AGTACA (substitution) (deletion) (insertion)

An alignment that spans complete sequences is called global alignment, an alignment of sequence fragments - local alignment. Depending on a biological problem, either global or local alignment is more efficient to detect a similarity between sequences. For instance, global alignments are advisable in case of relatively closely related sequences. Local alignments are a powerful tool to localize the most important regions of similarity, e.g. functional motifs, that may be flanked by variable regions that are less important for functioning. Obviously, local alignments should be applied in cases of mosaic arrangement of protein domains or for comparisons between spliced mRNAs and genomic sequences. Alignment tasks can be subdivided into pairwise alignment (two sequences), multiple alignment (more than two) and database similarity search (searching for the sequences in the database that are similar to a given sequence).

2.1. Pairwise alignment

For two sequences, many different alignments are possible (for instance, about 10179 different alignments for two sequences of length 300). Obviously, an alignment with relatively high number of matching residues in two sequences and relatively small number of deletions/insertions is the most likely to have biological meaning, because it assumes smaller number of evolutionary events. The problem to find such an alignment is equivalent to finding the “best” path in a dot matrix, where two sequences are plotted at the coordinates of a two-dimensional graph and dots indicate matching residues. Qualitatively, the best global alignment would be a path visiting as many dots as possible with a preference for diagonal moves as compared to horizontal or vertical moves (because the latter represent deletions/insertions in the sequences). Dot matrices are also very demonstrative for visualizing alignments, especially after “filtering” local regions of low similarity, for instance, leaving only diagonals of consecutive similar residues.

Sequence 2 Sequence 2 A G C T A G G A G A G C T A G G A G A . . . A . G . . . . G . C . C . G . . . . G . Seq.1 G . . . . G . A . . . A . G . . . . G . . A . . . A . G . . . . G .

(filtering out similarities of less than 3 consecutive residues)

The example shown above suggests two alternative alignments:

AGCTAGGAG AGCTAGGAG-- seq.1 ||| ||| or ||| |||| AGCGGAGAG AGC--GGAGAG seq.2

The second alignment has more matching residues than the first one, but it contains deletions. Obviously, the selection of the best alignment depends on scoring system for matches, mismatches and deletions/insertions. Thus, finding the optimal alignment requires: (1) a definition of scores for matches, mismatches and deletions/insertions; (2) an algorithm to find the best score.

5 2.1.1 Substitution matrices

A simple match/mismatch scoring (such as 1 for a match, 0 for mismatch) is not the most effective, especially in proteins, where “conservative substitutions” of amino acids with similar properties are relatively frequent. For instance, substitutions of functionally important polar residues by polar ones or mutual substitutions of hydrophobic residues, e.g. Arg → Lys or Val → Ile. Therefore, the idea of substitution matrix (Dayhoff et al., 1978) has been introduced. Such a matrix (dimensions 20×20 with diagonal symmetry) defines different scores for residue matches and all possible substitutions as well. The score values can be estimated from the previously aligned sequences with trusted alignments.

An efficient way to introduce such scores is a log-odds calculation: a score proportional to the logarithm of the ratio of observed substitution frequency (target frequency) to the background frequency which is determined by chance. Thus, the score for a substitution of two residues a and b is

S(a,b) ~ log [ pabobserved / ] = log [ pabobserved / ], pabbackground fa fb where pabobserved is the probability (frequency) of a/b substitution observed in used alignments, and fa and fb are the amino acid occurencies. The logarithm is used for calculations because adding such scores for positions in any new alignment would be equivalent to estimating a likelihood of the hypothesis that the alignment reflects and has a non-random pattern (this likelihood is equal to the product of likelihoods of non-random aligning of all residue pairs, if the pairs are statistically independent).

The first widely used substitution matrices were based on the point-accepted-mutation (PAM) evolutionary model. In this model, a unit of protein divergence (1 PAM) was introduced, corresponding to a change of 1% of amino acid sequence. It is very important to note that 100 PAM change does not lead to a complete change of the sequence, because some residues may change several times while others will not be substituted at all. Alignments of very closely related sequences (trusted alignments in any scoring system) were used to estimate the target frequencies corresponding to 1 PAM. These values can be extrapolated to more distant sequences, e.g. PAM250 (the matrix originally published by M. Dayhoff et al., 1978).

A different strategy was used for so-called BLOSUM matrices. They were derived from alignments contained in the database BLOCKS of local multiple alignments. Here the frequencies were computed directly from alignments. In contrast to PAM model, relatively distant sequences were considered, to avoid statistical biases due to overrepresentation of some sequences. Thus, BLOSUM62 matrix has been computed from the sequences having maximum 62% identity (sequences with higher identities were merged into single strings). The performance of BLOSUM62 turned out to be very good, and this is a standard matrix for many protein alignment programs.

The BLOSUM62 scoring matrix:

A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 A R N D C Q E G H I L K M F P S T W Y V

6 For DNA, the 4×4 matrices are relatively more simple, for instance, BLASTN program is using a scoring system with the score +2 for a match, and -1 for a mismatch. Sometimes the differences between more frequent purine-purine or pyrimidine-pyrimidine transitions (A ↔ G, C ↔ T ) as compared to transversions (A ↔ C, A ↔T, G ↔ T etc.) are taken into account. For instance, a matrix used in BLASTZ, a program for comparisons of large genomic sequences:

A C G T A 91 -114 -31 -123 C -114 100 -125 -31 G -31 -125 100 -114 T -123 -31 -114 91

2.1.2 Gap penalties

Locations of insertions/deletions in the alignments are called gaps. In attempt to minimize the number of gaps, some negative scores (penalties) are introduced. The penalty values are subtracted from the scores calculated in non-gapped regions of alignments. The most common formula for a gap penalty is

S(gap) = G + L.n, where G is the gap-opening penalty (also called gap existence cost), L is the gap-extension penalty (gap extension cost) and n is the number of residues in the gap (the gap length). There is no solid theory here and the choice of values is highly empirical. In the combination with BLOSUM62 substitution matrix, the values of G around 10-15 and L around 1-2 turned out to perform rather well.

2.1.3. Algorithms for finding the optimal alignment

The problem of finding the optimal alignment of two sequences (given a defined scoring system) has been solved using a dynamic programming algorithm. Dynamic programming approach is used in various complex optimization problems when solution of a problem can be reduced to a subproblem solution with some recursive calculation for the problem solution on the basis of the subproblem. In case of the alignment problem, formulated as finding the “best” path in a matrix (using scores for matches/mismatches and gap penalties), a very important observation is that any partial subpath of the optimal one is also the optimal path for the alignment of two subsequences. The main recursive definition for alignment of two (sub)sequences of the length m and n residues is as follows: the optimal score S(m,n) is the best value out of three scores:

S(m-1, n-1) + (score of aligning residues m and n), S(m-1, n) + (gap penalty), S(m, n-1) + (gap penalty).

Using this definition is possible to both build a matrix of the best scores for all possible subalignments (called dynamic programming matrix) and to backtrack the optimal alignment for the full sequences. The dynamic programming matrix is filled starting from the smallest subsequences using “bottom-up” approach. After completing the matrix, it is possible to restore the sequence of calculations that has led to the final score.

7 An example (from S. Eddy, 2004): two DNA sequences X = TTCATA and Y = TGCTCGTA. The scores: +5 for a match, -2 for a mismatch, -6 for each insertion or deletion. The dynamic programming matrix is filled as follows:

0 T G C T C G T A (Y) 0 0 -6 -12 -18 -24 -30 -36 -42 -48 [alignment to “initialising” 0] T -6 5 -1 -7 -13 -19 -25 -31 -37 [ S(1,1)=5 (T-T match) ] T -12 -1 X C -18 -7 other values A -24 -13 correspond to gaps T -30 -19 A -36 -25

S(2,2) = max (5-2; -1-6; -1-6) = 3 S(2,3) = max (-1-2; 3-6; -7-6) = -3 S(2,4) = max (-7+5; -3-6; -13-6) = -2 etc.

S(3,3) = max (3+5; -3-6; -3-6) = 8 S(3,4) = max (-3-2; 8-6; -2-6) = 2 S(3,5) = max (-2+5; 2-6; -8-6) = 3 etc.

0 T G C T C G T A 0 0 -6 -12 -18 -24 -30 -36 -42 -48 T -6 5 -1 -7 -13 -19 -25 -31 -37 T -12 -1 3 -3 -2 -8 -14 -20 -26 C -18 -7 -3 8 2 3 A -24 -13 -9 T -30 -19 -15 A -36 -25 -21

↓ ... complete dynamic programming matrix:

0 T G C T C G T A 0 0 -6 -12 -18 -24 -30 -36 -42 -48 T -6 5 -1 -7 -13 -19 -25 -31 -37 T -12 -1 3 -3 -2 -8 -14 -20 -26 C -18 -7 -3 8 2 3 -3 -9 -15 A -24 -13 -9 2 6 0 1 -5 -4 T -30 -19 -15 -4 7 4 -2 6 0 A -36 -25 -21 -10 1 5 2 0 11

The final best score S(6,8)=11 has been computed (backtracking) as follows:

S(6,8)← S(5,7)← S(4,6)← S(3,5)← S(2,4)← S(1,3)← S(1,2)← S(1,1) (+5) (+5) (-2) (+5) (+5) (-6) (-6) 11 6 1 3 -2 -7 -1 5

The backtracking allows one to reconstruct the optimal alignment:

T--TCATA sequence X TGCTCGTA sequence Y

Such an algorithm, given a scoring system, guarantees to find the optimal global alignment of two sequences. The algorithm is often called Needleman-Wunsch algorithm by names of the authors (1970). An extension of this strategy, the Smith-Waterman algorithm (1981) has been developed to find the optimal local alignment. A locally optimal alignment is defined as the one that cannot be improved either by extending or shortening the alignment regions. Both global and local optimal alignments can be computed, for instance, by a tool align (www.ebi.ac.uk/Tools/emboss/align/): algorithms needle and water.

8 2.1.4. Estimates of statistical significance for alignments

Given a scoring system and a (dynamic programming) algorithm for optimal alignment, any two sequences can be aligned. However, there is no guarantee that this alignment is a consequence of homology between the sequences. Apparently, this is likely to be true if the obtained alignment would be very unlikely to be determined by chance alone. Therefore, an estimate of statistical significance is very important.

For global alignments, there is no mathematical theory to estimate a score that can be expected by chance. Significance of a global alignment can be estimated by comparison with statistics yielded by alignments of randomised (permuted) sequences. For instance, in order to estimate a significance for the alignment of two sequences X and Y of lengths m and n residues, respectively, one can generate 100 sequences of m residues and 100 sequences of the length n by “reshuffling” of residues. It is important that these sequences have the same sizes and residue composition as original ones, because the statistics depend on these parameters. Thus, 100 “random” alignments can be produced and their scores can be compared with the score of the real alignment. Still, a quantitative estimate of significance is not very straightforward, because the distribution of scores in random alignment is not normal. Qualitatively, it is possible, for instance, to estimate the probability to get a given alignment by chance (P-value) as less than 0.01 if 100 random alignments yielded the scores that are less than the alignment of interest.

For ungapped local alignments the statistical estimates can be directly calculated. For two random sequences of lengths m and n (sufficiently large), the probability to find at least one gap-free alignment (called high-scoring segment pair, HSP) with a score at least S (P-value) is equal to

P(S) = 1- exp [ - Kmn exp ( - λS) ], where K and λ depend on scoring rules and residue frequencies. The formula is derived from a so-called extreme value distribution, which is valid for the scores of random HSPs. The expected number HSPs with score at least S (E-value) is

E(S) = Kmn exp ( - λS).

For gapped alignments, a similar theory can be developed, but some large-scale estimates for randomised sequences are necessary to calculate the parameters used in the analytical formulas.

• The optimal alignment of two sequences (in terms of scoring system) is not always the optimal one in terms of biological significance. Frequently, the outputs of standard alignment programs require further manual improvement. For instance, it can be done on the basis of known features in the sequences such as codons or structural motifs.

9 2.2. Sequence database similarity search

One of the very frequent problems is to search in the database of sequences for the entries that share some similarity with a given sequence (query). A general approach here is to align the query sequence to each subject sequence in the database, producing a list of so-called hits: sequences with relatively high similarity to the query.

For a database search, a straightforward application of the optimal alignment methods (dynamic programming) is not efficient: this would be too slow. Some approximations are necessary to speed up the database scanning. One of the essential approximations is to break the sequences into short “words” (stretches of residues) and to look for the words similar in two sequences (an idea is that a “true” alignment of two sequences should contain at least one common word).

The first widely used program of this kind was FASTA (Pearson & Lipman, 1988). The main strategy of the algorithm was first to identify some diagonal regions of high similarity according to a simple scoring for words, with subsequent extension and joining of these initiating diagonals using more accurate computation of scores. The main steps of FASTA are as follows (variations of the described strategy are possible): (Pearson & Lipman, 1988) (A) A search for “words”, for instance, with at least 4 consecutive matches. Then any diagonal region in the complementarity matrix can be scored using some formula that takes into account the number of words and the distances between them. The best (say, 10) diagonal regions are selected at this step.

(B) The best 10 regions are rescored using a substitution matrix, allowing conservative substitutions and shorter runs of identities to contribute to the similarity score. The subregions with maximal scores for each of the best diagonals are identified (“initial regions”).

(C) The algorithm joins some of the compatible initial regions, computing an optimal combination.

(D) Within some band centered around the highest scoring intitial region, the alignment is recalculated using a dynamic programming algorithm.

10 Currently, the most popular program for database similarity search is BLAST (Basic Local Alignment Search Tool). BLAST implements a strategy to find statistically reliable local alignments. The main features of this strategy are as follows: (Altschul S.F. et al., 1990)

- Instead of looking for words with exact matches, the algorithm searches for words in a subject sequence that have a score at least T (threshold) in comparison with a word from the query. Thus the word size W could be high without sacrificing sensitivity. Using the parameter T, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance to achieve a score at least S. In the standard BLAST for nucleic acids, W=11. More fast algorithm is used in the megaBLAST with the default of 28 nucleotides. The standard value for protein-protein BLAST is 3.

- When a word hit is found, BLAST extends this local alignment, thus yielding the maximal score possible for the region.

- An important feature of BLAST is using mathematically tractable statistical estimates in order to derive the optimal parameters.

- The BLAST strategy admits numerous variations, in particular, in the ways to allow gaps in the extensions of hits.

An example below shows two BLAST hits obtained with the sequence of human interferon-gamma as a query. The output provides the links to the subject sequence entries in the database and the statistical estimates. In the protein sequence search, a “consensus” sequence with identical residues and conservative substitutions (together defined as “positives”) is also shown.

>gi|27549285|gb|AAO17286.1| interferon-gamma [Gallus gallus] Length = 164 Score = 66.2 bits (160), Expect = 3e-10 Identities = 40/114 (35%), Positives = 59/114 (51%), Gaps = 9/114 (7%)

Query: 34 LKKYFNAGHSDVADNGTLFLGILKNWKEESDRKIMQSQIVSFYFKLFKNF----KDDQSI 89 LK FN+ HSDVAD G + LKNW E + ++I+ SQIVS Y ++ N + I Sbjct: 37 LKADFNSSHSDVADGGPIIAEKLKNWTERNQKRIILSQIVSMYLEMLANTDKTKPHTKHI 96

Query: 90 QKSVETIKEDMNVKFFNSNKKKRDDFEKLTNYSVTDLNVQRKAIHELIQVMAEL 143 + + T+K ++ KK D L + DL VQ KA +EL ++ +L Sbjct: 97 SEELYTLKNNL-----PDGVKKVKDIMDLAKLPMNDLRVQLKAANELFSILQKL 145

>gi|16588861|gb|AAL26921.1| IFN-gamma [Canis familiaris] Length = 53 Score = 62.0 bits (149), Expect = 5e-09 Identities = 32/53 (60%), Positives = 38/53 (71%)

Query: 103 KFFNSNKKKRDDFEKLTNYSVTDLNVQRKAIHELIQVMAELSPAAKTGKRKRS 155 KF NS+ KR+DF KL V DL VQRKAI+ELI+VM +LSP + KRKRS Sbjct: 1 KFLNSSTSKREDFLKLIQIPVNDLQVQRKAINELIKVMNDLSPRSNLRKRKRS 53

Different BLAST programs are developed for various tasks. In addition to straightforward protein/protein and nucleotide/ nucleotide alignments, BLAST provides the options to compare possible proteins encoded by nucleotide sequences. Such options are very useful, for instance, for finding unannotated in the database (using a protein query, to find nucleotide sequences that might encode similar proteins). The main BLAST programs are as follows:

- blastp: protein query - protein database - blastx: nucleotide query translated - protein database - tblastn: protein query - nucleotide database translated - tblastx: nucleotide query translated - nucleotide database translated

It is possible to limit the search by some part of the database: for instance, “blasting” , EST database etc. Also, a special option to align two sequences is developed.

Statistical estimates are extremely important for the database similarity search, especially due to the enormous sizes of modern databases. Every BLAST hit is accompanied by such estimates, the E-value being the most demonstrative, showing a number of alignments with the scores better or equal to the given one, expected by chance alone (significant alignments should have rather low value).

11 Essentially, BLAST searches may be optimized by user for different levels of expected sequence similaty. For instance, highly similar nucleotide sequences are recommended to search using a so-called megablast option. In megablast, relatively large word sizes W are used (default: W=28). More dissimilar sequences may be searched by the discontinuous megablast algorithm (e.g. W=11 or W=12). The blastn option is intended for the less similar sequences, with W=11 (default) or even W=7. Diminishing word size makes the algorithm slower, but allows to retrieve more distant similarities. Furthermore, an option to automatically adjust parameters for short input sequences may be used as well.

BLAST is a very widely used program: (hundreds of) thousands of queries per day. It is accessible via ENTREZ system.

Accepted input formats in BLAST

- Accession number used in the ENTREZ system.

- A “bare sequence” can be typed or copy-pasted into the BLAST window, e.g.

MTINFEKLT… or 1 mtinfeklt... 61 fipsanltgi ssaeslkisq...

- A so-called FASTA format, frequently used in many programs for sequence analysis. It has a header line starting with right angle bracket “>” usually followed by some identifiers, and a sequence in the subsequent lines. >gi|12345 XXX-protein.. MTINFEKLT… ...

- Amino acid residues should be in single letter coding.

- In addition to standard letters used for nucleotides and amino acids, BLAST accepts specific coding for flexible definitions like e.g. R = [G or A] (nt), Y = [T or C] (nt), N = any nucleotide, X = any amino acid etc. The same standard coding is used in the sequence databases.

12 2.3. Multiple sequence alignment

2.3.1. Progressive multiple sequence alignment

Multiple alignment of many sequences is significantly more complicated task than the pairwise alignment. A direct application of accurate dynamic programming procedure is very demanding computationally and is limited to relatively small numbers of relatively short sequences. The commonly used approach is progressive multiple sequence alignment, based on pairwise alignment of the most similar sequences with gradual addition of the more distant ones. The most widely used is CLUSTAL series of programs.

The progressive alignment procedure is divided into three main parts:

1. All possible pairwise alignments between sequences in a given set. These pairwise alignments are used to calculate a distance matrix (the pairs of sequences with high similarity scores are assumed to be characterised by short evolutionary distances).

2. A guide tree is calculated from the distance matrix. The branching order of the guide tree, similar to evolutionary trees, follows the extent of relatedness of sequences.

3. The sequences are progressively aligned according to the tree branching order, starting from the closely related sequences. At every step the pairwise alignment is used to align two (clusters of) sequences belonging to some node in the tree. Alignment of sequence clusters differs from simple pairwise sequence alignment only by calculation of substitution scores from many residues rather than from two (see also below). Within an already built cluster, the positions of sequences are not changed, so that the gaps appearing during alignment to another cluster are introduced simultaneously in all cluster members.

Various modifications are possible at every part of the algorithm, for instance, different ways to convert similarity scores to distances, clustering techniques to build a tree, approaches to construct consensus sequences etc.

Guide tree. The first versions of CLUSTAL used a procedure of UPGMA (unweighted pair group method with arithmetic averages) to build the guide tree. This procedure is very simple from both conceptual and computational points of view. In this method, the first cluster is formed between the most related sequences (say, seq1 and seq2). A distance between this cluster and any other sequence seqX is calculated as arithmetic average d(seqX, new cluster) = [ d(seqX, seq1) + d(seqX, seq2) ] / 2.

Now a new distance matrix can be calculated and the closest sequences (or clusters) selected. The procedure is repeated, so as at every step the distance between any two clusters is calculated as arithmetic average of all pairs of sequences from these clusters. In CLUSTAL W and subsequent packages, the neighbour-joining method (NJ) has been implemented. This method estimates a divergence along each branch of the tree. It builds unrooted trees by a “star-decomposition” procedure. At every step a decision to join two sequences (or clusters) is done using a modified distance matrix that also takes into account a divergence of these two objects from others. The root of the tree is found by a “mid-point” method: finding a point where the means of branch lengths on two sides of the root are equal.

Sequence weights have been introduced in CLUSTAL W in order to compensate an unequal representation of some motifs in the dataset: for instance, the presence of a subgroup of very similar sequences while another subgroup has few representatives. The weights were introduced to up-weight the divergent sequences and down-weight those that are very similar to other sequences. The weight of a sequence was calculated by adding the values equal to lengths of branches in the tree between the sequence and the root, divided by numbers of sequences sharing them, for instance, as shown below:

0.08 seq.1 W(1)=0.205 0.2| | |0.08 seq.2 W(2)=0.205 ______| | 0.1 | 0.05 seq.3 W(3)=0.225 | |0.3| _____| |0.06 seq.4 W(4)=0.226 | |______seq.5 W(5)=0.5 0.5

W(1) = 0.08 + (0.2/2) + (0.1/4) = 0.205 W(2) = 0.05 + (0.3/2) + (0.1/4) = 0.225 W(5) = 0.5 13 The weighting is used for scoring the matching of residues in the alignment process. For a given position in the alignment, the score is “weighted average” of values M from a substitution matrix. If clusters of sequences are considered, sequences from one cluster are compared to sequences from another.

For instance, assume aligning a cluster of 4 sequences and a cluster of 2. The score at any position of the alignment is calculated as weighted average of 4x2=8 matrix values: ↓ seq.1 PEEKSAVTAL Score = M(T,V)*W(1)*W(5) seq.2 GEEKAAVLAL + M(T,I)*W(1)*W(6) seq.3 PADKTNVKAA + M(L,V)*W(2)*W(5) seq.4 AADKTNVKAA + M(L,I)*W(2)*W(6) ↑ + M(K,V)*W(3)*W(5) + M(K,I)*W(3)*W(6) seq.5 EGEWQLVLHV + M(K,V)*W(4)*W(5) seq.6 AAEKTKIRSA + M(K,I)*W(4)*W(6)

In order to optimize the performance of multiple alignment procedure for sequences of various degrees of similarity, some flexible definitions of scoring are very useful, such as position-specific gap penalties and variable substitution matrices. For instance, gap penalties at some position may be decreased if the previous alignment steps have already identified a gap at this position. The choice of the matrix may be for instance BLOSUM80 for 80-100% similarities, BLOSUM 62 for 60-80%, BLOSUM45 for 30-60% and BLOSUM30 for less than 30%. The choice of a suitable matrix is done automatically at each step of the multiple alignment.

Multiple alignment quality may be esimated in terms of sequence similarity derived from the alignment. Different scoring systems, so-called objective functions, can be introduced. For instance, for each position (column) of the alignment a quality score may be calculated as e.g. sum-of-pairs similarity score (according to a substitution matrix) and the total alignment score is computed as the sum of all columns.

One of the serious problems in straightforward progressive multiple alignment is the local maximum problem: due to a “greedy” nature of the alignment strategy, there is no guarantee that the global optimal solution will be found. Any misaligned region produced early in the alignment process cannot be corrected later, when new information from other sequences is added. In particular, an incorrect branching order of the initial tree may significantly contribute to the problem.

2.3.2. Alternatives to the progressive alignment strategy

Some modifications of progressive alignment strategy can improve the algorithm performance. For instance, iterative procedures may refine and improve the initial alignments and the tree: the estimates of the scores in the intermediate alignments allow the algorithm to make alignments and the tree mutually consistent, thus improving the accuracy of the final result. For instance, a guide tree can be build using so-called k-mer (or k-tuple) pairwise distances (based on common subsequences of the length k in two sequences) and used for progressive alignment. In turn, this progressive alignment can be used to calculate pairwise distances yielding another tree, which could yield a better alignment.

It is possible to generate alternative alignments that can be estimated by an objective function and improved using some algorithm for alignment modification. For instance, the algorithm of T-Coffee package (Notredame et al., 2000) begins with a “library” of alignments: for each pair of sequences in the dataset, both best global alignment and the best 10 local alignments are computed. In the second step, a multiple sequence alignment is assembled, with a criterion to be maximally consistent with the library alignments.

In recent years several algorithms have been developed in order to handle large alignments. Progressive alignments that are based on all pairwise distances become impossible in case of large numbers of sequences (time and memory requirements grow quadratically). It is possible to make guide tree construction faster by using a selection of seed sequences and calculation distances of all sequences to seeds only. This strategy is used in Clustal Omega, which is able to construct accurate alignments for large numbers of sequences. Clustal Omega can use both k-mer distances and those based on pairwise alignments (so-called Kimura distances). Clustal Omega also exploits clustering of sequences to construct subclusters that allow the calculation of profiles that describe alignments within subclusters by probabilities of substitutions, deletions and insertions. The profiles can be aligned to each other (see also next chapter), and can be converted into so-called hidden Markov Models (HMMs), like e.g. in Clustal Omega. Alignments of HMMs can be efficiently computed, yielding the full alignment.

Finally, it is important to mention that there is no precise definition of the optimal alignment and there is no guarantee that the best solution is found in any of the programs available. Therefore, an inspection of multiple alignments is a “must”. Usually multiple alignment packages include special editing algorithms to make the changes manually.

14 3. Sequence motifs, patterns and profile analysis

Sequence motif is a conserved stretch of residues that confer a specific function and/or structure to a biopolymer. Inferring a function from (new) sequence is often based upon comparative sequence analysis with identification of (partial) sequence similarities. Instead of aligning a sequence against many entries of a database (to find a functional similarity), a library of functionally important motifs can be defined and used for similarity search. For instance, a consensus sequence can be determined as the one that has all the absolutely conserved residues and arbitrary ones at positions where some variability is observed:

KSQETPAR protein 1 fragment KSQETQAR protein 2 fragment KVPEVQGR protein 3 fragment K..E...R consensus

Some consensus sequences may have degenerate positions. Such motifs can be defined using motif descriptors like generalized profiles:

C-x(4,6)-[FYH]-x(5,10)-C-x(0,2)-C-x(2,3)-C (profile or pattern) C---RKNQ--Y-RHYWSENLFQ-C---FN---C---SL---C (sequence)

Sometimes a pattern is rather “weak”:

[DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVMFYWH]-[LIVMFYSTANQH]

(example of a pattern from the PROSITE database that will match any sequence which contains D, N, S, T, A, G or C followed by G, S, T, A, P, I, M, V, Q or H followed by 2 residues followed by G etc.)

Positions with ambiguities can be also indicated with curly brackets: e.g. {AP} means “any amino acid except A and P ”.

The information contained in the alignment of sequences constituting a pattern (consensus) can be converted into a scoring based on a position-specific scoring matrix (PSSM). For any target sequence, PSSM (also called profile) specifies the scores of finding each of the 20 amino acid residues at particular positions of a target sequence. For proteins, PSSM has dimensions 21xL, where L is the motif size, to include possible gaps. Of course, similar approach can be applied for nucleic acids, in searching for protein binding sites in DNA, specific RNA motifs etc. PSSM are also used in gene-finding algorithms (see chapter 4).

The approach has been first formulated by Gribskov et al. (1987). In this work, the PSSM profiles were calculated on the basis of some alignments called “probe”. A modification of dynamic programming algorithm has been also designed to compare any sequence to the profile (compared to the standard alignment program, the major difference is simply in counting the scores from the appropriate PSSM values instead of those from a residue-residue substitution matrix). It has been shown that this approach is very effective in both determining whether a single sequence satisfies a motif and retrieving such sequences from a database.

Many modern algorithms exploit similar approach. Frequently the scores in the PSSM are taken as proportional to logarithms of frequencies of finding particular residues at particular positions, normalized by background frequencies: position weight matrices (PWM).

15 Example of a profile generated by Gribskov’s approach for a 49-residue pattern:

1 2 3 4 5 ...... 48 49 positions

E L V K A ...... S S | sequences (“probe”) G L V E P ...... G S | V S V A L ...... N N | L P V T P ...... S Y |

Profile 1 2 3 4 5 ...... 48 49 positions A 3 2 2 6 6 ...... 4 2 C -2 -2 2 -2 -1 ...... 3 5 D 3 -2 -2 5 0 ...... 5 2 E 4 -1 -2 6 1 ...... 3 1 F 0 3 2 -5 -2 ...... -4 1 G 4 0 2 4 2 ...... 7 2 H -1 -1 -3 1 0 ...... 0 1 I 3 3 11 0 1 ...... -2 0 K -1 -1 -2 5 0 ...... 2 1 L 4 6 8 -2 2 ...... -4 -2 M 4 5 6 0 2 ...... -3 -2 N 1 -1 -2 3 0 ...... 6 5 P 1 3 1 3 8 ...... 3 1 Q 1 0 -2 3 2 ...... 1 -1 R -2 -1 -2 1 0 ...... 0 0 S 1 3 0 3 2 ...... 10 8 T 2 1 2 6 2 ...... 3 1 V 6 4 15 0 3 ...... 0 -1 W -6 1 -9 -6 -5 ...... -2 3 Y -2 -1 -1 -4 -4 ...... -4 1 Δ -9 -9 -9 -9 -9 ...... -9 -9 (gap penalties)

Based on the Gribskov approach and related algorithms, various programs for construction of profiles and for searching a profile library with a sequence query have been developed. Also, various collections of protein profiles are stored in the specialized databases such as PROSITE, Pfam, PRINTS. Being curated databases, these collections have slightly different focuses.

PROSITE was one of the first databases of biologically significant patterns and is still one of the most widely used databases. In PROSITE, each entry is a concise description of the protein family or domain as well as a summary of the reasons leading to the development of the pattern or profile. A very tight relationship exists with the SWISS-PROT protein sequence data bank. Updating of PROSITE and of annotations of the relevant SWISS-PROT entries are often done in parallel. Many programs are developed by different groups to search PROSITE. The InterPro database is an integrated documentation resource with access to several databases of patterns and a combined search program InterProScan.

16 PSI-BLAST: a new generation of protein database search programs

An approach based on iterated construction of profiles and their use for searching in the protein sequence database resulted in a very sensitive algorithm to search for weak but biologically significant similarities. The algorithm, called Position-Specific Iterated BLAST (PSI-BLAST), is based on repeated cycles of combining the hits obtained by a BLAST search into multiple alignment with the query as an “anchor”, followed by construction of corresponding PSSM and using this PSSM as a query. As a result, a PSI-BLAST search may retrieve some distantly related sequences as compared to the standard BLAST, because the search becomes directed to find a (presumably) functionally important motif.

(Altshul et al., 1997) A simplified description of the approach:

Collection of all database sequence segments aligned to the query with E-value below a threshold (default 0.01). ↓ Multiple alignment M. ↓ Alignment columns involving gaps inserted into the query are ignored, so M has the same length as the query. ↓ The matrix scores for a given alignment column could be calculated, using sequence weighting to avoid “outvoting” a small number of divergent sequences by a large set of closely related ones. ↓ Position-specific score matrix (PSSM) is constructed. ↓ PSSM is submitted to BLAST as a query (only minor modifications to the code as compared with single sequence querying). ↓ Multiple alignment M etc.

Some related programs have been developed, for instance, IMPALA (Integrating Matrix Profiles And Local Alignments): a software package for searching a database of PSSM (PSI-BLAST-generated) using a protein sequence as query. Furthermore, the search in the database of profiles is included in current standard BLAST protein-protein program: the detection of known motif (if found) is always displayed before the list of similarity hits.

17 4. Gene finding and genome annotation

The methods for identifying of (protein) coding regions in nucleic acid sequences can be classified into three main categories: - search by signal (search for a specific pattern or consensus); - search by content (constraints upon sequence due to thecoding); - similarity-based search (similarities to known genes).

Obviously, search for gene-finding is based on finding the positions of transcription and translation initiation, pre-mRNA processing sites, stop-codons and transcription termination. Prokaryotes and eukaryotes present different problems for gene finding. Mechanisms of translation differentiation are different. Prokaryotic mRNAs may contain several genes, while eukaryotic mRNAs are nearly always mono-cistronic. Eukaryotic genes contain introns while prokaryotic ones don’t.

4.1. Search by signal

In prokaryotes, only few promoter sequences (so-called “-10 regions) have exact consensus sequence, and, therefore, a matrix evaluation is possible to find potential sites. For instance, for the consensus promoter sequence in E.coli ( TATAAT) various matrices can be constructed like shown below: pos. 1 2 3 4 5 6 A 2 95 26 59 51 1 C 9 2 14 13 20 3 G 10 1 16 15 13 0 T 79 3 44 13 17 96

(percentages of base occurrences in 112 examples of E.coli –10 regions)

Because of multicistronic operons, a search for translation initiation signals in prokaryotes is more important search than that for promoter localization (in contrast to eukaryotes). In addition to the start-codon ATG, a ribosome binding site (RBS, Shine-Dalgarno site) should be searched for.

In prokaryotes, simple rule “ATG preceded at 6 to 9 bases by AGG, GAG or GGA” will find about 80 % of known sites, plus about 60 % more false-positives. If the Shine-Dalgarno consensus includes 4 bases, the number of false-positives is twice less, but at the expence of missing 33% of known sites.

Statistical analysis shows that other positions around initiation codon display biased base compositions. A matrix can be derived for E.coli translation initiation sites:

start-codon pos. –60...... –3 -2 -1 0 1 2 3 4 5...... 40

A 7 6 -2 -15 66 -69 -52 -5 -4 6 5 C -21 1 -9 1 -87 -55 -64 -45 11 -22 14 G -6 -33 -28 -53 -36 -50 107 -5 -37 -44 -21 T 5 -40 -28 0 -53 75 -62 -20 -40 -10 0

In eukaryotes there is no Shine-Dalgarno sequence. A “scanning model” of ribosome is usually valid, so if the 5’ end of mRNA is known, the first “good” ATG in the 3’-direction is normally a start codon. The consensus requirements are more simple than in prokaryotes:

AccATGg (lower case positions are less conserved) or in a matrix form: A T G pos: -4 -3 -2 -1 0 1 2 3 4 5 A -1 12 3 0 14 -40 -40 -1 1 -5 C 8 -18 3 7 -40 -40 -40 -18 5 -5 G -8 -7 -11 -6 -40 -40 14 9 -4 1 T -8 -32 0 -7 -40 14 -40 -8 -6 5

18 For identification of eukaryotic promoters, important signals are the binding sites of various transcription factors. The PSSMs describing these sites are collected in specific databases and are used for search usually in combination with other features (see also below) Various matrices may be calculated for finding positions of introns in eukaryotes. For instance, a PSSM for finding exon-intron junction:

-3 -2 -1 1 2 3 4 5 6

A 34.0 60.4 9.2 0.0 0.0 52.6 71.3 7.1 16.0 C 36.3 12.9 3.3 0.0 0.0 2.8 7.6 5.5 16.5 G 18.3 12.5 80.3 100 0.0 41.9 11.8 81.4 20.9 U 11.4 14.2 7.3 0.0 100 2.5 9.3 5.9 46.2

It is easy to see that these matrices cannot define very specific search (high score thresholds would miss many sites while low thresholds would produce many false-positives). Search for signals can fail in many cases, because these signals are not always present or may occur in other context. The accuracy and sensitivity of such a straightforward search is not sufficient. Therefore, many algorithms are based on the search for content.

4.2. Search by content

Search by content is based on the fact that coding for a protein puts constraints on the nucleotide sequence. The most obvious constraint: a coding sequence should not have a stop-codon within it. In a random sequence, the probability of any codon being followed by an open reading frame (ORF) of over 50 codons is less than 10% and that of over 100 codons – less than 1 %. Thus, since most proteins are more than 100 amino acids, searching for long ORFs is an effective way of finding genes in prokaryotes: AUG ... (long ORF) ... UAG In genomes with high G-C content the reliability of searching for long ORFs decreases because the probability of random encountering a stop-codon (in any frame) is decreased. In eukaryotes, ORF searches can be a useful first step, but not very reliable due to intron presence and relatively short exon lengths.

Searching for ORFs can be made more specific by taking into account amino acid occurrence and codon preferences. It turns out that proteins with different functions and from all species share a typical average amino acid composition (%):

Ala 8.6 Arg 4.9 Asn 4.3 Asp 5.5 Cys 2.9 Gln 3.9 Glu 6.0 Gly 8.4 His 2.0 Ile 4.5 Leu 7.4 Lys 6.6 Met 1.7 Phe 3.6 Pro 5.2 Ser 7.0 Thr 6.1 Trp 1.3 Tyr 3.4 Val 6.6

This can be used to rate ORFs on their likelihood of being translated. Also, some codons are more frequent than others, and this may be used as well. Some other biases are observed in sequences, and this can be also used to measure the probability of a given sequence to be protein-encoding.

Among the most important features, collectively called coding measures, are: - ORF measure (ORF length); - amino acid frequencies; - codon frequencies; - frequencies of bases at certain codon positions (composition measure); - hexamer measure (e.g. the frequency of CAGCAG is relatively high in exons).

19 The most efficient way to combine coding measures to define a single value called a discriminant. A discriminant analysis can be done by categorizing samples (in this case, sequences) into two classes and finding the best discriminant that separates the classes. (Two types are usually used: linear discriminant or quadratic discriminant). For instance, a “computer experiment” has been done in 1992 by Fickett and Tung:

GenBank was divided into successive 108 bp windows. ↓ Only those known to be fully coding or noncoding were selected. ↓ Half of the windows were used to define a discriminant, using various measures. ↓ The other half was used to test the accuracy of predictions: ↓ a correct prediction rate of 88% was found.

It is very important to note that many coding measures are organism - specific. Therefore many computer programs are specifically designed (trained) to predict protein-coding regions in particular species and are not applicable for other organisms!

Modern programs frequently combine the search by signal and search by content into a single algorithm. For instance, search for promoters may score the nucleotide sequences by both the scores defined by some transcription-factor PSSM and the values derived from certain promoter-specific measures (e.g. frequencies of some oligonucleotides). In particular, vertebrate promoter regions frequently have higher contents of CpG nucleotides (CpG islands) than non- promoter regions. The search for promoters can be significantly enhanced by comparative analysis (see also below) in various species and gene expression data.

4.3. Probabilistic models

Basic idea of a probabilistic approach for gene finding is to find the sequence annotation (locations of coding and non- coding sequences) that has the highest probability. The goal is to define a functional role for each nucleotide in the sequence, so as for a DNA sequence S = {n1, n2, ..., nL}, (ni is a nucleotide at position i ) the functional role of each nucleotide could be indicated by a “functional” sequence A = {a1, a2, ..., aL}, where each ai may take some values depending on a functional role, e.g. ‘0’ if ni is a part of non-coding region, value ‘1’ if ni is a part of a gene etc. In other words, the aim of gene-finding is to determine the most likely functional sequence A for given DNA sequence S.

In a Hidden Markov Model (HMM) algorithm, statistical patterns specific for coding/non-coding sequences may be converted into parameters of Markov chain models (a Markov chain represents a series of observations wherein a probability of a given observation is dependent on previous observations). A biologically relevant nucleotide sequence may be modeled in this way, because a probability of observing particular nucleotide is dependent on nucleotides observed at other positions. For instance, within a coding region a probability of observing nucleotides G or A after two nucleotides TA in a coding frame is zero, because this would lead to the stop-codon UAG or UAA (in a non-coding region, there is no such dependence). Many other parameters, e.g. specific nucleotide frequencies, may be introduced into a model. In the framework of such a model, the algorithm is able to compute the probability that any particular functional sequence A underlies a given sequence S. Finding the sequence A corresponding to the highest probability and estimating reliability of the result actually means finding functional regions.

20 4.4. Similarity-based search

The (significant) similarity of a region in a genome to a known gene in some other genome is one of the most reliable predictors. For instance, the hits of BLASTX (translated nucleotide query vs. protein database) may already predict some genes contained in the query. Another approach is to compare genomic sequences with the databases of expressed sequence tags (EST). EST’s are short (300-500 bp) single reads from 3’ends of mRNA. Although these sequences are frequently incomplete and contain errors, a match between a genomic sequence and an EST can be used for gene annotation, especially when the EST spans several exons. Even cross-species EST comparisons (e.g. human vs. mouse EST) can be used for gene searching. There are some limitations in similarity search, such as existence of many pseudogenes in eukaryotic genomes, contamination and errors in EST databases, EST’s derived from unprocessed pre-mRNA (containing introns).

4.5. Genome annotation strategies

The programs for gene annotation are generally classified as similarity-based and ab initio. Ab initio programs use both rule-based algorithms (searching for specific signals, usually with some scoring) and probabilistic models. There is no “ideal” approach. Often a combination of programs is used, for instance to identify “consensus exons” found by several programs.

Annotation strategies may vary considerably. For instance, the approach for human genome annotation, applied by the Human Sequencing Consortium, begins with ab initio gene predictions and then the predictions are strengthened using detected similarities, in particular with EST sequences. The approach of Celera company gave sequence similarities the highest priority, followed by refinement of predictions by a program using a probabilistic model. These are almost opposite strategies!

21 5. Protein structure prediction

The development of algorithms for protein structure prediction from sequence has relatively long history. Various secondary (2D) structure prediction algorithms consider the residues in a polypeptide chain to be in three (helix, strand, coil), four (helix, strand, coil, turn) or even eight states and try to predict the location of these states. The majority of methods for protein structure prediction are based on empirical (statistical) approaches. They try to extrapolate the statistics, revealed in the known structures, to other proteins.

The most simple observation: - Ala, Gln, Leu and Met are commonly found in α helices; - Pro, Gly, Tyr and Ser usually are not; - Pro – a “helix-breaker” (bulky ring prevents the formation of n/n+4 H-bonds)

The algorithm of Chou and Fasman (1978) was a widely used approach in 1970’s-1980’s. It was based on the calculation of a moving average of values that indicated the probability (or propensity) of a residue type to adopt one of three structural states, α-helix, β-sheet and turn conformation. The probabilities were simply the frequencies of a given residue type to be observed in a particular secondary structure, normalized by the frequency expected by chance. Some additional heuristic rules were added that attempted to determine the exact ends of secondary structure elements.

In the further developments of these ideas, the algorithm of Garnier et al. (1978) introduced an estimate of the effect that residues within the region eight residues N-terminal to eight residues C-terminal of a given position have on the structure of that position. For each of 20 amino acids, a profile (17 residues long) was defined that quantifies the contribution this residue makes towards the probabilities of other residues to be in one of four states, α-helix, β-sheet, turn and coil. Thus, for every residue the values from +8 positions are added (four values from each), so as four probability profiles are produced, and at any position the highest profile value predicts the structure.

Further developments of prediction methods (until the early 1990s) were based on implementing various rules for pattern recognition, in particular, the hydrophobicity patterns, periodicity of helices, different ways for calculating propensities in windows of variable sizes (3-51 residues). However, it seemed that prediction accuracy of such approaches stalled at levels slightly above 60% (percentage of residues predicted correctly in one the three states: helix, strand, and other). Next generation of methods incorporated multiple alignments into predictions. They incorporated a concept that homologous proteins should have similar structures. For example, all naturally evolved proteins with more than 35% pairwise identical residues over more than 100 aligned residues have similar structures (Rost, 1999). Predictions were further improved by application of neural networks, trained on known structures.

An efficient algorithm has been developed that combined multiple alignments with neural networks for secondary structure predictions (PSIPRED, Jones, 1999). At the first step, the algorithm uses iterations of PSI-BLAST so as to construct a profile (PSSM) corresponding to the query protein. This profile is used as the input to neural network, and the final output is a 3-state prediction (helix/strand/coil).

Multiple algorithms for protein 3D structure prediction are developed. The modern approaches for protein tertiary structure prediction can be divided into three general strategies:

(1) Comparative (homology) modeling. If there is a clear sequence homology between the target and one or more known structures, an algorithm tries to obtain the most accurate structural model for the target, consistent with the known set. (2) Fold recognition. Algorithms that try to recognize a known fold in a domain within the target protein.

(3) Ab initio methods. Modeling of structures using energy calculations, considering both secondary and tertiary structures. There are overlaps between the methodologies.

22 5.1. Homology modeling

Homology modeling approach assumes that a sequence similarity between a target protein and at least one related protein with known structure (the template) implies the 3D-similarity as well, and therefore allows one to extrapolate template structures to the target sequence.

Main steps in comparative protein structure modeling (Fiser et al., 2001) 1. Identify related structures. 2. Select templates. 3. Align target with templates. 4. Build a model for the target (using information from template structures). 5. Evaluate the model. 6. If model is not satisfactory, repeat the steps 2-5 or 3-5.

Identification of structures related to the target sequence is usually done by searching the database of known protein structures (PDB) using the target sequence as the query. For instance, a specific algorithm for template search in PDB is PDB-BLAST that not only builds a multiple alignment using target as a query, but also constructs similar multiple alignments using all found potential templates as queries (here each alignment is called sequence profile, and it can be converted into a matrix). The templates are found by comparing the target sequence profile with each of the template sequence profiles (e.g. by dynamic programming method). This allows to capture essential sequence motifs for a fold to be predicted. Selection of optimal templates is guided by several criteria, such as overall sequence similarity to the target and the quality of the experimentally determined structure. It is not necessary to select only one template: alignments of the target to different templates may be used for model building. These alignments may be refined after template selection. Various algorithms are used for model building, e.g. based on "rigid bodies" or spatial restraint satisfaction (based on core conserved structures obtained from aligned template structures). There are systems for (web-based) automated homology modeling that are able to predict the structures for one or many sequences without human intervention. For instance, SWISS-MODEL (http://swissmodel.expasy.org), one of the first servers for protein structure predictions, initiated in 1993 and accessible via the ExPASy web server.

5.2. Fold recognition

In case of relatively low sequence similarity of a target protein to the known structures, an attempt may be done to recognize a known fold within the target by a search for an optimal sequence-to-structure compatibility. Mostly this is done by threading algorithms: a target sequence is threaded through templates from the structure database and alternative sequence-structure alignments are scored according to some measure of compatibility between the target sequence with the template structures. The scoring is done using threading potentials, for instance so-called knowledge-based or mean-force potentials. These potentials do not consider physical nature of interactions in proteins and are derived from the statistics of the databases of known structures.

Knowledge-based (database-derived) mean force potentials incorporate all forces (electrostatic, van der Waals etc) acting between atoms as well as the influence of the environment (solvent). For the interaction between two residues (a,b) with a sequence separation k and distance r between specified types of atoms (e.g. Cβ→Cβ, Cβ→N etc.) a general definition of the potential is (“inverse Boltzmann equation”)

Eabk (r) = - RT ln [ fabk (r) ], where the occurrence frequency fabk (r) is obtained from a database of known structures.

A definition of the reference state is very important. A convenient choice for the reference state is Ek (r) = - RT ln [ fk (r) ] , where fk (r) is an average value over all residue types.

Thus the potential for the specific interaction of residues is

ΔEabk (r) = Eabk (r) - Ek (r) = - RT ln [ fabk (r) / fk (r) ]

A solvation potential for an amino acid residue a is defined as:

ΔEasolv (r) = - RT ln ( fa(r) / f(r)), where r is the degree of residue burial, fa(r) is the frequency of occurrence of residue a with burial r, f(r) is the frequency of occurrence of an arbitrary residue with burial r.

23 Obviously, threading algorithms are more complex than sequence-sequence alignments and require some approximations. For instance, in ungapped threading, the query sequence is mounted over an equally long part of template fold. The total alignment score is easily computed as sum of pairwise potentials for all query residues. Ungapped threading is mostly used not for real predictions, but rather for testing and adjusting the energy functions.

Treatment of gaps in sequence-structure alignments presents a problem because a score that should be computed for a given residue in a query sequence, assumed to be aligned to a residue in a template structure, would depend not only on the type of these two residues (as in sequence-sequence alignments), but on the gaps that may be introduced at other alignment positions. Here usually a so-called frozen approximation is used. In frozen approximation, an appropriate comparison matrix of the size N×M (N residues in the template and M in the query) is calculated by replacing the amino acids in the template structure with amino acids from the target sequence one at a time. The rest of the structure is kept intact, and it is assumed that the field created by the native protein will also favour the correct replacement. Although it is a very crude approximation, it is rather efficient, despite the fact that it does not solve the full threading problem. (Sippl, 1993)

The subsequent steps are equivalent to sequence-sequence alignment: the scores in the comparison matrix may be used for calculating dynamic programming matrix leading to the final alignment.

Often in fold recognition methods several scores are produced, related to different aspects of the sequence-structure alignment, some of the most important: initial sequence profile alignment score, number of aligned residues, length of target sequence, length of template sequence, pairwise energy sum, solvation energy sum. Each of these scores separately may be not sufficient, but they can be used to calculate some determinant score.

24 5.3. Ab initio protein structure prediction

Ab initio, or de novo approaches predict a protein structure and folding mechanism from knowledge only of its amino acid sequence. Often the term ab initio is interpreted as applied to an algorithm based entirely on physico-chemical interactions. On the other hand, the most successful ab initio methods utilize information from the sequence and structural databases in some form. Basic idea of an ab initio algorithm: search for the native state which is presumably in the minimum energy conformation. Usually an ab initio algorithm consists of multiple steps with different levels of approximated modeling of protein structure.

For a consideration of side chains in ab initio predictions, a so-called united residue approximation (UNRES) is frequently used: - Side chains are represented by spheres (“side-chain centroids”, SC). Each centroid represents all the atoms belonging to a real side chain. A van der Waals radius is introduced for every residue type. - A polypeptide chain is represented by a sequence of Cα atoms with attached SCs and peptide group centers (p) centered between two consecutive Cα atoms. - The distance between successive Cα atoms is assigned a value of 3.8 Å (a virtual-bond length, characteristic of a planar trans peptide group CO-NH). - It is assumed that Cα - Cα - Cα virtual bond angles have a fixed value of 90° (close to what is observed in crystal structures). - The united side chains have fixed geometry, with parameters being taken from crystal data.

(Liwo et al., 1993)

The only variables in this model of protein conformation are virtual-bond torsional angles γ.

The energy function for the simplified chain can be represented as the sum of the hydrophobic, hydrophilic and electrostatic interactions between side chains and peptide groups (potential functions dependent on the nature of interactions, distances and dimensions of side chains). The parameters in the expressions for contact energies are estimated empirically from crystal structures and all-atom calculations.

An example of the algorithm for structure prediction using UNRES:

1. Low-energy conformations in UNRES approximation are searched using Monte Carlo energy minimization. A cluster analysis is then applied to divide the set of low-energy conformations whose lowest-energy representatives are hereafter referred to as structures. Structures having energies within a chosen cut-off value above the lowest energy structure are saved for further stages of the calculation.

2. These virtual-bond united-residue structures are converted to an all-atom backbone (preserving distances between α-carbons).

3. Generation of the backbone is completed by carrying out simulations in a “hybrid” representation of the polypeptide chain, i.e. with an all-atom backbone and united side chains (still subject to the constraints following the UNRES simulations, so that some or even all the distances of the virtual-bond chain are substantially preserved). The simulations are performed by a Monte Carlo algorithm.

4. Full (all-atom) side chains are introduced with accompanying minimization of steric overlaps, allowing both the backbone and side chains to move. Then Monte Carlo simulations explore conformational space in the neighborhood of each of the low-energy structures.

Monte Carlo algorithms start from some (random) conformation and proceed with (quasi)randomly introduced changes, such as rotations around a randomly selected bond. If the change improves energy value, it is accepted. If not, it may be accepted with a probability dependent on energy increase. The procedure is repeated with a number of iterations, leading to lower energy conformations. A function defining higher energy acceptance probability is usually constructed 25 with a parameter that leads to lower probabilities in the course of simulation ("cooling down" the simulation) in order to achieve convergence and stop the algorithm.

5.4. Combinations of approaches

Many of the modern packages for protein structure predictions attempt to combine various approaches, algorithms and features. One of the most successful examples is Rosetta - ab initio prediction using database statistics. (D. Baker & coworkers) Rosetta is based on a picture of protein folding in which local sequence fragments (3-9 residues) rapidly alternate between different possible local structures. The distribution of conformations sampled by an isolated chain segment is approximated by the distribution adopted by that sequence segment and related sequence segments in the protein structure database. Thus the algorithm combines both ab initio and fold recognition approaches.

Folding occurs when the conformations and relative orientations of the local segments combine to form low energy global structures. Local conformation are sampled from the database of structures and scored using Bayesian logic:

P(structure | sequence) = P(structure) x P(sequence | structure) / P(sequence).

For comparisons of different structures for a given sequence, P(sequence) is constant. P(structure) may be approximated by some general expression favouring more compact structures. P(sequence | structure) is derived from the known structures in the database by assumptions somewhat similar to those used in fold recognition, for instance by estimating probabilities for pairs of amino acids to be at particular distance and computing the probability of sequence as the product over all pairs).

Non-local interactions are optimized by a Monte Carlo search through the set of conformations that can be built from the ensemble of local structure fragments.

In the standard Rosetta protocol, an approximated protein representation is used: backbone atoms are explicitly included, but side chains are represented by centroids (so-called low-resolution refinement of protein structure). The low-resolution step can be followed by high-resolution refinement, with all-atom protein representation. Similar stepwise refinement protocols can be used to improve predictions yielded by other methods, for instance, in loops (variable regions) of homology-modeling structures.

In recent CASP experiments (Critical Assessment of Structure Prediction), the Rosetta approach turned out to be one of the most successful prediction methods in the novel fold category.

Obviously, none of prediction approaches is ideal. Therefore it is reasonable to try to combine the best features of many different procedures or to derive a consensus, meta-prediction. For instance, the 3D-Jury system generated meta- predictions using models produced by a set of servers. The algorithm scored various models according to their similarities to each other.

5.5. Predictions of coiled coil domains and transmembrane segments

Special algorithms have been developed for domains characterized by special types of interactions. The coiled coil domains are very stable structures formed by regular arrangement of hydrophobic and polar residues in adjacent α- helices. This is possible in the amino acid sequences containing repeats of seven residues (heptads) with hydrophobic residues located at the first and the fourth positions of the heptad and preferences for polar residues at positions 5 and 7. It is possible to design an algorithm that would take into account stabilizing interactions in coiled coils to predict such conformations.

Transmembrane proteins contain α-helical segments buried in the membranes. Due to the specific hydrophobic environment in a membrane, protein folding occurs differently as compared to globular proteins folded in the polar water environment. This leads to special folding algorithms, mostly based on known statistics of amino acid frequencies in transmembrane α-helices. Efficient modern algorithms use probabilistic approaches such as Markov models and Bayesian approach.

26 5.6. Inverse folding and protein design

The so-called inverse folding task is the search for a sequence that will fold into a desired structure. Unlike the protein folding problem with presumably single solution (native state), the inverse problem may have many solutions. There are examples of structurally similar proteins with very different amino acid sequences.

Computationally, inverse folding (or protein design) is not an easy problem. Importantly, it is not enough to find a sequence that may fold into a given topology, it is also necessary to ensure that the sequence would not fold into any alternative structure (because of lower free energy). Usually protein design methods are based on probabilistic approaches that gradually improve the quality of solution (sequence folded into unique structure). For a long time, successful examples were restricted by relatively simple topologies such as coiled coils. A breakthrough was achieved in 2003, when a novel 93-residue fold was designed using a computational strategy with multiple iterations back and forwards between sequence design and structure refinement (Rosetta) in order to produce a desired topology. Recently a significant progress was achieved in both algorithms for protein design and understanding of important elements of protein structure that may by used as standard "building blocks".

Inverse folding may be also used for design of RNA secondary structures, with the same main principle: searching for a sequence with the lowest free energy conformation consistent with the predefined structure.

27 6. RNA structure prediction

6.1. Thermodynamics of RNA folding and RNA structure predictions

RNA folding occurs in hierarchical way, with secondary structure formed first, followed by the formation of tertiary structure. The secondary structure of an RNA molecule is determined by base-pairing between complementary regions and is usually described as the configuration consisting of double-stranded helices (stems) and loops of different topologies (hairpins, bulges, internal and multibranch loops), formed by various combinations of stems. Tertiary structure is a spatial (3D) conformation of RNA determined by various interactions between atoms within RNA (bases, ribose, phosphates) and with environment (solvent).

Usually the energy content of the 2-D structure is very high compared to that of 3-D, and the majority of RNA structure prediction algorithms deal with secondary structure prediction without account for 3D-structure elements. The methods for calculating secondary structure are based either on thermodynamic approach or comparative analysis, or some combination of them. The thermodynamics-based methods may apply free energy minimization (thermodynamics) or folding simulations (kinetics). The approaches for 3D structure predictions are based on stereochemical considerations and approximated energy potentials estimated for loop conformations (e.g. in RNA pseudoknots) and/or for interactions between atomic groups present in RNA.

Free energy ΔG of RNA secondary structure can be computed as the sum of stabilizing free energies determined by base-pairing and stacking of neighboring bases (ΔG < 0) and destabilizing loops (ΔG > 0). Many of these parameters have been estimated from thermodynamic experiments and are available in the Internet (e.g. Turner & Mathews, 2010).

The positive free energies of loops are approximated by logarithmic dependences on their sizes. The approximations of general type loop ΔG ~ 1.75 RT ln (size) are used for various loop topologies. This formula is derived from size dependence of conformational entropy (in assumption that enthalpy of loop formation is zero). Some specific sequences, in particular, tetraloops, are extrastable, and usually this is taken into account by addition a negative term in the formula.

The stacking free energies in the stems are calculated according to the nearest-neighbor rules. In helices, the stabilizing free energies, determined by H-bonds and stacking effects, depend on combinations of adjacent base pairs rather than on single pairs. Thus, for calculating the energy of a helix with 5 base pairs 4 stacking values have to be added, see example:

G G A A A U ΔGloop(N=8) = + 5.5 kcal/mol A A C-G ΔGmismatch(CG/AA) = - 1.5 kcal/mol U-A ΔGstacking(UA/CG) = - 2.4 kcal/mol A-U ΔGstacking(AU/UA) = - 1.1 kcal/mol C-G ΔGstacking(CG/AU) = - 2.1 kcal/mol G.U ΔGstacking(GU/CG) = - 2.5 kcal/mol

Stacking between the planar rings of nucleotide bases occurs also in the mismatches and single unpaired nucleotides adjacent to stems. These values also depend on nearest neighbors and are tabulated taking this into account. RNA folding free energies may depend on other contributions such as those arising from pseudoknot configurations, specific sequence motifs and coaxial stacking of helices.

For a given secondary structure, it is easy to compute its free energy by simply adding all values for individual structural elements. However, the number of all possible structures for a given sequence is very large, and in order to find the one with the lowest free energy the dynamic programming is used. The algorithm used for energy minimization by dynamic programming is somewhat similar to the algorithm used for alignment problem. The program mfold (http:// mfold.rna.albany.edu/) is the most frequently used.

28 Using dynamic programming algorithm, it is also possible to compute the equilibrium partition function Q:

Q = ∑ exp [ - ΔG(S) / kT ] s as the sum over all possible structures S. Knowing the partition function, one can compute the probability of a given structure S:

exp [ - ΔG(S) / kT ] P(S) = Q

Furthermore, it is possible to calculate the probability of a given .

Apart from the global minimum conformation, analysis of probable alternative structures is frequently necessary. For instance, mfold can calculate suboptimal structures, within some ΔG increment of the energy minimum. A convenient way to overview the most likely suboptimal structures is to use energy dot plots that contain superpositions of possible foldings (e.g. base pairs from all structures with free energies less than some value). Dot plot is a two-dimensional plot where the sequence positions are on the two coordinates so as any base-pair can be represented by a dot and a helix by a diagonal region.

6.2. Comparative RNA analysis

Prediction of structure using comparative analysis is based on the assumption that functionally related sequences should have similar structures. Comparisons of RNA secondary structures usually exploit a principle of nucleotide covariations (coordinated variations), in order to derive a consensus secondary structure for related RNA molecules.

n n n n n n n n n n n n n-n n-n n-n G-C U-A A-U n-n n-n n-n n-n n-n n-n A-U G-C C-G

RNA 1 RNA 2 RNA 3

AnnGnnnnnnCnnU RNA 1 GnnUnnnnnnAnnC RNA 2 CnnAnnnnnnUnnG RNA 3 (((((....))))) "bracket view" of the consensus

Many models of RNA secondary structures such as rRNAs, RNase P etc. were deduced mostly from comparative analysis. Base covariations are not necessary determined by Watson-Crick pairing, but can be a consequence of non- canonical base pairs (e.g.G•A, C•C etc.) that may be found in both helical duplexes and long-range interactions.

6.3. Detecting conserved structures in related RNA sequences

A number of algorithms is designed to predict most likely conserved structures in datasets of related RNA molecules.For instance, the program RNAalifold of ViennaRNA Web Services (http://rna.tbi.univie.ac.at) predicts the consensus secondary structure for a set of aligned sequences using modified dynamic programming algorithm that adds a covariance term to the standard energy model. The program calculates the probabilities of separate base pairs from the ensemble of suboptimal structures, taking base pair conservation into account. If RNA sequences are not reliably aligned, the algorithms computing both the alignment and consensus structure are used. Usually they iterate back and forward from the steps improving the structure-based alignment and retrieving the consensus structures from the alignment. Of course, due to complexity of the problem, such algorithms are less accurate than those based on reliable alignments.

29 6.4. RNA motif search

Several programs have been developed for genome-wide search of known structural motifs (e.g. tRNAs, snoRNAs, iron- responsive elements etc.) using descriptors for consensus structures. Different descriptors with various sequence and structure requirements (scoring functions or threading potentials) may be combined. For instance, below a descriptor in a so-called bracket notation for the consensus structure consisting of a conserved hairpin with a bulge:

(((((.(((((...... )))))))))) NNNNNCNNNNNCAGWGHNNNNNNNNNN where N is any nucleotide; W is A or U; H is C, A or U; base-pairs are indicated by parentheses. Such kind of descriptor can be threaded along the sequences in the database to search for matching sequences.

More efficient and flexible descriptions of RNA structural motifs are based on covariance models derived from alignments of related RNA molecules. Such covariance models, that serve as mathematical descriptions of RNA structures, can be used as queries for the search in a given sequence, genome or database. The idea is somewhat comparable to the use of profiles (PSSMs) in the protein motif search. For instance, such an approach is implemented in the database of RNA families (Rfam, http://rfam.sanger.ac.uk/), where each family is stored in the form of an alignment (“SEED” alignment) that specifies a covariance model used for the search for new family members.

For instance: M14879/224-175 AAACAGAGAAGUCAACCAGAGAAACACACGUUGUG..GUAUAUUACCUGGUA M17439/226-177 AAACAGAGAAGUCAACCAGAGAAACACACGUUGUG..GUAUAUUACCUGGUA M21212/157-106 CAACAGCGAAGCGGAACGGCGAAACACACCUUGUGUGGUAUAUUACCCGUUG #=GC SS_cons ...... <<<<<...... <<<...>>>...... >>>>>.

New sequences (database hits) are aligned to the CM, leading to an extended FULL alignment. The process can be iterated, if some sequences from the FULL are taken to the SEED in order to refine it (this is done in curated way).

30 Recommended reading

Altschul S.F. et al. The statistics of sequence similarity scores. (BLAST tutorial). http:// www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Eddy S.R. (2004a). What is dynamic programming? Nature Biotechnology 22:909-910

Eddy S.R. (2004b). Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 22:1035-1036.

Gribskov M., McLachlan A.D. & Eisenberg D. (1987). Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84:4355-4358.

NCBI Resource Coordinators (2016). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 44:D7-D19. (Reference source/summary/update of NCBI resources, in particular, Entrez retrieval system.)

Pearson W.R. & Lipman D.J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.

Thompson J.D., Higgins D.G. & Gibson T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.

31 Exercises

The exercises are intended to make the course participants familiar with some of the popular tools and resources.

Sequence databases, sequence alignment, sequence database similarity search:

1. The NCBI reference sequence of human beta-globin mRNA has the accession NM_000518. What is the accession number of the encoded protein ? How many amino acids does it contain ?

2. A plant transcription factor TOE1 (TARGET OF EARLY ACTIVATION TAGGED 1) was recently suggested to regulate development of specialized organs called nodules in legume plants, e.g. in soybean. Using the TOE1 sequence from the model organism Arabidopsis thaliana (NP_001189625), search for similar proteins in soybean (Glycine max) using BLAST program. How many hits with Glycine max proteins are yielded by BLAST search ? What is the name of protein and accession number yielded as the best G.max hit ? Is its similarity to the TOE1 query significant ? Give the E-value of this alignment, the number of identical amino acids and the number of conservative substitutions (with positive scores).

3. The following DNA sequence fragment, containing some mutation, was isolated from a patient: tttgctccccgcgcgctgtttttctcagtgactttcagcgggcggaaaag

In what gene the mutation is located? On which ? How many nucleotides are changed?

4. A special option of BLAST for pairwise alignment of two sequences (bl2seq) is sometimes a quick way to determine similarity between two closely related sequences. (a) For instance, determine identity percentage in the alignment of two genomes of Zika virus: the one of an isolate from Suriname, October 2015 (accession KU312312) and the reference genome (NC_012532). - Also, using bl2seq for proteins, determine identities and positives in the alignment of polyproteins encoded by these two genomes. Locate the amino positions of gaps in the alignment: what are the residues inserted (or deleted) in the Suriname isolate as compared to the reference genome? - Are there insertions or deletions in the Suriname isolate polyprotein as compared to the polyprotein of Zika virus from a French Polynesia outbreak in 2013 (KJ776791) ?

(b) Is this BLAST option also optimal for the proteins of assignment 2: A.thaliana TOE1 (NP_001189625) and its homolog from G.max ? What is the alternative algorithm/program in this case ? Determine the percentage of similarity between these two proteins.

5. It has been reported that a transcript annotated as a long non-coding RNA in mouse genome encodes a peptide of 34 amino acids with the following sequence:

MAEKESTSPHLIVPILLLVGWIVGCIIVIYIVFF.

It was also suggested that a transcript annotated as a long non-coding RNA in human genome (Accession NR_037902) might also contain a small open reading frame (ORF) encoding similar peptide. Determine the nucleotide positions of this ORF in the human transcript, the sequence of the peptide and its length.

6. Using ENTREZ Gene database, determine the differences between alternative splicing isoforms of the human microtubule-associated protein tau (MAPT, GeneID 4137). How many exons are contained in the tau gene according to the RefSeqGene data? How many exons do alternative transcripts lack?

7. Calculate pairwise alignments of two homologous segments (PA-segments, Accessions CY046942 and EF626633) of influenza A and B viruses using different algorithms: (a) Make the global optimal alignment by needle algorithm (www.ebi.ac.uk/Tools/emboss/align/); (b) Using the option of BLAST for two sequences (bl2seq), align these two nucleotide sequences with blastn algorithm; (c) Use bl2seq again, but with tblastx: alignment of translated nucleotide sequences. What are the main differences between the alignment results? Try to explain the origins of these differences. What are the advantages/disadvantages of each of these approaches in this case?

32 Multiple sequence alignment, profile search:

8. Using Clustal Omega program, available among the tools at the EBI website (http://www.ebi.ac.uk/Tools/msa/ clustalo/), calculate a multiple alignment of five homologous Cas9 proteins from the following organisms: Streptococcus pyogenes (Accession NP_269215), Streptococcus thermophilus (WP_011225725), Staphylococcus aureus (CCK74173), Staphylococcus lugdunensis (WP_002460848) and Halalkalibacillus halophilus (WP_035512507). What is the length of the longest stretch of amino acid residues completely conserved in all 5 sequences ? Give this motif in single letter code. (NB. The easiest input for Clustal Omega is a FASTA file of sequences, prepared in advance. For this assignment, default parameters of Clustal Omega are sufficient)

9. Using the database of protein profiles PROSITE (http://prosite.expasy.org/), determine whether the amino acid sequence vkpklplipghegvgvieevgpgvt contains some consensus pattern. Give the motif description of this consensus pattern.

10. Using the database of protein profiles PROSITE (http://prosite.expasy.org/), determine the positions of conserved motifs (profiles) in one of human proteins involved in RNA splicing, 9G8 (Accession NP_001026854). Using the resources of ENTREZ system, determine how many exons are in the 9G8 gene, and which of these exons contain the sequences encoding the found motifs.

11. Using the database “Gene” of ENTREZ, retrieve the sequences of the shortest and longest isoforms of the human enzyme ADAR adenosine deaminase (GeneID 103). What are the lengths of the amino acid sequences of these isoforms? Using the database of protein profiles PROSITE (http://prosite.expasy.org/), determine the positions and the function of a conserved motif contained in the longer isoform, but absent in the shortest one.

12. Human gene ADAM15 (Gene ID:8751) contains 24 exons. Several protein isoforms are produced by alternative splicing that may lead to a shift in the translation frame. For instance, a variant encoded by transcript NM_207191.2 is annotated as lacking three exons in the coding region compared to the longest isoform (transcript NM_207197.2).

Identify nucleotide positions of the exons in the longest isoform, which are skipped in the shorter one. Which of these three exons causes a translation frameshift? What are the lengths of C-terminal amino acid sequences that are different in the encoded proteins? Using the database PROSITE (http://prosite.expasy.org/), determine what patterns can be located in these distinct C-termini (NB. in PROSITE search, in contrast to default option, do not exclude patterns with a high probability of occurrence).

Structure prediction:

13. A number of RNA-binding proteins contain a conserved RNA recognition motif (RRM) with a general structure consisting of two α-helices and four β-strands: β1-α1-β2-β3-α2-β4. Identify this structure in the human RNA-binding protein FUS (accession NP_004951). (a) In the protein database entry NP_004951, identify the location (amino acid positions) of the RRM domain in the "FEATURES" datafields (region RRM_FUS). (b) Using PSIPRED protein secondary structure prediction program (www.psipred.net), predict the secondary structure of this region.

Is the predicted secondary structure consistent with the motif description, mentioned above? What are the amino acid positions of the α-helices and β-strands ?

14. A sequence entry (Accession DJ048639) from the nucleotide sequence database is patented as a tRNA variant for introducing an artificial amino acid into a protein. Using the program mfold (http://mfold.rna.albany.edu/?q=mfold/RNA- Folding-Form ), test whether a stable tRNA “cloverleaf” structure (three hairpins flanked by a stem and a single-stranded CCA 3’-end) is predicted. What is the energy of the lowest free energy conformation in mfold prediction?

15. Some RNA molecules have alternative secondary structures with close values of free energy. For the sequence given below

ACAGGUUCGCCUGUGUUGCGAACCUGCGGGUUCG predict the folding, using the program mfold (http://mfold.rna.albany.edu/?q=mfold/RNA-Folding-Form). What is the free energy of the lowest energy conformation? Free energy of the structure 2? What is the main difference between these two structures? 33