
BioMath

Evolution By Substitution: Amino Acid Changes Over Time

Student Edition Funded by the National Science Foundation, Proposal No. ESI-06-28091

This material was prepared with the support of the National Science Foundation. However, any opinions, findings, conclusions, and/or recommendations herein are those of the authors and do not necessarily reflect the views of the NSF.

At the time of publishing, all included URLs were checked and active. We make every effort to make sure all links stay active, but we cannot make any guarantees that they will remain so. If you find a URL that is inactive, please inform us at [email protected].

DIMACS Published by COMAP, Inc. in conjunction with DIMACS, Rutgers University. ©2015 COMAP, Inc. Printed in the U.S.A.

COMAP, Inc. 175 Middlesex Turnpike, Suite 3B Bedford, MA 01730 www.comap.com

ISBN: 1 933223 75 8

Front Cover Photograph: EPA GULF BREEZE LABORATORY, PATHO-BIOLOGY LAB. LINDA SHARP ASSISTANT This work is in the public domain in the United States because it is a work prepared by an officer or employee of the United States Government as part of that person’s official duties.

Evolution By Substitution: Amino Acid Changes Over Time

DNA, or deoxyribonucleic acid, carries the code for life and that code directs the making of proteins that will carry out the organism’s functions. Proteins are made from twenty different amino acids and the number and order of those amino acids will determine the properties and function of the protein. Any alterations in the sequence of amino acids may have an effect on the function of the protein. The protein may not function as well, may lose all function, or may possibly function better. It is also possible that the substitution may not affect the function of the protein at all.

Mathematical analysis of similar proteins in different organisms based on the sequence of amino acids may give insight into their possible evolutionary history and perhaps even that of the organisms that contain those proteins. Such analysis may also lead to explanations of the mechanisms of evolution, which resulted in the natural selection of these proteins.

Unit Goals and Objectives

Goal: Students will experience the excitement of modern biology from both the biological and mathematical point of view.
Objectives:
- Relate DNA changes and resulting amino acid substitutions to evolution.
- Develop a deeper understanding of evolution through the study of amino acid substitution and multiplication.

Goal: Students will explore the connections between the mathematical and biological sciences.
Objectives:
- Identify the probability for single events.
- Relate the use of a matrix to the probability for compound events and for events repeated over time.
- Demonstrate a proficiency in multiplying two matrices together and raising a square matrix to a power.
- Understand the relationship between powers of a matrix and future evolutionary states.

Goal: Students will experience how mathematical modeling simulates theoretically behavior of a proposed system.
Objectives:
- Identify state diagrams and their properties.
- Construct a state diagram to describe changes in a system.

Evolution By Substitution Student 1

Lesson 1 Evolutionary Relationships

Man has grouped organisms based on physical similarities for hundreds of years. Scientists have used these similarities to determine evolutionary relationships among organisms. For example, a mouse and a rat have many characteristics in common, more than a mouse shares with a chicken. Based on these observations, a mouse and a rat are more closely related to each other than they are to a chicken; therefore, they share a more recent common ancestor. Again based on similarities, a mouse is more closely related to a chicken than it is to a fish. As more and more information is gathered, a tree can be drawn to show these relationships, as shown in Figure 1.1.

Figure 1.1: Portion of an evolutionary tree.

The study of evolutionary relationships raises many questions. Given two organisms, what is their evolutionary relationship? What was their common ancestor like? How long ago did they diverge from this common ancestor?

Biology Background

In all living organisms, DNA, or deoxyribonucleic acid, carries the code for life. The code determines the proteins that an organism’s cells will make, and proteins carry out the organism’s functions. Through a series of complex processes, a segment of DNA called a gene may be read and the message in the gene may be used to build a protein from building blocks called amino acids. There are a total of twenty amino acids used to build proteins in living things (see Table 1.1). A protein is a chain of amino acids whose properties are determined by its particular amino acid sequence. In summary, it is differences in DNA that result in different amino acid sequences.


Changing one amino acid in the sequence can produce a protein that does not function as well, or one that does not function at all. Occasionally the change produces a protein that functions better than its predecessor and improves the fitness of the organism. In such cases, natural selection will result in improved survival rates for the organisms with this protein. Ultimately, nature will determine which proteins function best given a particular environment.

Molecular biology today offers new ways to compare organisms. Proteins may be sequenced and compared giving a more detailed comparison of organisms. Today’s scientist can look for evolutionary relationships based on the sequence of amino acids in proteins rather than looking at bone structure or type of teeth. This unit uses mathematics to examine changes over time in the amino acids that make up proteins.

Making the BioMath Connection

There are twenty amino acids that may be coded for in DNA. Amino acids are all alike in that they have 3 common parts: an amino group (NH2), a carboxyl group (COOH), and an R group all attached to a central carbon. Figure 1.2 shows the characteristic makeup of an amino acid.

Figure 1.2: Characteristics of an amino acid.

The Amino Acid table on the next page shows each of the amino acids with its unique R group. R groups have different chemical properties. For example, an R group may make an amino acid polar (charged positive or negative) or nonpolar (uncharged), hydrophobic (repelled by water) or hydrophilic (attracted to water). These differences impact the overall functioning of the protein and how it will fold when made from a long string of amino acids with different properties.

Some substitutions in amino acids will have greater effects than others. A change from a polar amino acid to another polar amino acid will not affect the protein as much as a change to a nonpolar amino acid. If the change is too great and the protein does not function, then nature would select against that change. One can begin to see how the probability of some selected changes may be greater than other selected changes.

The table also shows the common abbreviations used for each amino acid. For example, alanine can be abbreviated ‘Ala’ or represented by the capital letter A. Arginine is abbreviated ‘Arg’, but is represented by the letter R since A has already been used. This use of letters is universal and recognized by scientists.

Table 1.1: Amino Acids


When two amino acid sequences are compared, it is possible to consider how recently they shared a common ancestor. Mathematically, two sequences can be aligned to determine their evolutionary relationship. Amino acids are denoted by a capital letter.

The two amino acid sequences below illustrate an alignment between two growth hormone proteins. The top is the partial protein from a domesticated cat and the bottom is from a domesticated dog.

MAAGPRNSVLLAFALLCLPWPQEVGTFPAMPLSSLFANAVLRAQHLHQLAADTYKEFERA

MAASPRNSVLLAFALLCLPWPQEVGAFPAMPLSSLFANAVLRAQHLHQLAADTYKEFERA

The alignment below is for the same partial protein from a domesticated chicken and a domesticated dog. As might be expected a dog and a cat share more common amino acids than the dog does with a chicken. If scores were being assigned to show their commonalities, then the dog and cat protein alignment would receive a higher score.

MAPGSWFSPLL-IAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFER

MAASPRNSVLLAFALLCLPWPQEVGA-FPAMPLSSLFANAVLRAQHLHQLAADTYKEFER
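One simple way to assign such a score is to count the aligned positions where both sequences show the same amino acid. The sketch below does only that. The function name `identity_score` and the rule that a gap (`-`) never counts as a match are assumptions made for illustration; they are not a scoring scheme given in this unit.

```python
def identity_score(seq_a, seq_b):
    """Count aligned positions where both sequences have the same amino acid.

    Assumption for this sketch: a gap character '-' never counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")


# First alignment from the text: cat (top) and dog (bottom).
cat = "MAAGPRNSVLLAFALLCLPWPQEVGTFPAMPLSSLFANAVLRAQHLHQLAADTYKEFERA"
dog_vs_cat = "MAASPRNSVLLAFALLCLPWPQEVGAFPAMPLSSLFANAVLRAQHLHQLAADTYKEFERA"

# Second alignment from the text: chicken (top) and dog (bottom), with gaps.
chicken = "MAPGSWFSPLL-IAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFER"
dog_vs_chicken = "MAASPRNSVLLAFALLCLPWPQEVGA-FPAMPLSSLFANAVLRAQHLHQLAADTYKEFER"

cat_dog_score = identity_score(cat, dog_vs_cat)
chicken_dog_score = identity_score(chicken, dog_vs_chicken)

print("cat/dog matches:    ", cat_dog_score, "of", len(cat))
print("chicken/dog matches:", chicken_dog_score, "of", len(chicken))
```

Because both alignments have the same length, the two counts can be compared directly. For alignments of different lengths, dividing each count by the alignment length would give a fairer comparison.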

Differences in like (homologous) proteins are the result of mutations in the DNA of a common, but perhaps unknown, ancestor. As seen in the first alignment, the amino acid in the 4th position was replaced by another amino acid, but the substitution allowed for functionality of the protein since cats and dogs do quite fine with their growth hormones.

In this mathematical model of evolution by examination of amino acids, the assumption is made that amino acids change independently of each other. One evolutionary unit (e.u.) is the average amount of time it takes for 1% of the amino acids to change. Suppose that over a period of one e.u., 3 out of every 1000 amino acids V change into amino acid R. This is denoted as a probability:

P(V changes into R) = 3/1000 = .003.

Generally we talk about probabilities of events so if E is the event that V changes to R then we say that P(E) = .003. Also, the chance that a certain change takes place is the probability of its occurrence, often stated as a percentage. In this case .003 becomes 0.3%.

As stated earlier, substitutions of dissimilar amino acids are less likely to be acceptable. Based on these ideas a 20 x 20 matrix can be assembled based upon the probabilities that the substitutions can occur over time. Margaret Dayhoff developed just such a substitution data matrix in 1978.


Rows: From Original Amino Acid. Columns: To Replacement Amino Acid. (For a given row, the cell entries give all possible probabilities for that amino acid to change, so each row should add up to 1.)

from\into   A      R      N      D      C      Q      E      G      H      I      L      K      M      F      P      S      T      W      Y      V
Ala A     .9867  .0001  .0004  .0006  .0001  .0003  .0010  .0021  .0001  .0002  .0003  .0002  .0001  .0001  .0013  .0028  .0022  .0000  .0001  .0013
Arg R     .0002  .9913  .0001  .0000  .0001  .0009  .0000  .0001  .0008  .0002  .0001  .0037  .0001  .0001  .0005  .0011  .0002  .0002  .0000  .0002
Asn N     .0009  .0001  .9822  .0042  .0000  .0004  .0007  .0012  .0018  .0003  .0003  .0025  .0000  .0001  .0002  .0034  .0013  .0000  .0003  .0001
Asp D     .0010  .0000  .0036  .9859  .0000  .0005  .0056  .0011  .0003  .0001  .0000  .0006  .0000  .0000  .0001  .0007  .0004  .0000  .0000  .0001
Cys C     .0003  .0001  .0000  .0000  .9973  .0000  .0000  .0001  .0001  .0002  .0000  .0000  .0000  .0000  .0001  .0011  .0001  .0000  .0003  .0003
Gln Q     .0008  .0010  .0004  .0006  .0000  .9876  .0035  .0003  .0020  .0001  .0006  .0012  .0002  .0000  .0008  .0004  .0003  .0000  .0000  .0002
Glu E     .0017  .0000  .0006  .0053  .0000  .0027  .9865  .0007  .0001  .0002  .0001  .0007  .0000  .0000  .0003  .0006  .0002  .0000  .0001  .0002
Gly G     .0021  .0000  .0006  .0006  .0000  .0001  .0004  .9935  .0000  .0000  .0001  .0002  .0000  .0001  .0002  .0016  .0002  .0000  .0000  .0003
His H     .0002  .0010  .0021  .0004  .0001  .0023  .0002  .0001  .9912  .0000  .0004  .0002  .0000  .0002  .0005  .0002  .0001  .0000  .0004  .0003
Ile I     .0006  .0003  .0003  .0001  .0001  .0001  .0003  .0000  .0000  .9872  .0022  .0004  .0005  .0008  .0001  .0002  .0011  .0000  .0001  .0057
Leu L     .0004  .0001  .0001  .0000  .0000  .0003  .0001  .0001  .0001  .0009  .9947  .0001  .0008  .0006  .0002  .0001  .0002  .0000  .0001  .0011
Lys K     .0002  .0019  .0013  .0003  .0000  .0006  .0004  .0002  .0001  .0002  .0002  .9926  .0004  .0000  .0002  .0007  .0008  .0000  .0000  .0001
Met M     .0006  .0004  .0000  .0000  .0000  .0004  .0001  .0001  .0000  .0012  .0045  .0020  .9874  .0004  .0001  .0004  .0006  .0000  .0000  .0017
Phe F     .0002  .0001  .0001  .0000  .0000  .0000  .0000  .0001  .0002  .0007  .0013  .0000  .0001  .9946  .0001  .0003  .0001  .0001  .0021  .0001
Pro P     .0022  .0004  .0002  .0001  .0001  .0006  .0003  .0003  .0003  .0000  .0003  .0003  .0000  .0000  .9926  .0017  .0005  .0000  .0000  .0003
Ser S     .0035  .0006  .0020  .0005  .0005  .0002  .0004  .0021  .0001  .0001  .0001  .0008  .0001  .0002  .0012  .9840  .0032  .0001  .0001  .0002
Thr T     .0032  .0001  .0009  .0003  .0001  .0002  .0002  .0003  .0001  .0007  .0003  .0011  .0002  .0001  .0004  .0038  .9871  .0000  .0001  .0010
Trp W     .0000  .0008  .0001  .0000  .0000  .0000  .0000  .0000  .0001  .0000  .0004  .0000  .0000  .0003  .0000  .0005  .0000  .9976  .0002  .0000
Tyr Y     .0002  .0000  .0004  .0000  .0003  .0000  .0001  .0000  .0004  .0001  .0002  .0001  .0000  .0028  .0000  .0002  .0002  .0001  .9945  .0002
Val V     .0018  .0001  .0001  .0001  .0002  .0001  .0002  .0005  .0001  .0033  .0015  .0001  .0004  .0000  .0002  .0002  .0009  .0000  .0001  .9901

Table 1.2: Adapted from Margaret Dayhoff’s 1978 mutation data matrix for 1 PAM.


For this discussion, the assumption is made that the rate of change remains constant, but that is not always the case.

Different proteins will have different rates of substitutions of their amino acids depending on their function and how harmful a change may be to that function. As a result, one protein may have an e.u. of 5 million years and another may have an e.u. of 50 million years. Table 1.2 is based upon the average of many different proteins.

Pioneer of Bioinformatics

Dr. Margaret Dayhoff (1925-1983) was a pioneer in the use of computers in chemistry and biology, beginning with her PhD thesis project in 1948. Her work was multi-disciplinary, and she used her knowledge of chemistry, mathematics, biology and computer science to develop an entirely new field. She is credited today as a founder of the field of Bioinformatics. This field is defined as the use of computers in solving information problems in the life sciences, mainly involving the creation of extensive electronic databases on protein sequences and genomes. Dr. Dayhoff was the first woman in the field of Bioinformatics. She was also the first woman to hold office in the Biophysical Society, serving first as Secretary and later as President. (Source: www.dayhoff.cc/)

Questions for Discussion

1. Where in each row does the highest probability occur in the substitution matrix (Table 1.2)? Explain what the information tells us about amino acid evolution.

2. The probabilities of an amino acid not changing are given along the diagonal. a. What amino acid has the greatest chance of not changing after one evolution unit? Explain how your answer is represented in the matrix.

b. Hypothesize why this amino acid is not likely to be substituted by another amino acid.


ACTIVITY 1-1 Using the Substitution Matrix

Objective: Use the substitution matrix to examine amino acid relationships and determine probabilities of substitutions.
Participants: Groups of 2-3 students
Materials:
- Handout ES-H1 Substitution Matrix
- Handout ES-H3 Using the Substitution Matrix Worksheet

1. Examine the substitution matrix. a. Determine the most common amino acid substitution.

b. What is the probability that this substitution will occur?

c. Look at the amino acid table and compare the two amino acids in this substitution. Why is this change so probable?

2. The probability of amino acid D changing into N is 0.0036 in one e.u. a. Out of 1000, what is the number of D’s expected to change into N after one e.u.?

b. The probability of S changing into N is 0.0020 in one e.u. Out of 1000, what is the number of S’s expected to change into N after one e.u.?

c. Out of 1000 amino acids, which are half D and half S, how many N’s should be expected after one e.u.?

3. To actually calculate the probability of a certain change in a protein, it is necessary to know the probability of each amino acid changing into each of the other amino acids. There are 400 probabilities! Why are there 400 of them?

4. In one e.u., what is the probability that: a. A changes into D? b. K changes into R? c. M does not change?

If A and B are disjoint events (cannot happen at the same time), then the probability of A or B is P(A or B) = P(A) + P(B).

5. Assume that in one e.u. amino acid substitutions are disjoint events. What is the probability that: a. N changes into S or T?

b. A changes into S or R?

6. What is the probability that L does change?


7. What is the probability of C changing into F?

8. After one e.u., does R have a higher probability of changing into a Q or a K?

9. What amino acids will C least likely change into?

Possible Amino Acid Substitutions

Notice the sum of probabilities in each of the first three rows of the substitution matrix equals 1. Should this property hold for all the rows in the matrix? Why?

All rows should sum up to 1. Each amino acid can only change to one new amino acid at the end of one evolution unit, thus the sum of the probabilities cannot exceed 1. On the other hand, an amino acid will either change into one of the other 19 amino acids or not change and stay the same. This will exhaust all the possibilities, thus the sum is exactly 1. If you compute all rows you will notice that most of them do not add up to exactly 1. This is due to the fact that our accuracy is four decimal places. We took the liberty of rounding the first five rows, so that the sum is exactly 1.
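The row-sum property can be checked with a few lines of code. The sketch below uses the row for alanine (A) as read from Table 1.2. Storing each probability as a whole number of ten-thousandths (so 0.9867 becomes 9867) is a choice made here to avoid decimal round-off in the sum; the names `ROW_FROM_A` and `row_total` are illustrative only.

```python
# Row "from A" of Table 1.2, written in ten-thousandths (0.9867 -> 9867).
# Whole numbers are an assumption of this sketch, used so the sum is exact.
ROW_FROM_A = {
    "A": 9867, "R": 1, "N": 4, "D": 6, "C": 1, "Q": 3, "E": 10, "G": 21,
    "H": 1, "I": 2, "L": 3, "K": 2, "M": 1, "F": 1, "P": 13, "S": 28,
    "T": 22, "W": 0, "Y": 1, "V": 13,
}


def row_total(row):
    """Add up every entry in one row of the substitution matrix."""
    return sum(row.values())


total = row_total(ROW_FROM_A)
print("entries in the row:", len(ROW_FROM_A))
print("row total as a probability:", total / 10000)
```

The same function can be applied to any other row of the table to see whether rounding to four decimal places leaves that row slightly above or below 1.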

Questions for Discussion

3. Amino Puzzle. a. Find three examples of pairs of amino acids that have the following property: the probability of the first changing into the second, after one e.u., equals the probability of the second changing into the first after one e.u.

b. Hypothesize why this is true.

4. Consider the following matrix as a substitution matrix in a world with only five amino acids. Describe the “evolution” that takes place in this world.

from\into   Ala A   Arg R   Asn N   Asp D   Cys C
Ala A         1       0       0       0       0
Arg R         0       1       0       0       0
Asn N         0       0       1       0       0
Asp D         0       0       0       1       0
Cys C         0       0       0       0       1

If A and B are independent events, then the probability of A and B is: P(A and B) = P(A) • P(B).

5. Given 1000 G amino acids, can the number of H amino acids expected after two evolution units be determined? Assume substitutions in the first and second e.u. are independent events.


Practice

1. Referring to the substitution matrix, state the following probabilities: a. A changing into Q after one e.u.

b. K changing into L after one e.u.

c. Q changing to E or H after one e.u.

d. V not changing after one e.u.

e. Q changing after one e.u.

2. Which is more likely to occur after one e.u.: A changing into I or I changing into A?

Extension

DNA is made of four different bases: A, T, C, and G. Combinations of three of these bases, called triplets, code for the 20 different amino acids or signal the end of the protein. The following chart depicts which DNA triplets code for each amino acid. Use this chart and the substitution matrix to answer the following questions.

Amino Acid           DNA Triplet
Alanine (A)          CGG, CGT, CGC, CGA
Arginine (R)         TCT, TCC, GCA, GCT, GCG, GCC
Asparagine (N)       TTG, TTA
Aspartic Acid (D)    CTG, CTA
Cysteine (C)         ACG, ACA
Glutamic Acid (E)    CTT, CTC
Glutamine (Q)        GTT, GTC
Glycine (G)          CCT, CCG, CCC, CCA
Histidine (H)        GTG, GTA
Isoleucine (I)       TAT, TAG, TAA
Leucine (L)          AAT, AAC, GAT, GAG, GAC, GAA
Lysine (K)           TTT, TTC
Methionine (M)       TAC
Phenylalanine (F)    AAG, AAA
Proline (P)          GGT, GGG, GGC, GGA
Serine (S)           AGT, AGG, AGC, AGA, TCG, TCA
Threonine (T)        TGT, TGG, TGC, TGA
Tryptophan (W)       ACC
Tyrosine (Y)         ATG, ATA
Valine (V)           CAT, CAG, CAC, CAA
STOP                 ATT, ATC, ACT

1. Amino acids are rarely substituted with tryptophan. Does the fact that only one DNA triplet codes for tryptophan support this? Explain.

2. Determine if the fact that only one triplet codes for tryptophan affects this rate. Compare the rate of substitution with tryptophan to the rate of substitution with methionine, which also has only one coding triplet. How do the two compare? Does this comparison support the idea that substitutions with tryptophan are rare because it has only one coding triplet?

3. Amino acids are more commonly substituted with serine. Do the number of DNA triplets that code for serine support this? Explain.

4. Arginine has the same number of DNA triplets that code for it as serine does. Is it substituted as often as serine?

5. Based on the answers to Questions 1–4, does the number of triplets coding for an amino acid appear to be the main factor contributing to its rate of substitution? If not, what other factor(s) may be important in determining the rate of substitution?


Lesson 2 Multiple Substitutions

In Lesson 1, the problem of monitoring how amino acids change in species was introduced. The natural way to describe that change is by using probabilities, numbers between 0 and 1 that measure how likely (or unlikely) it is that a particular substitution will occur during one e.u. For example, P(D→A) = 0.001 means that “there is a 1 in 1000 chance that DNA changes will result in alanine (A) being substituted for aspartic acid (D).” It is also convenient to store those numbers in a two-way table called a matrix, like the data contained in Table 1.2. In that matrix, the rows represent the original amino acid and the columns represent the new amino acid resulting from a substitution.

In nature, these transitions do not take place one at a time. With twenty amino acids all capable of turning into twenty different amino acids, calculations can be quite difficult. It turns out that matrices are also useful for handling the specific kind of calculations needed for making predictions. This lesson will examine all the possibilities at one time, help you understand exactly what calculation is being made, and allow you to learn how to use a calculator to do the work.

Examining a Smaller Problem

In modeling complex situations, it is sometimes convenient to pretend that a simpler set of conditions is present. For now, make the following simplifying assumption:

“There are only 5 kinds of amino acids: alanine (A), arginine (R), asparagine (N), aspartic acid (D) and one other kind. Consider the other sixteen amino acids to be a single category called ‘other’ (O).”

With only 5 kinds of amino acids we can investigate how substitutions occur over e.u. time periods.

ACTIVITY 2-1 Investigating Substitutions

Objective: Investigate substitutions over e.u. time periods.
Materials:
- Handout ES-H4 Investigating Substitutions Worksheet
- Handout ES-H1 Substitution Table

1. Create a matrix describing the various probabilities relative to the simplified version of this problem. Set up a matrix like what is shown in Table 2.1. a. Assign the appropriate probabilities for A to be replaced by A, R, N and D, in their correct locations in your matrix.


                       to replacement
                   A      R      N      D      O
from original  A
               R
               N
               D
               O

Table 2.1: Substitution matrix for simplified problem.

b. Keep in mind that our simplified model uses O to represent any one of the other sixteen amino acids that appear at the end of the first row from Table 1.2. What is P(A→O)? How is that answer calculated? (Be sure to record that answer in your matrix.)

c. The sum of all the numbers across the row should be ‘1’. Why does that make sense in the context of changing amino acids?

d. In a similar fashion, fill in the entries for Rows 2, 3 and 4 of your matrix, where the original amino acid was R, N or D.

e. The sum of all the numbers in any column is not ‘1’, since each row refers to a different initial condition. However, it is still possible to determine P(O→A), P(O→R), P(O→N) and P(O→D) from the information in Table 1.2, and you can use the property that each row must add up to 1 to find the final entry, P(O→O). Fill in the entries for Row 5 of your matrix.

Labeling and Reading a Matrix

It is useful to give a matrix a name, so it can be easily referred to. Rather than a long name like “substitution matrix”, a single bold capital letter is used. For example, the substitution matrix in Table 1.2 will be named S. In referring to individual entries in a matrix, the matrix name and two subscript numbers, one for the row followed by one for the column are used. For example, S3 1 or S3,1 has the value 0.0009. Some calculators might require syntax such as [S](3,1).
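In a programming language, the same row-then-column lookup usually counts from 0 rather than from 1. The sketch below stores only the first three rows and first four columns of S (the values for A, R, N and D) and wraps the lookup in a helper so that the textbook numbering still works. The names `S_PART` and `entry` are made up for this example.

```python
# Part of the substitution matrix S: rows "from" A, R, N and
# columns "into" A, R, N, D. Only a corner of S is stored here.
S_PART = [
    [0.9867, 0.0001, 0.0004, 0.0006],  # row 1: from A
    [0.0002, 0.9913, 0.0001, 0.0000],  # row 2: from R
    [0.0009, 0.0001, 0.9822, 0.0042],  # row 3: from N
]


def entry(matrix, row, col):
    """Return the entry in the given row and column, counting from 1.

    Python lists count from 0, so 1 is subtracted from each index.
    """
    return matrix[row - 1][col - 1]


print("S3,1 =", entry(S_PART, 3, 1))  # row 3 (from N), column 1 (into A)
print("S1,3 =", entry(S_PART, 1, 3))  # row 1 (from A), column 3 (into N)
```

Swapping the two subscripts generally gives a different number, because the row always names the original amino acid and the column names the replacement.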

Questions for Discussion

The matrix created in Activity 2-1 will be named B.

1. What is the value of B2 5 and the value of B5 2?

2. Which is bigger: B1 1 or B3 3?

3. What is: (B4 3)(B1 5)?


State Diagram

It is sometimes useful to look at something in a different way, either to understand it better or to provide an alternative explanation. Scientists use a picture called a state diagram to visualize the various possibilities for an event. Figure 2.1 contains a portion of the work recorded in matrix B – the part that affects amino acid A:

[State diagram: amino acid A with a curved arrow back to itself labeled 0.9867, and arrows leaving A toward R, N, D and O.]

Figure 2.1: Incomplete state diagram.

In Figure 2.1 the curved arrow and its associated number (0.9867) represent the situation in which an alanine (A) molecule remains an alanine molecule; in other words, no substitution has taken place. The number comes from the fact that P(A→A) = 0.9867.

Think of Figure 2.1 as describing, “What happened to alanine (A)?” since the arrows are going away from A in the diagram. If you start with 50,000 (A), how many A’s, R’s, N’s, D’s, and O’s can be expected at the end of one e.u.?

This is calculated by multiplying the initial number of molecules by the probability along each path.
(50000)(0.9867) = 49335 molecules of A would remain.
(50000)(0.0001) = 5 molecules of R would be formed.
(50000)(0.0004) = 20 molecules of N would be formed.
(50000)(0.0006) = 30 molecules of D would be formed.
(50000)(0.0122) = 610 molecules of A would become something else other than R, N or D.

Note the sum of new molecules is 49335+5+20+30+610 = 50000.
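The same bookkeeping can be written as a short program. This is only a sketch of the calculation above; the names `START_A` and `PATHS_FROM_A` are chosen for this example, and `round` is used because decimal arithmetic on a computer can be off by a tiny amount.

```python
# Number of alanine (A) molecules at the start.
START_A = 50000

# Probabilities on the arrows leaving A in the state diagram.
PATHS_FROM_A = {"A": 0.9867, "R": 0.0001, "N": 0.0004, "D": 0.0006, "O": 0.0122}

# Expected molecules of each kind after one e.u.: start count times probability.
expected = {target: round(START_A * p) for target, p in PATHS_FROM_A.items()}

for target, count in expected.items():
    print(target, count)
print("total:", sum(expected.values()))
```

Changing `START_A` shows how the expected counts scale: the probabilities stay the same, and only the starting amount changes.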


ACTIVITY 2-2 Calculating Resulting Amino Acids

Objective: Calculate amino acids resulting from substitutions.
Materials:
- Handout ES-H5: Calculating Resulting Amino Acids Worksheet
- Handout ES-H4: Investigating Substitutions (completed)

Now, instead of thinking about the various amino acids that might be substituted for alanine (A), reverse the process and consider the ways in which a substitution might result in A.

1. Draw a state diagram to describe this situation; attach probabilities for each path.

2. Suppose you start with 50000 molecules of A, 40000 of R, 30000 of N, 20000 of D and 10000 of O. Write a single expression that will compute the number of molecules of A expected after one e.u.

3. How many molecules of A can be expected after one e.u.?

Start again with 50000 molecules of A, 40000 of R, 30000 of N, 20000 of D and 10000 of O. (Note: from a biological point of view, these values are unreasonably large, but are assumed true in order to create whole-number answers for the computation.) Consider the matrix A = [50000 40000 30000 20000 10000]. Notice that A is a 1x5 matrix and contains the number of each amino acid present in our current situation. The order in which numbers appear in this matrix must be the same as the order the amino acids appear in the substitution matrix used (Table 2.1, Matrix B) – A, R, N, D, and O respectively.

4. Use matrices A and B to answer the following questions. a. Write a single expression that will compute how many molecules of R are expected after one e.u. (Hint: drawing a new state diagram may help.)

b. How many molecules of R are expected after one e.u.?

c. Which amino acid, A or R, underwent more substitutions during that time period? Explain.

Matrix Multiplication

In working through Question 4, you have actually done part of a matrix multiplication. The matrix multiplication problem A • B is below:

$$
\begin{bmatrix} 50000 & 40000 & 30000 & 20000 & 10000 \end{bmatrix}
\bullet
\begin{bmatrix}
0.9867 & 0.0001 & 0.0004 & 0.0006 & 0.0122 \\
0.0002 & 0.9913 & 0.0001 & 0.0000 & 0.0084 \\
0.0009 & 0.0001 & 0.9822 & 0.0042 & 0.0126 \\
0.0010 & 0.0000 & 0.0036 & 0.9859 & 0.0095 \\
0.0180 & 0.0069 & 0.0092 & 0.0083 & 0.9576
\end{bmatrix}
$$

In working out previous calculations, it is easier to imagine the process of multiplying the two matrices. The first two steps are shown below:

Row 1 and Column 1:
50000 • 0.9867 + 40000 • 0.0002 + 30000 • 0.0009 + 20000 • 0.0010 + 10000 • 0.0180 = 49570

Row 1 and Column 2:
50000 • 0.0001 + 40000 • 0.9913 + 30000 • 0.0001 + 20000 • 0.0000 + 10000 • 0.0069 = 39729

The numbers in A are paired with the numbers in the first column of B and multiplied. These 5 products are added to get the expected number for each amino acid. This process is repeated for all 5 amino acids (all 5 columns of B). The results represent the expected number of each amino acid after one e.u. and are recorded in a matrix.

A • B = 49570 39729 ??? ??? ???

One reason for using matrices is to organize information. For that purpose, there is no difference between tables and matrices. Matrices allow certain types of computations, but the procedures for these computations can be laborious. The bigger the matrix, the more helpful technology is.
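As one example of such technology, the row-times-column procedure can be sketched in a few lines of Python. The function name `vector_times_matrix` is made up for this example, and the matrix entries are the ones shown for B above; this is a plain illustration of the pairing-and-adding process, not a replacement for the calculator or spreadsheet steps in this lesson.

```python
def vector_times_matrix(vector, matrix):
    """Multiply a row of counts by a matrix, one column at a time.

    Each number in the row is paired with the number in the same position
    of a column, the pairs are multiplied, and the products are added.
    """
    if len(vector) != len(matrix):
        raise ValueError("entries in the row must match the number of matrix rows")
    n_cols = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(n_cols)
    ]


# Starting counts of A, R, N, D and O.
A = [50000, 40000, 30000, 20000, 10000]

# Substitution matrix B for the simplified five-category problem.
B = [
    [0.9867, 0.0001, 0.0004, 0.0006, 0.0122],
    [0.0002, 0.9913, 0.0001, 0.0000, 0.0084],
    [0.0009, 0.0001, 0.9822, 0.0042, 0.0126],
    [0.0010, 0.0000, 0.0036, 0.9859, 0.0095],
    [0.0180, 0.0069, 0.0092, 0.0083, 0.9576],
]

# Round to whole molecules; tiny decimal errors can appear on a computer.
expected_counts = [round(x) for x in vector_times_matrix(A, B)]
print(expected_counts)
```

The first two entries of the printed list should agree with the hand calculations for Column 1 and Column 2.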

Matrix Technology Notes

Matrix multiplication operations are much easier to accomplish with the help of technology. You can use either a calculator or a spreadsheet program to assist you.

Graphing Calculator
Most graphing calculators have matrix operations as a built-in function. For example, the TI-84 Plus calculator accesses all matrix operations by pressing the 2nd and x⁻¹ keys in sequence. Notice that by pressing the 2nd key you are using the Matrix (or Matrx) command above the x⁻¹ key to access the Matrix menus. The steps involved may include the following:
- Create the matrix first (or modify an existing one).
- Select the ‘EDIT’ menu, then choose the matrix with which to work.
- Specify the dimensions (rows and columns) for the matrix.
- Type in the specific entries for the matrix by highlighting the appropriate cell, typing the number and pressing enter.

From the Home-Screen, use the matrix names in calculation expressions like [A]*[B]^5. You generate the matrix name on the Home Screen from the same menu (use the 2nd and x-1 keys to get there) by pressing a number key when the “NAMES” menu is highlighted.

Spreadsheets
Spreadsheet programs are also useful for matrix multiplication. The programs may refer to the matrices as arrays. For example, if you enter matrix B as a 5 × 5 array in J3:N7, the matrix B² is created using the command MMULT (matrix multiplication). In order to do this, you would highlight a 5 × 5 set of cells (where the result will appear), and type the command: =MMULT(J3:N7, J3:N7) and then press CTRL + SHIFT + ENTER to execute the instruction on the entire matrix at one time.

In a similar fashion, successive powers of the substitution matrix can be built to display (or view) the transition probabilities over more e.u.’s.
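For readers who prefer a programming language to a calculator or spreadsheet, successive powers can be sketched as repeated multiplication. The function names `mat_mul` and `mat_pow` are made up for this example, and B is the five-category substitution matrix used earlier in this lesson.

```python
def mat_mul(x, y):
    """Multiply two matrices stored as lists of rows."""
    if len(x[0]) != len(y):
        raise ValueError("columns of the first must match rows of the second")
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def mat_pow(matrix, power):
    """Raise a square matrix to a whole-number power of 1 or more."""
    if power < 1:
        raise ValueError("power must be at least 1")
    result = matrix
    for _ in range(power - 1):
        result = mat_mul(result, matrix)
    return result


B = [
    [0.9867, 0.0001, 0.0004, 0.0006, 0.0122],
    [0.0002, 0.9913, 0.0001, 0.0000, 0.0084],
    [0.0009, 0.0001, 0.9822, 0.0042, 0.0126],
    [0.0010, 0.0000, 0.0036, 0.9859, 0.0095],
    [0.0180, 0.0069, 0.0092, 0.0083, 0.9576],
]

B2 = mat_pow(B, 2)  # transition probabilities over two e.u.'s
row_sums = [sum(row) for row in B2]

for row in B2:
    print([round(p, 4) for p in row])
print("row sums:", [round(s, 4) for s in row_sums])
```

Each row of a power of B should still add up to 1 (apart from tiny decimal round-off), since an amino acid must end up as something after any number of e.u.'s.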

Questions for Discussion

Using a calculator or spreadsheet that is able to perform matrix computations, create a matrix A that contains the original number of amino acids and a substitution matrix B, as shown in Table 2.1.

4. Interpret what the values of A1,4 and B1,4 mean for this problem.

5. Why does it make no sense to ask for an interpretation of the value of A4,1 here?

6. Using available technology (calculator or spreadsheet), multiply A • B. (Be careful; the order makes a big difference!) How many of each of the N, D and O molecules are expected after one e.u.?

7. What happens if the order is reversed and you attempt to find B • A?

In order to do matrix multiplication, the number of columns in the first matrix must match the number of rows in the second matrix. The product is another matrix that has the same number of rows as the first matrix, and the same number of columns as the second matrix. This can be summed up easily, using the notation m × n to describe the dimensions of a matrix with m rows and n columns. In that case, matrix multiplication is possible when the matrix of dimension m × n is multiplied by a matrix of dimension n × p. The dimension of the answer (product) is m × p.
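The dimension rule can be written as a tiny check. This is an illustrative sketch only; the function name `product_shape` is hypothetical and simply restates the rule above.

```python
def product_shape(first, second):
    """Return the (rows, columns) of first x second, or None if undefined.

    Each argument is a (rows, columns) pair.
    """
    m, n = first
    n2, p = second
    if n != n2:
        return None          # inner dimensions do not match
    return (m, p)            # outer dimensions give the product's size


print(product_shape((1, 5), (5, 5)))   # (1, 5)
print(product_shape((5, 5), (1, 5)))   # None
print(product_shape((2, 3), (3, 2)))   # (2, 2)
print(product_shape((3, 2), (2, 3)))   # (3, 3)
```

The first two calls show why the order of multiplication matters: a 1 × 5 matrix times a 5 × 5 matrix is defined, but the reverse order is not.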

Practice

1. A specialist is interested in tracking the number of a particular amino acid (G) present in a sample that contains 6000 molecules of G and 94000 molecules of something else (O). (Refer to Table 1.2.)

a. How many rows and columns does the specialist’s substitution matrix need?

b. What substitution matrix describes this situation?

c. What does the complete state diagram look like?

d. How many molecules of the G amino acid are expected after 1 e.u.?

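2. Perform the following matrix multiplications.

a. $\begin{bmatrix} 10 & 20 & 30 \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.5 \\ 0.3 \end{bmatrix}$

b. $\begin{bmatrix} 20 & 10 \end{bmatrix} \begin{bmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{bmatrix}$

c. $\begin{bmatrix} 10 & 20 \end{bmatrix} \begin{bmatrix} 0.5 & 0.2 & 0.4 \\ 0.5 & 0.8 & 0.6 \end{bmatrix}$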

3. Use the numbers 1, 2, 3, 4, 5, 6 (in order from left to right, and from top to bottom) to create a 2 × 3 matrix called A, and a 3 × 2 matrix called B.

a. What is $a_{2,2} \times b_{2,1}$?

b. For this example, what is AB?

c. What is BA?

d. Does the commutative property hold for matrix multiplication? In other words, does AB give you the same answer as BA? Explain.

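4. Perform the given multiplication operations and record each answer.

a. $\begin{bmatrix} 100 & 300 & 600 \end{bmatrix} \begin{bmatrix} 0.1 & 0.8 & 0.1 \\ 0.3 & 0.1 & 0.6 \\ 0.6 & 0.1 & 0.3 \end{bmatrix}$

b. $\begin{bmatrix} 2 & 5 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$

c. $\begin{bmatrix} 4 & 3 & 2 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 3 & 4 & 1 \\ 3 & 4 & 1 & 2 \\ 4 & 1 & 2 & 3 \end{bmatrix}$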

Extension

1. In Discussion Question 6, the number of each amino acid expected after 1 e.u. was computed, starting with [50000 40000 30000 20000 10000] molecules of each amino acid. The result is: [49570 39729 29654 19957 11090]. What happens if the time interval is extended one more e.u.? (Hint: Begin with [49570 39729 29654 19957 11090]. Use the 5 × 5 substitution matrix B one more time, and record the new numbers expected after the second e.u.)

2. Biologists have created three categories of coyote: pup, yearling and adult. The following matrix describes the probability of change in a certain coyote population over one year:

$P = \begin{bmatrix} 0.2 & 0.4 & 0 \\ 0 & 0.4 & 0.4 \\ 0 & 0 & 0.8 \end{bmatrix}$

a. Explain what $P_{2,3}$ means for this problem.

b. If the categories ‘pup’, ‘yearling’ and ‘adult’ are distinct (so you cannot be more than one of them at a time), and are based on age and maturity, explain why some of the entries in the matrix have value 0.

c. What would be a realistic reason (within the context of this problem) why $P_{3,3}$ does not have a value of 1?

d. Would you expect each of the rows of this matrix to add up to 1? Explain.

3. Smallville is made up of three separate (and smaller) regions: OldTown, DownTown and NewTown. In any given year:
- 3% of the people living in OldTown move to DownTown.
- 8% of the people living in OldTown move to NewTown.
- 1% of the people living in DownTown move to OldTown.
- 6% of the people living in DownTown move to NewTown.
- Once someone lives in NewTown, they never want to move away from there.

a. Draw a state diagram describing the movement of Smallville’s population during a one-year time period.

b. Create a matrix that describes the movement of Smallville’s population during a one-year time period.

4. The partial state diagram below details the probability of having a peanut butter sandwich (PB), a tuna fish sandwich (TF) or a grilled cheese sandwich (GC) tomorrow, depending on what was eaten today.

[Figure: Partial state diagram for lunch options. The states are PB, TF and GC; the arrows shown carry the probabilities 0.35, 0.10, 0.15, 0.20, 0.25 and 0.30.]

a. Add in arrows and probabilities to complete the state diagram.

b. Use the diagram to develop a matrix that describes the same set of conditions. (Assume the rows are what you ate today, and the columns are what you are going to eat tomorrow.)

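5. Describe how each multiplication affects the contents of the matrix $M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$.

a. $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

b. $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$

c. $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$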

1 2 1 0  d.      3 4  0 1

Lesson 3 The Power of a Matrix

So far, the lessons have shown how proteins change over time by substitution of one amino acid with another amino acid. Probabilities describe the likelihood that the amino acids will undergo a substitution during some established amount of time. As a result, the probabilities can be used to predict how many of each amino acid would be present at the end of a time period, if the original numbers are known. Matrices are a convenient way to keep track of the numbers involved and the computations required.

Predicting Future Change

At the end of Lesson 2, the substitution matrix was applied over two successive intervals of time. Figure 3.1 illustrates this idea.

Matrix A: Initial Values → (times B) → Values after 1 e.u. → (times B) → Values after 2 e.u.

Figure 3.1: Calculation over two evolutionary units.

The substitution matrix B describes the changes that take place over one e.u. To understand the evolution of living things though, it is necessary to describe these changes over many e.u.’s. This lesson focuses on using the substitution matrix to examine change over extended periods of time.

Matrix Multiplication Review

In algebra, when an expression like $a \times x \times x$ is used, exponents are used to write the same expression in simplest form as $a \times x^2$. For the calculation shown in Figure 3.1, the ‘a’ would represent the initial amounts of the various amino acids (which we had assumed to be [50000 40000 30000 20000 10000]) and the ‘x’ would represent the substitution matrix containing all the probabilities.

Recall that the order of operations follows a hierarchy: 1) Parentheses, 2) Exponentiation, 3) Multiplication or Division (from left to right) and 4) Addition or Subtraction (from left to right). The same order of operations holds for matrices, so in the expression $a \times x^2$ the exponentiation would be done first.

When the variables represent individual numbers, the expressions $a \times x \times x$ and $a \times x^2$ are always possible to compute. This is not always true if a and x are matrices. The problems arise in the dimensions of the matrices.
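As a sketch of why the dimensions matter (using the m × n notation from Lesson 2; the letters m, n, p and q are placeholders introduced here), the two products can be checked as follows:

$$
\begin{aligned}
a_{(m \times n)} \times x_{(p \times q)} &\text{ is defined only when } n = p,\\
x^2 = x_{(p \times q)} \times x_{(p \times q)} &\text{ is defined only when } q = p.
\end{aligned}
$$

For the amino acid example, a is 1 × 5 and x is 5 × 5, so both conditions hold and $a \times x^2$ is again a 1 × 5 matrix.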

Questions for Discussion

1. Multiply the following matrices. 2 1 2 1 2 1 3 2 1 3  3 2 3 2  0 3 1     0 3 1   1 0 1 0      

2. If a and x are matrices, it is not always possible to compute $a \times x$ or $x^2$. To experiment with when this is possible, try multiplying the following matrices. If the multiplication is possible, determine its product; if not, indicate why not. 2 5 10 3  1 a.   3

6 4 b. 10 20 30    4 6

2  1 3  c.    2 1     

2  2 1 3 d.    0 3 1   

2    1 1 2    e.  1 2 1       2 1 1   

f. What conclusion can be drawn about squaring a matrix?

Tracing Amino Acid Substitutions with Matrices

Recall that B is the 5 × 5 matrix containing the probabilities of five amino acids (A, R, N, D and O) undergoing a substitution over one e.u.

$$
B = \begin{bmatrix}
0.9867 & 0.0001 & 0.0004 & 0.0006 & 0.0122 \\
0.0002 & 0.9913 & 0.0001 & 0.0000 & 0.0084 \\
0.0009 & 0.0001 & 0.9822 & 0.0042 & 0.0126 \\
0.0010 & 0.0000 & 0.0036 & 0.9859 & 0.0095 \\
0.0180 & 0.0069 & 0.0092 & 0.0083 & 0.9576
\end{bmatrix}
$$

We know we can square matrix B because it is a square matrix (same number of rows as columns). $B^2 =$

Recall the initial number of amino acids: A = [50000 40000 30000 20000 10000]

In Lesson 2, we found $A \cdot B$, multiplied the result by B, and obtained:

Find $A \cdot B^2$ using the A and the $B^2$ matrices above. Does your answer match the $A \cdot B \cdot B$ matrix from Lesson 2? The $B^2$ matrix describes the likelihood of any of the five amino acids undergoing substitution over two e.u.’s. It seems reasonable that powers of the substitution matrix can be used to predict what happens to amino acids over multiple e.u.’s. However, it is not true that each element of the matrix is simply raised to that power. The calculation is much more complex than that. We need to look more closely at where the numbers in $B^2$ come from to have greater confidence in the process.
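For readers working on a computer instead of a calculator, the comparison can be sketched in Python. This is an illustration under stated assumptions, not the handout's method: the helper `mat_mul` is made up here, while `B` and `A` are copied from this lesson.

```python
def mat_mul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


# Substitution matrix B for A, R, N, D and O (from this lesson).
B = [[0.9867, 0.0001, 0.0004, 0.0006, 0.0122],
     [0.0002, 0.9913, 0.0001, 0.0000, 0.0084],
     [0.0009, 0.0001, 0.9822, 0.0042, 0.0126],
     [0.0010, 0.0000, 0.0036, 0.9859, 0.0095],
     [0.0180, 0.0069, 0.0092, 0.0083, 0.9576]]

# Initial numbers of amino acids, written as a 1 x 5 matrix.
A = [[50000, 40000, 30000, 20000, 10000]]

after_1 = mat_mul(A, B)            # A x B
two_steps = mat_mul(after_1, B)    # (A x B) x B
B2 = mat_mul(B, B)                 # B squared
one_jump = mat_mul(A, B2)          # A x B^2

print([round(v) for v in after_1[0]])   # [49570, 39729, 29654, 19957, 11090]
same = all(abs(p - q) < 1e-6 for p, q in zip(two_steps[0], one_jump[0]))
print(same)                             # True
```

The first printed line reproduces the one-e.u. result quoted in Lesson 2. The second confirms that multiplying by B twice and multiplying once by $B^2$ agree, up to rounding inside the computer.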

ACTIVITY 3-1 Investigating and Interpreting Powers of Matrices

Objective: Investigate and interpret results of matrix multiplication.
Materials:
- Calculator or computer
- Handout ES-H6: Investigating and Interpreting Powers of Matrices Worksheet

Use the $B^2$ matrix to complete this Activity.

1. The notation $B^2_{1,1}$ indicates the entry in Row 1, Column 1 of the matrix $B^2$.

a. What does the number in $B^2_{1,1}$ represent?

b. There are five different pathways that begin with amino acid A and end with amino acid A after two e.u.’s. One of them is A→R→A. Using that same notation, what are the other 4 possibilities?

2. It is assumed that each substitution is independent of the others. For that reason, the probability over a “chain” of events involves multiplying the various probabilities that describe the “links”. For example, P(A→R→A) = (0.0001)(0.0002) = 0.00000002.

a. Calculate the probability for each of the remaining four pathways identified in part 1.b.

b. Each of the five pathways is disjoint. In other words, if you go down one path, you cannot be going down a different path at the same time. Therefore, the probability of any of the events taking place is found by adding together the probabilities of all the paths. Using this addition rule and the answers from part (a), find the probability of starting with amino acid A and ending up with amino acid A at the end of the second e.u.

3. The notation $B^2_{3,5}$ indicates the entry in Row 3, Column 5 of the matrix $B^2$.

a. In the context of amino acids undergoing substitution, what does $B^2_{3,5}$ mean, and what is its value?

b. Draw a state diagram for the answer to part a. that represents all possible amino acid substitutions over two e.u.’s, and the probabilities of each substitution.

c. In the state diagram in part b., identify the five pathways in which amino acid N is replaced by amino acid O in 2 e.u.’s, and include the expression that calculates the probability for each of those events.

d. Write a single expression that calculates $B^2_{3,5}$ and verify that this expression generates the same value as recorded in part a.

4. In general, the matrix multiplication A × B = C is defined in such a way that $C_{i,j}$ is found by pairing the terms of the ith row of A with the terms of the jth column of B, multiplying each pair together and then adding all the products. Take another look at our matrix $B^2$ and how the entries are determined. The square notation $B^2$ represents the matrix multiplication B × B:

$$
\begin{bmatrix}
0.9867 & 0.0001 & 0.0004 & 0.0006 & 0.0122 \\
0.0002 & 0.9913 & 0.0001 & 0.0000 & 0.0084 \\
0.0009 & 0.0001 & 0.9822 & 0.0042 & 0.0126 \\
0.0010 & 0.0000 & 0.0036 & 0.9859 & 0.0095 \\
0.0180 & 0.0069 & 0.0092 & 0.0083 & 0.9576
\end{bmatrix}
\times
\begin{bmatrix}
0.9867 & 0.0001 & 0.0004 & 0.0006 & 0.0122 \\
0.0002 & 0.9913 & 0.0001 & 0.0000 & 0.0084 \\
0.0009 & 0.0001 & 0.9822 & 0.0042 & 0.0126 \\
0.0010 & 0.0000 & 0.0036 & 0.9859 & 0.0095 \\
0.0180 & 0.0069 & 0.0092 & 0.0083 & 0.9576
\end{bmatrix}
$$

a. Write down the elements in the first row of B, and the elements in the first column of B. Then identify how the elements of the first row pair up with the elements in the first column when computing $B^2_{1,1}$.

b. Write a single expression that shows how you would compute $B^2_{1,1}$ according to the definition of matrix multiplication. Compute the value and check if it is the same as in the $B^2$ matrix.

c. What row and column are multiplied together to determine $B^2_{3,5}$?

d. Write a single expression that shows how to compute $B^2_{3,5}$, and then compute the value of the expression.

e. What does the value for $B^2_{3,5}$ represent?

5. Higher powers of the matrix B are used to project farther than 2 e.u.’s into the future. For example, probabilities of amino acid substitutions over 5 e.u.’s are contained in the matrix $B^5$. Using this new distribution of the same five amino acids used in previous problems: [20000 35000 80000 45000 40000], how many of each amino acid are expected after 5 e.u.’s?

6. Use the 5 × 5 substitution matrix B and new initial numbers of amino acids A = [50000 40000 30000 20000 10000].

a. Project from 5 e.u.’s to 100 e.u.’s into the future to complete the table. Determine how many molecules of each amino acid are expected at the end of each of the following time intervals. Round each number to the nearest integer.

| Time interval | A | R | N | D | O |
|---|---|---|---|---|---|
| 5 e.u.’s | | | | | |
| 10 e.u.’s | | | | | |
| 50 e.u.’s | | | | | |
| 100 e.u.’s | | | | | |

b. Initially, which of the amino acids had the least and most molecules? Are those numbers increasing or decreasing over time?

7. A typical graphing calculator cannot compute much higher powers of B than $B^{100}$. However, these limitations are overcome by considering what happens over multiples of hundreds of e.u.’s.

a. Create a new matrix C whose entries are equal to those of $B^{100}$. What is the numeric content of C?

b. Interpret the meaning of $C_{1,1}$ in this context. How does it relate to $B^{100}_{1,1}$?

c. Predict what the result of calculating $A \cdot C$ will be. Verify this prediction using the available technology. What is the result?

8. In algebraic expressions, $(x^a)^b = x^{ab}$. For example, $(x^2)^3 = x^6$.

a. Experiment with the matrix B to see if the same property holds for matrices. Compute $(B^2)^3$ and $B^6$. Are these two matrices the same?

b. Using matrix C, which contains probabilities for undergoing change over one hundred e.u.’s, and the exponent property you verified in part (a), determine the distribution of amino acids expected after 200, 500, 800 and 1000 e.u.’s. Round each number to the nearest integer. Remember: the initial distribution is matrix A: [50000 40000 30000 20000 10000]. Hint: $C^2 = (B^{100})^2$.

| Time interval | A | R | N | D | O |
|---|---|---|---|---|---|
| 200 e.u.’s | | | | | |
| 500 e.u.’s | | | | | |
| 800 e.u.’s | | | | | |
| 1000 e.u.’s | | | | | |

c. Mathematically, does the distribution of amino acids become constant over time? If so, estimate how long it takes to reach a constant distribution (called steady state) and what the final distribution of amino acids will be.

d. Examine the contents of the matrix $C^{10}$. What can you tell about the distribution of amino acids from looking at this matrix near the steady state time period?
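The two ideas used in this Activity, adding up pathway probabilities and the exponent property, can be checked on a small made-up example. The Python sketch below uses a hypothetical 3-state matrix `T` (not the amino acid matrix B), and the helper names are illustrative only.

```python
def mat_mul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def two_step_probability(T, start, end):
    """Add the probabilities of every pathway start -> middle -> end."""
    return sum(T[start][middle] * T[middle][end] for middle in range(len(T)))


# Hypothetical 3-state substitution matrix (each row adds up to 1).
T = [[0.90, 0.06, 0.04],
     [0.10, 0.85, 0.05],
     [0.20, 0.10, 0.70]]

T2 = mat_mul(T, T)
print(round(T2[0][0], 6), round(two_step_probability(T, 0, 0), 6))

# Exponent property: (T^2)^3 compared with T^6 built one factor at a time.
T6 = T
for _ in range(5):
    T6 = mat_mul(T6, T)
T2_cubed = mat_mul(mat_mul(T2, T2), T2)
largest_gap = max(abs(T6[i][j] - T2_cubed[i][j])
                  for i in range(3) for j in range(3))
print(largest_gap < 1e-12)
```

The pathway sum and the matrix entry match because the definition of matrix multiplication is exactly "multiply along each pathway, then add".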

Steady State

The probability that any amino acid remains the same after one e.u. can be found along the diagonal of the original matrix B. Because those values are approximately equal to 1, one may think that the relative numbers of amino acids would be fairly stable. Over long periods of time that are typical of evolutionary processes, the system moves toward its own “steady state”, where substitutions for one amino acid are being offset by the combined substitutions in the others. For example, at steady state, the number of molecules of A that change into another amino acid is offset by the number of molecules of other amino acids that change into A. After a system reaches steady state, the same distribution is calculated for any time interval into the future. Systems with long-term behaviors that tend toward a steady state are said to be stable. Stability is the state or quality of being resistant to change. Using our example, the number of molecules of A stays the same after the system reaches steady state. As long as conditions remain constant, there is good reason to continue using the same probability numbers until the point where steady state is reached. This is a mathematical consequence of the model that has been built, and depends on the assumption that conditions remain constant. However, if conditions change (and they do!), so will the probabilities in B, and evolution by substitution will continue with these new probabilities. For this reason, there is a limit to the ability of a model to accurately predict the future.
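A steady state can be found numerically by applying the matrix again and again until the numbers stop changing. The Python sketch below is a minimal illustration with a hypothetical 2-state matrix `T`; the function names, the tolerance and the starting numbers are assumptions made for the example.

```python
def step(dist, T):
    """One time interval: multiply the row vector dist by T."""
    return [sum(dist[i] * T[i][j] for i in range(len(T)))
            for j in range(len(T[0]))]


def run_to_steady_state(dist, T, tolerance=1e-9, max_steps=100000):
    """Repeat step() until no entry changes by more than tolerance."""
    for n in range(1, max_steps + 1):
        new = step(dist, T)
        if max(abs(a - b) for a, b in zip(new, dist)) < tolerance:
            return new, n
        dist = new
    raise RuntimeError("no steady state reached within max_steps")


# Hypothetical 2-state matrix (each row adds up to 1).
T = [[0.9, 0.1],
     [0.5, 0.5]]

steady_1, steps_1 = run_to_steady_state([100.0, 500.0], T)
steady_2, steps_2 = run_to_steady_state([600.0, 0.0], T)
print([round(v, 3) for v in steady_1], steps_1)
print([round(v, 3) for v in steady_2], steps_2)
```

Both starting distributions contain 600 molecules and both settle on the same final numbers, which illustrates that the steady state depends on the matrix and the total, not on how the molecules were split at the start.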

Practice

Assume that $M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$. Determine the matrix for each of the following:

1. $M^2$

2. $M^3$

3. $M^5$

Challenge

1. How would you find $M^{150}$? $M^{225}$?

2. Consider the 5 × 5 matrix B in this lesson representing the amino acids A, R, N, D and O.

a. In 1 e.u., how many different ways can amino acid A undergo substitution by amino acid R?

b. Over the course of 2 e.u.’s, how many different ways can amino acid A undergo substitution that results in amino acid R? Hint: One way is A→N→R.

c. Over the course of 3 e.u.’s, how many different ways can amino acid A undergo substitution that results in amino acid R? How is this answer determined?

d. Predict the number of different ways amino acid A can undergo substitution that results in amino acid R in just 10 e.u.’s? How is this answer determined?

Extension

Definition: A transition matrix is a square matrix whose dimension is determined by the number of discrete outcomes for a dynamic situation. A transition matrix contains all the probabilities of going from any specific state to another (or possibly the same one) in some fixed time interval. Because all possible outcomes are described by the probability of that event taking place, each row of the matrix must add up to 1.

Definition: A Markov chain consists of two initial conditions: a transition matrix T (whose probabilities are assumed to remain constant over time) and an initial distribution A. When combined by the matrix multiplication operation $AT^n$, they represent a “chain” of outcomes over n successive time intervals, as well as the predicted result at the end of the nth time period.
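The two definitions translate directly into a short program. The sketch below is illustrative only: the function names and the two-state example are hypothetical, and the tolerance allows for rounded probabilities in a printed table.

```python
def is_transition_matrix(T, tolerance=1e-9):
    """Check that T is square, has no negative entries, and rows sum to 1."""
    size = len(T)
    for row in T:
        if len(row) != size or any(p < 0 for p in row):
            return False
        if abs(sum(row) - 1.0) > tolerance:
            return False
    return True


def distribution_after(A, T, n):
    """Return A x T^n by applying T to the row vector A n times."""
    current = list(A)
    for _ in range(n):
        current = [sum(current[i] * T[i][j] for i in range(len(T)))
                   for j in range(len(T))]
    return current


# Hypothetical two-state chain with an initial distribution of 100 items.
T = [[0.8, 0.2],
     [0.4, 0.6]]
A = [30, 70]

print(is_transition_matrix(T))
print([round(v, 2) for v in distribution_after(A, T, 1)])
print([round(v, 2) for v in distribution_after(A, T, 3)])
```

Applying T one step at a time gives the same result as computing $T^n$ first, and it also records every intermediate link of the chain if the loop is printed.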

1. A pizzeria serves up three kinds of pie: pepperoni, salami and cheese. Company records show the following trend:
- 60% of the time, a customer orders the same type of pizza the next day.
- 20% of the time, a customer orders one of the other types of pizza the next day.
- 20% of the time, a customer orders the other remaining type of pizza the next day.

a. Create a transition matrix T for this situation.

b. The day that the pizzeria opened a new store in a nearby location, they sold 500 pepperoni, 200 salami and 300 cheese pizzas. If the customer preferences remain the same at this new location, how many pizzas do you expect they will sell the next day?

c. How many pizzas do you expect the new store to sell five days after they open?

2. For an automated assembly line, consistent performance is a critical issue. A machine that has worked correctly 80% of the time on average is brought in for repair. After being fixed, whenever it does a job correctly, the machine will do the next job correctly 90% of the time. When it does not do a job correctly, it will do the next job correctly only 70% of the time. We are interested in whether the repair will improve the long-term performance of the machine.

a. Create a transition matrix for this situation.

b. What percent of the tasks will be successfully completed at the end of one time period?

c. What percent of the tasks will be successfully completed after five time periods have elapsed?

d. Did the repair improve the long-term performance of the machine? Explain.

3. Two rival cable companies, TellyTV and SaddleLite, are in hot competition in one town. Researchers found the current yearly conditions as shown in the state diagram:

[Figure: State diagram with the states TellyTV, SaddleLite and No Cable Service. The arrows carry the percentages 60%, 15%, 80%, 30%, 10%, 10%, 50%, 5% and 40%.]

a. Create a transition matrix for this situation.

b. Currently, 35,000 people subscribe to TellyTV, 15,000 to SaddleLite, and 50,000 have no cable service. Under these conditions, what can TellyTV expect as the long-term share of the market in this town?

c. The marketing director for TellyTV determines that an aggressive advertising campaign can influence the people who do not have cable yet. Her figures indicate that 80% would go to her company, with 10% going to SaddleLite and 10% remaining without cable. What can TellyTV expect as the long-term share of the market if the advertising campaign is started?

d. The sales director for TellyTV vetoes the advertising campaign. He has his own plan: offer greater services for a slightly reduced rate. He estimates that 80% of the TellyTV customers will stay with the company, and only 10% will switch to SaddleLite. If this plan is enacted, what can TellyTV expect as the long-term share of the market?

4. Trends in recent national elections are studied intensely as a way to understand how voters might behave in the future. In one such study, the party affiliation of the voters in one state was examined. The result of that study is summarized in the following table (transition matrix):

| Current Election \ Next Election | Republican | Democrat | Neither |
|---|---|---|---|
| Republican | 0.75 | 0.05 | 0.20 |
| Democrat | 0.20 | 0.60 | 0.20 |
| Neither | 0.40 | 0.20 | 0.40 |

For example, there was a 75% chance that a registered Republican in one election would remain a Republican in the next election, while there was a 5% chance that a Republican would switch to being a Democrat. Does this situation ever reach steady state? If so, what will be the voter distribution along party lines?

5. The Acme Rent-A-Car Company owns and maintains many cars in its business. Every car is inspected each week, and assigned a letter G (good), F (fair) or P (poor), depending on its current condition. Acme also keeps track of how those conditions change from one week to the next, with the probabilities given in the following table (transition matrix):

| This Week \ Next Week | G | F | P |
|---|---|---|---|
| G | 0.60 | 0.30 | 0.10 |
| F | 0.20 | 0.60 | 0.20 |
| P | 0.10 | 0.40 | 0.50 |

a. What is the likelihood that a good car will become a poor car over a five-week time period?

b. If the Acme Rent-A-Car Company currently has 1200 cars that are rated ‘G’, 400 that are rated ‘F’ and 200 that are rated ‘P’, how many of each rating will the company have at the end of the five-week time period?

c. Does this situation reach steady state? If so, how many weeks will it take?

6. A study was done by professors at Brock University on the land usage of the Niagara region in Canada. A total of 1886 acres were examined in 1976, with 241 acres classified as wooded, 1340 acres as agricultural and 305 acres as urban. The researchers reported their data in the table below showing how many acres of each type underwent change, and what it became.

| Land Use in 1976 (acres) \ Land Use in 1981 (acres) | wooded | agricultural | urban |
|---|---|---|---|
| wooded | 198 | 38 | 5 |
| agricultural | 29 | 1301 | 10 |
| urban | 6 | 49 | 250 |

Assume that the development of each acre is independent, and that all decisions on their land use were random (i.e., which acres were affected, and how they were changed). In that case, the percentage of the total land in each category that underwent change is a good estimate of the probability that a particular acre of that type will undergo the corresponding change.

a. Create a matrix, similar to the one shown, which contains the various probabilities associated with the change in land use over the period from 1976 to 1981. To do that, take each row, and divide it by the total number of acres of that type that was present in 1976.

b. How many acres of each type of land were remaining at the end of 1981? Explain how to use both matrix multiplication and the given table to answer that question.

c. Assume that the same changes in land use that the researchers found in their 6-year study continue. How many acres of each type of land would be remaining at the end of 2005? (Note: that is the end of a thirty-year period from the beginning of the original study.)

d. Projecting current demands into the future often leads to bad estimates, since it is unlikely the conditions will remain the same for long. However, assume that those conditions do not change. Will this situation ever reach steady state? Explain.

e. Given typical land use changes over time, do the results from part (d) make sense? Explain.
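The step of turning a table of counts into a table of probabilities (divide each row by its own total) can be sketched as follows. The function name `row_normalize` is hypothetical; the counts are copied from the land-use table in this problem, with rows for 1976 and columns for 1981.

```python
def row_normalize(counts):
    """Divide each row by its own total to estimate probabilities."""
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix


# Acres changing from each 1976 use (rows) to each 1981 use (columns),
# in the order wooded, agricultural, urban.
counts = [[198, 38, 5],
          [29, 1301, 10],
          [6, 49, 250]]

row_totals = [sum(row) for row in counts]   # acres of each type in 1976
P = row_normalize(counts)

print(row_totals)                            # [241, 1340, 305]
for row in P:
    print([round(p, 4) for p in row])
```

The row totals match the 1976 acreages quoted in the problem, which is a quick check that the table has been read in the right direction.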

Lesson 4 The Big Picture

In earlier lessons, evolution by substitution was made simpler by assuming that there were only five categories of amino acids. The reality is that the ‘O’ group actually represents sixteen more amino acids, and there is much interaction among them that must be described exactly. In this lesson, the “big picture” is finally revealed, as the previous lessons are applied to modeling problems that biologists face regularly.

ACTIVITY 4-1 Investigating All 20 Amino Acids

Objective: Explore interactions among all 20 amino acids.
Materials:
- Handout ES-H1: Substitution Matrix
- Handout ES-H7: Investigating All 20 Amino Acids Worksheet
- Calculator/Computer (optional)

Let M be the 20 × 20 probability matrix in Table 1.2 (ES-H1).

1. Use matrix M to answer the following.

a. What is P(M→L) for a 1 e.u. time interval?

b. How would you determine P(M→L) for a 5 e.u. time interval?

c. How would you determine P(S→A) for a 10 e.u. time interval?

d. How would you determine which is a more likely event: H→Q after 20 e.u.’s, or S→T after 15 e.u.’s?

2. Consider the following distribution of amino acids found in a protein sequence from a fossilized remain (listed in the same order as they appear in matrix M):

A = [50 20 30 8 15 25 2 40 4 12 28 22 15 5 17 32 6 2 4 10]

Suppose another sample is taken from a related species that lived 50 e.u.’s later.

How would you determine the number of amino acids in the more recent species, and the percent change for each amino acid?

3. In Lesson 1, this partial protein sequence from a domesticated dog was introduced: MAASPRNSVLLAFALLCLPWPQEVGAFPAMPLSSLFANAVLRAQHLHQLAADTYKEFERA

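a. Based only on this sample, explain how you would predict the first amino acid substitution that the protein will undergo. HINT: First, make a frequency table to determine the initial distribution.

| Amino Acid | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | | | | | | | | | | | | | | | | | | | | |

b. Assume the following chart resulted from your calculations in part a. Based on this sample, how many e.u.’s will it be before the protein undergoes evolution by substitution?

| Amino Acid | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 e.u.’s | 12 | 3 | 2 | 1 | 1 | 3 | 3 | 1 | 2 | 0 | 10 | 1 | 2 | 4 | 5 | 4 | 1 | 1 | 1 | 3 |
| 1 e.u. | 12 | 3 | 2 | 1 | 1 | 3 | 3 | 1 | 2 | 0 | 10 | 1 | 2 | 4 | 5 | 4 | 1 | 1 | 1 | 3 |
| 4 e.u.’s | 12 | 3 | 2 | 1 | 1 | 3 | 3 | 1 | 2 | 0 | 10 | 1 | 2 | 4 | 5 | 4 | 1 | 1 | 1 | 3 |
| 5 e.u.’s | 11 | 3 | 2 | 1 | 1 | 3 | 3 | 1 | 2 | 0 | 10 | 1 | 2 | 4 | 5 | 4 | 1 | 1 | 1 | 3 |
| 13 e.u.’s | 11 | 3 | 2 | 1 | 1 | 3 | 3 | 1 | 2 | 0 | 10 | 1 | 2 | 4 | 5 | 4 | 1 | 1 | 1 | 3 |
| 14 e.u.’s | 11 | 3 | 2 | 1 | 1 | 3 | 3 | 2 | 2 | 0 | 10 | 1 | 2 | 4 | 5 | 4 | 1 | 1 | 1 | 3 |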
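Counting amino acids by hand is error-prone, so the frequency table can also be built with a few lines of code. This Python sketch is an optional illustration: the names `ORDER` and `frequency_table` are made up, and the column order is the one used in the tables above.

```python
from collections import Counter

# Column order of the amino acids in the frequency tables of this lesson.
ORDER = "ARNDCQEGHILKMFPSTWYV"


def frequency_table(sequence):
    """Count each amino acid, reported in the table's column order."""
    counts = Counter(sequence)
    return [counts.get(letter, 0) for letter in ORDER]


# Partial protein sequence from a domesticated dog (from Lesson 1).
dog = "MAASPRNSVLLAFALLCLPWPQEVGAFPAMPLSSLFANAVLRAQHLHQLAADTYKEFERA"

table = frequency_table(dog)
for letter, count in zip(ORDER, table):
    print(letter, count)
print("total:", sum(table))
```

The counts produced this way agree with the 0 e.u.’s row of the chart in part b, and the total equals the length of the sequence.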

Predicting Future Protein Make Up

This unit has explored the fact that a protein may change over time by amino acid substitutions. From a biological perspective, the changes result in proteins that have a slightly altered chemical structure and may function somewhat differently. From a mathematical point of view, probabilities have been used to describe the likelihood that the amino acids of one protein may undergo a substitution during some established amount of time. As a result, it is possible to predict the number of each amino acid expected at the end of a time period if the original numbers are known. Matrices are a convenient way to keep track of the numbers of amino acids involved and to make the necessary computations.

Questions for Discussion

1. According to Table 1.2, the Substitution Matrix, P(G→R) = 0.

a. Does that mean that the number of G amino acids remains constant?

b. What would be the necessary condition for a particular amino acid to remain exactly constant? Explain.

2. In earlier lessons, one property of the matrix containing the probabilities for substitution is that each row adds up to 1.

a. Verify that this property holds for Row 5 of the 20 × 20 substitution matrix (the ‘C’ amino acid).

b. If you try to do the same thing for Row 2 (the ‘R’ amino acid), you will find that the sum is 0.9999, not 1. Suggest a reason why this situation appears to contradict the property about the sum of the probabilities equaling 1.

3. Think about probabilities of independent and disjoint events to answer the following.

a. What is P(N→D→E) over 2 consecutive e.u.’s?

b. What is P(Q→E or H or K) over one e.u.?

Practice

1. Use Table 1.2, the substitution matrix, and any available technology to answer the following questions.

a. What is P(T→T) for a 5 e.u. time interval?

b. What is P(N→D) for a 20 e.u. time interval?

2. Given the following initial distribution of amino acids: A = [1000 0 900 0 800 0 700 0 600 0 500 0 400 0 300 0 200 0 100 0]. Determine the distribution that will remain after a time interval of 100 e.u.’s.

3. Humans, along with other species, contain a protein called “cytochrome c.” Scientists have determined that it contains the following amino acid sequence of length 104:
GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAAN
KNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE

Assuming that the probabilities in Table 1.2 apply to substitutions of the amino acids in this protein, predict what the distribution of those 104 amino acids would be after a time period of 25 e.u.’s. Hint: First determine the initial distribution of the amino acids.

| Amino Acid | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | | | | | | | | | | | | | | | | | | | | |

Extension

In Lesson 3, it was discovered that the long-term behavior of amino acids undergoing substitution was to reach steady state, a condition in which all substitutions were balanced out, and amino acid distributions became constant. It took a little over 1000 e.u.’s of time to reach steady state in the simplified problem that used a 5 × 5 matrix.

a. Does the generalized problem of modeling the evolution of all twenty amino acids by substitution also reach steady state in the same amount of time? Explain how you determined this.

b. If the evolution of all twenty amino acids reaches steady state by 1000 e.u.’s, how would you determine a more exact time (e.g. after 850 e.u.’s)? If it does not reach steady state by 1000 e.u.’s, when does it reach steady state?

Lesson 5 Project - Dating Evolution

Applying the Matrix

Thanks to the research of many scientists, the amino acid sequences of many different proteins are now known. The number of sequenced proteins is expanding daily. Hemoglobin is an important protein in many animals since it is a protein in red blood cells that transports both oxygen and carbon dioxide. It is also the protein that gives blood its characteristic red color. Hemoglobin is actually made from two distinct globin chains. For the purpose of this project, the beta globin chain will be used. Beta globin is a chain of 147 amino acids that has undergone many amino acid substitutions throughout time.

Consider a hypothetical evolutionary tree showing possible relationships among a rat, a mouse, a chicken, and a fish.

Possible evolutionary tree.

The point on the tree that connects the paths to the rat and mouse represents their common ancestor. Assume that this ancestor to both the rat and mouse has the following beta globin amino acid sequence: MVHLTDAEKAAVNCLWGKVNPDEVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNAK VKAHGKKVINAFNDGLNHLDNLKGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLG KEFSPAAQAAFQKVVAGVATALAHKYH

One species of mouse (Mus musculus) has this beta globin amino acid sequence: MVHLTDAEKAAVSCLWGKVNSDEVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNAK VKAHGKKVITAFNDGLNHLDSLKGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLG KDFTPAAQAAFQKVVAGVATALAHKYH

A particular species of rat (Rattus norvegicus) has this beta globin amino acid sequence: MVHLTDAEKAAVNGLWGKVNPDDVGGEALGRLLVVYPWTQRYFDSFGDLSSASAIMGNPK VKAHGKKVINAFNDGLKHLDNLKGTFAHLSELHCDKLHVDPENFRLLGNMIVIVLGHHLG KEFSPCAQAAFQKVVAGVASALASKYH

A comparison of proteins between organisms can be used to make inferences as to the time elapsed since the divergence. Scientists match up the sequence for the ancestor with the sequence for one of the animals linked to it in the evolutionary tree. This alignment, where two or more proteins are lined up for comparison, reveals which substitutions took place. A third line, right below the two sequences, uses symbols to show whether the amino acids are identical or different. ‘*’ represents where two amino acids are the same; ‘:’ represents where an amino acid has changed. That work has been done for you, comparing both the ancestor to the mouse and to the rat separately, and is provided in Table 5.1.

The substitutions are random events, so a single incident may not be truly representative of the natural pattern or the physical principles that govern it. However, for a fairly large amount of data, the relative frequency of an individual substitution can be a reasonable estimate for the probability that the substitution will take place. Relative frequency is the number of times an event occurs divided by the total number of opportunities for it to occur.
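As a minimal sketch of the relative frequency calculation, the function below counts how often one amino acid in an ancestor is replaced by another at the same aligned position. The function name `relative_frequency` and the two short sequences are hypothetical and chosen only for illustration; they are not taken from Table 5.1.

```python
def relative_frequency(ancestor, descendant, original, replacement):
    """Fraction of `original` residues in the ancestor that appear as
    `replacement` at the same aligned position in the descendant."""
    if len(ancestor) != len(descendant):
        raise ValueError("aligned sequences must have the same length")
    total = ancestor.count(original)
    if total == 0:
        raise ValueError("the ancestor contains no " + original)
    changed = sum(
        1 for a, d in zip(ancestor, descendant)
        if a == original and d == replacement
    )
    return changed / total


# Hypothetical toy alignment: the ancestor has four A's, one of which became P.
toy_ancestor = "AKAAKLAK"
toy_descendant = "AKPAKLAK"

freq = relative_frequency(toy_ancestor, toy_descendant, "A", "P")
print(freq)  # 1 of the 4 A's changed to P, so 0.25
```

With only a handful of positions, as here, the result is a rough estimate at best; the estimate becomes more trustworthy as the number of observed amino acids grows.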

Questions for Discussion

1. Consider the beta globin amino acid sequence for the ancestor of the rat and the mouse.

a. What is the number of occurrences of amino acid N?

b. What is the number of occurrences of amino acid N in the ancestor being substituted by S in the mouse?

c. Calculate the relative frequency of N being substituted by S as: Relative frequency = number of N to S substitutions/total number of N’s.

2. Consider the beta globin amino acid sequence for the ancestor of the rat and the mouse.

a. What is the number of occurrences of amino acid S?

b. What is the number of occurrences of amino acid S in the ancestor being substituted by H in the mouse?

c. Calculate the relative frequency of S being substituted by H as: Relative frequency = number of S to H substitutions/total number of S’s.

3. If the relative frequency of S to H substitutions was calculated between the rat and the last common ancestor of the fish and rat, predict how this value would compare to the value calculated in 2(c).

Project Question

For the beta globin amino acid sequences, the A to P substitution can “date” the evolution of the mouse and rat from their common ancestor. Work with the data provided in Table 5.1, and also with the matrix of substitution probabilities contained in Table 1.2. Use these handouts to estimate the time interval (in e.u.’s) between when the common ancestor lived and modern times (when mice and rats live instead).
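One possible way to organize the estimate is sketched below. It assumes that multiplying the substitution matrix by itself once corresponds to one more evolutionary unit, and it looks for the whole number of steps whose predicted probability is closest to an observed relative frequency. The two-state matrix `TOY_MATRIX`, the observed value 0.05, and the function names are all invented for illustration; they are not the values in Table 1.2 or Table 5.1.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def closest_step_count(step_matrix, row, col, observed, max_steps=500):
    """Return the number of steps n (1..max_steps) for which entry (row, col)
    of step_matrix raised to the n-th power is closest to `observed`."""
    power = step_matrix
    best_n = 1
    best_gap = abs(power[row][col] - observed)
    for n in range(2, max_steps + 1):
        power = mat_mul(power, step_matrix)
        gap = abs(power[row][col] - observed)
        if gap < best_gap:
            best_n, best_gap = n, gap
    return best_n


# Hypothetical two-state matrix (rows and columns ordered A, P); each row sums to 1.
TOY_MATRIX = [
    [0.99, 0.01],
    [0.02, 0.98],
]
observed_a_to_p = 0.05  # hypothetical relative frequency of A changing to P

steps = closest_step_count(TOY_MATRIX, 0, 1, observed_a_to_p)
print(steps)  # estimated number of steps for this toy example
```

For the project itself, the full 20 by 20 matrix from Table 1.2 would take the place of the toy matrix, and the observed relative frequency would come from the counts in Table 5.1.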


Theoretical ancestor and Rat (Rattus norvegicus) beta-globin protein, each with 147 amino acids.

Positions 1–49
Ancestor    MVHLTDAEKAAVNCLWGKVNPDEVGGEALGRLLVVYPWTQRYFDSFGDL
Rat         MVHLTDAEKAAVNGLWGKVNPDDVGGEALGRLLVVYPWTQRYFDSFGDL
Comparison  *************:********:**************************

Positions 50–98
Ancestor    SSASAIMGNAKVKAHGKKVINAFNDGLNHLDNLKGTFASLSELHCDKLH
Rat         SSASAIMGNPKVKAHGKKVINAFNDGLKHLDNLKGTFAHLSELHCDKLH
Comparison  *********:*****************:**********:**********

Positions 99–147
Ancestor    VDPENFRLLGNMIVIVLGHHLGKEFSPAAQAAFQKVVAGVATALAHKYH
Rat         VDPENFRLLGNMIVIVLGHHLGKEFSPCAQAAFQKVVAGVASALAHKYH
Comparison  ***************************:*************:*******

Theoretical ancestor and Mouse (Mus musculus) beta-globin protein, each with 147 amino acids.

Positions 1–49
Ancestor    MVHLTDAEKAAVNCLWGKVNPDEVGGEALGRLLVVYPWTQRYFDSFGDL
Mouse       MVHLTDAEKAAVSCLWGKVNSDEVGGEALGRLLVVYPWTQRYFDSFGDL
Comparison  ************:*******:****************************

Positions 50–98
Ancestor    SSASAIMGNAKVKAHGKKVINAFNDGLNHLDNLKGTFASLSELHCDKLH
Mouse       SSASAIMGNAKVKAHGKKVITAFNDGLNHLDSLKGTFASLSELHCDKLH
Comparison  ********************:**********:*****************

Positions 99–147
Ancestor    VDPENFRLLGNMIVIVLGHHLGKEFSPAAQAAFQKVVAGVATALAHKYH
Mouse       VDPENFRLLGNMIVIVLGHHLGKDFTPAAQAAFQKVVAGVATALAHKYH
Comparison  ***********************:*:***********************

Table 5.1: Aligned Data for Project


Glossary

Alignment – a way of arranging two or more amino acid sequences from an organism or organisms to identify regions of similarity that may show relationships between the sequences. The degree of relatedness between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms.

Amino acid – a building block of proteins. There are 20 amino acids used to build proteins in living things, each of which is coded for by three adjacent nucleotides in a DNA sequence.

Dimension of a matrix – the number of rows (m) and columns (n) in a matrix expressed with the notation m × n (in that order).

DNA – abbreviation for deoxyribonucleic acid, the molecule that contains the genetic code for all life forms except for a few viruses. It consists of two long, twisted chains made up of nucleotides. Each nucleotide contains one base, one phosphate group, and the sugar deoxyribose. The bases in DNA nucleotides are adenine, thymine, guanine, and cytosine.

Disjoint events – events that cannot occur at the same time.

Evolution – the process by which living organisms’ traits change over very long periods of time perhaps resulting in the production of new species. Mathematically, evolution is defined as a change in the frequency of alleles in a population over time.

Evolutionary relationships – relationships between two or more organisms that developed over the long period of time in which those organisms evolved.

Evolutionary unit (e.u.) – the average amount of time it takes for 1% of the amino acids to change.

Independent events – events in which the occurrence of one does not affect whether or not the other will occur.

Markov chain – a model for a process that has a certain number of states at a given time. The Markov chain is determined by the probability of the system moving from one state to another.

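Matrix – a rectangular array of numbers, symbols or expressions. In mathematical notation, an m × n matrix is an arrangement of numbers into m rows and n columns, commonly denoted by the symbol M, and written in the general form:

$$M = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}$$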


Matrix multiplication – a process in which the entry $c_{i,j}$ is found by summing the products of each element of the ith row of the first matrix with the corresponding element from the jth column of the second matrix. A necessary condition for performing matrix multiplication is that the number of columns in the first matrix must equal the number of rows in the second matrix. In mathematical notation, if A is an m × n matrix and B is an n × p matrix, they can be multiplied together to produce an m × p matrix, C = AB, where $c_{i,j} = \sum_{k=1}^{n} a_{i,k} b_{k,j}$.

Mutation – a change in a DNA sequence.

Nonpolar molecule – a molecule that has either all nonpolar bonds or symmetrical polar bonds. It does not exhibit negatively or positively charged regions. Amino acids may be nonpolar molecules.

Polar molecule – a molecule that has regions that are partially negatively charged and regions that are partially positively charged. Water is a polar molecule and some amino acids are polar.

Power of a matrix – process of multiplying a square matrix by itself a specified number of times, as determined by the exponent selected.

Probability – the chance that a particular event will occur, expressed as a number between 0 and 1, inclusive.

Protein – a molecule or complex of molecules consisting of subunits called amino acids. Proteins are the cell's main building materials and do most of a cell's work.

Relative frequency – the ratio of the number of times an event happens to the total number of observations made.

Simplifying assumptions – assumptions made in problem solving that reduce the complexity of the problem to facilitate analysis and finding a solution.

Stability – the state or quality of being resistant to change. The quality of being stable.

State diagram – a diagram illustrating all components of a system and the transitions or changes occurring within the system.

Steady state – a stable condition that does not change over time. A condition in which change in one direction is continuously balanced by change in another.

Substitution – in the context of this unit, it is the evolutionary process of replacing one R group (the side-chain group that determines which amino acid a particular molecule is) with a different one.


Transition matrix – the matrix which contains the probabilities for each possible state of a system to either retain that condition or undergo change to a new state. In this context, it is the probability of undergoing substitution during some fixed time interval.
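As a small worked illustration of the matrix multiplication, power of a matrix, and transition matrix entries in this glossary, consider a hypothetical transition matrix T with only two states. The numbers are invented for this example and do not come from Table 1.2. Each entry of T² is found by multiplying a row of T by a column of T and adding the products:

```latex
\[
T = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix},
\qquad
T^2 = T\,T =
\begin{pmatrix}
0.9(0.9) + 0.1(0.2) & 0.9(0.1) + 0.1(0.8) \\
0.2(0.9) + 0.8(0.2) & 0.2(0.1) + 0.8(0.8)
\end{pmatrix}
=
\begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix}
\]
```

Each row of T² still adds up to 1, as it must if the entries are the probabilities of staying in a state or moving to the other state over two time intervals.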
