Abstract

Human PMS1 is a protein that functions in DNA mismatch repair and is part of the High Mobility Group (HMG) protein family. Using homology modeling in YASARA, a three-dimensional structure of the PMS1 protein was produced, and the structure was verified as realistic using molecular dynamics. Evolutionary analysis in the online program ConSurf identified the highly and poorly conserved regions of the PMS1 protein. Sequences with high conservation scores indicate important structural and functional aspects of the protein. Using the gnomAD database, human variants of the protein were found, with a focus on missense, loss-of-function, and frameshift variants that can be located in the three-dimensional structure. Where these variants fall in the structure indicates the sites at which mutation is likely to have phenotypical consequences. Using the Human Protein Atlas, PMS1 was found in all cells; however, it is most common in cells with a high rate of cell division, such as cells in the digestive tract. Malfunctioning of PMS1 leads to genome instability and more frequent mutations, which can cause genetic defects or cancers. The Catalogue of Somatic Mutations in Cancer (COSMIC) was used to identify cancers associated with PMS1 mutations, such as colorectal cancer. It is hoped that this sequence-to-structure-to-function-to-phenotype approach will contribute to the future of genomic medicine.

Acknowledgments

I would like to thank Dr. Hawkins and the entire honors committee for the riveting opportunity to partake in Walsh University’s Honors Program. This faculty has pushed me past my comfort level for the past four years to become the successful student and person I am today. Conducting and presenting original research is an opportunity I am thankful for and would not have had without this program. Being a part of the honors program has prepared me for my future in veterinary medicine, as well as for life outside of the classroom. Next, I would like to thank Dr. Freeland for being my thesis advisor. He had several advisees, but was still dedicated to my success in this research. I am thankful for the many hours he spent discussing, reading, editing, and researching with me. He has made an impact on my education as an advisor, a professor, and a friend. I would also like to thank my fellow honors peers, who have contributed to my success in the program with unending support. Most importantly, I would like to thank my family. My family has been my greatest support in my journey as an honors student.

Table of Contents

List of Figures

List of Graphs

Introduction

Limitations

Why is this work relevant?

Literature Review

Genomic Medicine

Understanding DNA

MMR Through Evolution

Early Discoveries

Medical Consequences

Research Statement

Research Methods

NCBI

YASARA Homology Model

Molecular Dynamics

ConSurf

COSMIC

Human Protein Atlas

gnomAD; Looking for variants

Results and Discussion

YASARA Molecular Dynamics & ConSurf Dimer Report

Missing Residues

BLOSUM Scoring Matrix

COSMIC Results

Human Protein Atlas Results

Conclusion

Works Cited

List of Figures

Figure 1. Illustrates where genes are located in the cell and their role in protein expression.

Figure 2. Shows the 5 prime and 3 prime ends on a DNA strand.

Figure 3. Shows where exons and introns are located in the structure of a gene and how the introns interrupt the gene sequence and must be removed.

Figure 4. Shows the structure of the PMS1 gene on chromosome 2.

Figure 5. Shows and explains the levels of protein folding.

Figure 6. Illustrates base pairing within DNA.

Figure 7. Shows the involvement of PMS1 protein in various forms of DNA repair, and how the DNA repair proteins often function as heterodimers.

Figure 8. Shows a representation of the relationships between Eukaryotic, Bacterial, and Archaean MMR proteins, including their common ancestor.

Figure 9. A picture of the PMS1 structure after homology modeling has been run.

Figure 10. Shows the beta sheet formed on residues 117-136 of the PMS1 homodimer.

Figure 11. Shows the alpha helix on amino acid residues 270-290 of the PMS1 homodimer.

Figure 12. The structure of PMS1, for residues 353-932, as a homodimer.

Figure 13. Shows PMS1 residues 353-932 as a monomer.

Figure 14. Shows the HMG box for PMS1. The three helices comprising the HMG box are labeled 1, 2, and 3.

Figure 15. Shows the BLOSUM scoring matrix.

Figure 16. Shows a graph of the percentage of different types of mutations that occur in PMS1.

Figure 17. Shows where PMS1 is located in the cell. PMS1 is an MMR protein, so we hypothesized it would be found in the nucleus, where DNA is replicated.

List of Graphs

Graph 1. Root Mean Square Deviation (RMSD) vs. Time.

Graph 2. Shows the total movement of each amino acid in the sequence over 20 ns.

Graph 3. Also demonstrates the relationship of the RMSD for every amino acid, on a smaller scale.

Graph 4. Running average conservation score and RMSD for each residue.

Graph 5. Shows the comparison of BLOSUM scores between residues 353-932 of PMS1.

Graph 6. Shows where PMS1 protein is expressed in parts of the body.

Introduction

There are about 22,000 genes in the human genome. There are 3.2 billion nucleotide pairs in one set of chromosomes, which equals 6.4 billion nucleotide pairs in every human cell. Individual humans differ in their DNA sequences at about 1 out of every 1,000 nucleotides. Therefore, each person’s genome contains about 6 million variants from what is considered the “normal” human genome. In the genomic medicine of the future, a physician will look at variants in a patient’s genome, the goal being to detect increased risk of diseases or to determine which drugs are suitable for that patient, based on the patient’s genomic profile. Detailed knowledge about gene sequences will be necessary for using genomic data in any predictive way. The doctors engaging in genomic medicine will not have to look at 6 million different gene variants in order to make decisions about the best treatments for a patient, because most of these variants are in DNA sequences that will not affect the function of the protein encoded by the gene. The current research is the analysis of gene sequences to determine protein structures, evolutionary comparisons to identify the critical gene sequences, and filtering of human gene variants to identify which ones are likely to alter protein structure or function. Analysis of one gene at a time is the only way to acquire the detailed knowledge of genes that may affect human health.
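The arithmetic behind these figures can be checked with a short sketch. The numbers are the ones quoted in the paragraph above; the variable names, and the assumption that a cell carries exactly two sets of chromosomes, are illustrative only.

```python
# Figures quoted in the text above
PAIRS_PER_CHROMOSOME_SET = 3_200_000_000  # nucleotide pairs in one set of chromosomes
SETS_PER_CELL = 2                         # assumption: a diploid cell carries two sets
ONE_VARIANT_PER = 1000                    # individuals differ at about 1 in 1,000 nucleotides

pairs_per_cell = PAIRS_PER_CHROMOSOME_SET * SETS_PER_CELL
expected_variants = pairs_per_cell // ONE_VARIANT_PER

print(f"{pairs_per_cell:,} nucleotide pairs per cell")
print(f"about {expected_variants / 1e6:.1f} million variants per genome")
```

The result, roughly 6.4 million, is what the text rounds to "about 6 million variants."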

In this research we will explore the structure and function of the human Post Meiotic Segregation protein 1 (PMS1) because it is important in the repair of DNA damage. Malfunctioning of PMS1 leads to genome instability and more frequent mutations, which can cause genetic defects or cancers. PMS1 has not been extensively studied, although evolutionarily related proteins and heterodimers (a complex of two different proteins joined together) with similar function have been studied. This research can fill the knowledge gap concerning the three-dimensional structure of PMS1, the evolutionary conservation of amino acids within the protein, and which human variants are likely to have phenotypical consequences. Biologists know that genomic variants may change amino acids in a protein sequence; however, the goal of this research is to find which amino acid changes are most likely to affect the structure or function of PMS1.

Limitations

This research will come with limitations. The three-dimensional structure we arrive at will be a realistic approximation, but it will be hard to verify that it is the correct structure of PMS1 protein in the cell. We will use techniques to validate the quality of the structure we produce, but that still falls short of knowing whether we have arrived at the correct structure. We will use a realistic cellular environment simulation where the protein will be able to react to its environment. However, because it is a simulation and not a real cell in the lab, the results are approximate. When we analyze human variants in the genome, we focus on variants that affect the amino acid sequence, which may alter the function or structure of the protein. This means we will not detect genomic variants that alter expression of the gene itself, for example variants in the promoter region of the gene (the DNA sequences that determine the expression of the gene, which do not contain any of the amino acid coding sequence).

Why is this work relevant?

This research is relevant because other researchers have not approached the problem in this manner. The approach is new to the medical field, but it is where the future of medicine is heading. Genome wide association studies (GWAS) are a common method for determining which small variants in DNA sequence may be associated with a disease condition. The gene variants within a group of people are mapped, then researchers identify the subjects that have a certain disease in common and look for gene variants that are common in this group, but not in the unaffected subjects. One problem with GWAS is that the number of subjects in the study is not always adequate to evaluate numerous rare gene variants (Hong and Park). Another problem is that the gene variants found do not always show a function and may be false positives. For example, if a group of subjects chosen with heart disease share a similar gene sequence, it is not always the case that the variant is a contributor to heart disease. The difference between our research and GWAS is that we start with the gene. We have a method called SSFP: sequence, structure, function, and phenotype. We start with proteins we know have importance, although we may not know the function. Just having a sequence will tell us nothing, so we use the sequence to find the structure. Finding the structure may give us answers as to what the function is. The function of the gene will be expressed as the encoded protein, and this functional protein is what we mean by the term phenotype. If the function is not known, we are still able to tell if that portion of the sequence/structure is important by looking at the conservation scores. A high conservation score means that the sequence is shared across many organisms, so it is likely to be important. Likewise, if a sequence variant falls within a highly conserved part of the protein, then we may expect that this variant may cause a phenotypical problem. This is how the data provided will be linked to public health problems.
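The filtering idea described above, keeping only the variants that fall at highly conserved positions, can be sketched as follows. This is a minimal illustration, not the procedure used in this research: the `Variant` type, the function name, the score scale (higher meaning more conserved), the threshold, and all of the toy values are hypothetical.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    position: int  # 1-based residue position in the protein
    ref: str       # reference amino acid (one-letter code)
    alt: str       # variant amino acid (one-letter code)

def flag_conserved_variants(variants, conservation, threshold):
    """Return the variants whose residue position scores at or above threshold.

    Assumes higher scores mean stronger conservation; positions without a
    score are treated as not conserved.
    """
    return [v for v in variants if conservation.get(v.position, 0) >= threshold]

# Hypothetical toy data for illustration only, not results from this research
toy_conservation = {10: 9, 11: 2, 12: 8, 13: 5}
toy_variants = [Variant(10, "G", "R"), Variant(11, "A", "V"), Variant(13, "L", "P")]

flagged = flag_conserved_variants(toy_variants, toy_conservation, threshold=8)
print(flagged)
```

Only the variant at position 10 is flagged here, because it is the only toy variant sitting on a position whose score meets the threshold.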

Literature Review

Genomic Medicine

Genomic medicine is the push to bring basic knowledge of genes and proteins into the practice of medicine. The fundamental concept of genomic medicine, sometimes called precision medicine, is that healthcare providers can look at the variations in a patient's genes and assess their health risks based on these observations. Genomic medicine allows healthcare providers to take preventative measures and provide treatment if necessary. Genomic medicine is geared towards getting away from the “one size fits all” treatment the medical field uses today (Freeland 2018). This treatment is a problem because of the variation among humans, including variation in the genome, body composition, state of health, and liver and kidney functions (Freeland 2018). Another problem is that some treatments that work for most people may not work for others, based on their genome. Genomic medicine will allow a person’s genome to be used to determine which treatment will be the most successful for their health. The ultimate goal of genomic medicine is to understand the variants of genes and the proteins they code for, as well as to understand where DNA changes may lead to a change in protein function. Some difficulties of genomic medicine include limited knowledge in this field, the need for enough genomic information, privacy issues, uncertainty about reimbursement, and the possibility that this is not the right path for a given patient (Addie). A universal goal of genomic medicine mentioned by Addie is practicing genomic medicine daily in healthcare facilities to provide ‘positive population health outcomes’ (Addie). Despite these challenges, genomic medicine should provide health benefits for patients.

Understanding DNA

An important factor for understanding the research of the PMS1 protein is understanding how genes code for proteins. Human genes include two kinds of DNA sequences. Exons are the DNA sequences that carry the code for the amino acid sequence of proteins (yellow bars in Fig. 2). Introns are DNA sequences that interrupt the protein coding of the gene (blue bars in Fig. 2). After a gene is transcribed into an RNA copy, introns are removed by use of the spliceosome (composed of RNA molecules and protein enzymes). The resulting mature messenger RNA (mRNA) contains only the protein coding sequences, plus the upstream and downstream noncoding sequences (5’ and 3’ untranslated regions, the red bars in Fig. 2). Failing to properly remove an intron can lead to mutated proteins or human genetic disorders (Bali 62).
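The removal of introns described above can be pictured with a short sketch. It is a toy model under stated assumptions: the sequence is invented for illustration, exon boundaries are supplied by hand as coordinates, and the untranslated regions are left out; the real spliceosome recognizes splice sites rather than being handed positions.

```python
def splice(pre_mrna, exons):
    """Join exon segments given as (start, end) pairs, 0-based, end exclusive."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical toy transcript: uppercase marks exons, lowercase marks one intron
toy_pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUUAA"
toy_exons = [(0, 6), (18, 24)]

mature = splice(toy_pre_mrna, toy_exons)
print(mature)
```

If the intron were not removed, its lowercase nucleotides would remain between the two exons and would be read by the ribosome as if they were coding sequence, which is one way a failure of splicing can produce an altered protein.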

Ribosomes translate the mRNA into the correct amino acid sequence, according to the sequence encoded in the exons, but proteins must fold into a correct three-dimensional structure in order to function. The folding of the protein and the interactions among amino acids in the chain determine its secondary, tertiary, and quaternary structures (see figure 3). Chaperones, proteins that assist proper protein folding, are also necessary in the cell, so it is important that chaperones perform correctly; otherwise proteins may fold into nonfunctional three-dimensional structures.

Figure 1 illustrates where genes are located in the cell and their role in protein expression.

Figure 2 shows the 5 prime and 3 prime ends on a DNA strand.

Figure 3 shows where exons and introns are located in the structure of a gene and how the introns interrupt the gene sequence and must be removed.

Figure 4 shows the structure of the PMS1 gene on chromosome 2.

The vertical lines and boxes represent the exons, and the pale green areas are untranslated regions (UTR). The arrows indicate the direction of transcription.

Figure 5 shows and explains the levels of protein folding.

Mutations that alter DNA, protein structure, and protein functions are related to the development of diseases, including cancer and some rare diseases. Additionally, mutations have been linked to the development of drug resistance in cancer cells, which is dangerous when attempting treatment with medicinal therapies (Jubb 3). All humans have very similar DNA; however, each person has unique DNA sequence variations. The ability to take samples from thousands of humans allows for a better understanding of how these mutations affect the structure, function, and formation of proteins and protein complexes, and provides opportunities for correcting them (Jubb 3).

Mutations can lead to health consequences in humans. Protein-protein interactions are abundant at the molecular level and they perform important cellular processes, such as metabolism, cell signaling, and cell death (Jubb 3). These are just a few of the vital processes that can be affected by DNA mutations. Mutations may also have less drastic health consequences, such as increased risk for some diseases or the failure to respond to some drugs. In some protein-protein interactions, there is a large binding site for binding one protein to another, the binding site being the place on the proteins where chemical attractions may form. However, it is hypothesized that smaller binding pockets exist for a single amino acid. In this case, a mutation changing that single amino acid may substantially alter the binding affinity of the interacting proteins and their stability. The current research is concerned with determining which DNA coding changes will reduce the function of PMS1 protein, so understanding the interaction of PMS1 with its heterodimer partner proteins will be an important component.

Many cancers, as well as a handful of Mendelian diseases, are the effect of gene mutations (Jubb 4). Mendelian diseases are diseases that follow a dominant or recessive inheritance pattern and can occur due to a single gene. Somatic mutations are mutations that happen to cells in the body and are not passed down to offspring. Cancers are known to develop in cells that have undergone somatic mutations. However, genetic mutations passed down from parents are another possible cancer risk factor. It is for this reason that initiatives in genomic medicine should be taken to help deal with the effects of these mutations.

When damage alters DNA, DNA mismatch repair (MMR) prevents mistakes from turning into mutations, which are errors that persist because the repair process has failed to fix them. These mismatches (incorrect nucleotides brought in during replication) occur at the level of the nitrogenous bases, of which there are four. Each nucleotide has a complementary match based on its ability to hydrogen bond. The four bases are adenine (A), thymine (T), guanine (G), and cytosine (C). A and T are complementary base pairs, and G and C are complementary base pairs as well (see Figure 6).

Figure 6 illustrates base pairing within DNA.
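The pairing rule (A with T, G with C) and the idea of a mismatch can be expressed in a few lines. This is a simplified sketch: the function name and the example sequences are hypothetical, the two strands are compared position by position, and the antiparallel orientation of real DNA strands is ignored.

```python
# Complementary base pairs: A with T, G with C
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def find_mismatches(template, new_strand):
    """Return 0-based positions where new_strand does not pair with template."""
    return [
        i
        for i, (t, n) in enumerate(zip(template, new_strand))
        if COMPLEMENT[t] != n
    ]

template = "ATGCGT"
copied = "TACGTA"  # the fifth base should be C (to pair with G) but is T

print(find_mismatches(template, copied))
```

A correctly copied strand returns an empty list; the position reported for the miscopied strand is the kind of error the MMR proteins must locate and repair.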

The objective of DNA MMR is to fix mistakes in DNA synthesis that would lead to mutations, were they allowed to persist (Guarne). Defects in MMR result in a variety of errors in DNA. Point mutations are changes of a single nucleotide. Strand slippage errors result in deletions and insertions, and microsatellite instability (MSI) represents phenotypic evidence that MMR is not functioning normally (Guarne). Phenotypic evidence has to do with the expression of genetic information and is always based on protein function. Blood type or the enzymes that digest food in the gastrointestinal (GI) tract would be good examples of phenotypic expression. A defect in the MMR pathway could result in a high rate of mutation, birth defects, and infertility (Guarne). All of the DNA mutations listed above can lead to serious health problems, including various cancers. Lynch syndrome is a hereditary form of colorectal cancer known to be caused by mutations in MMR genes. Mutations most likely arise during the DNA replication process that must occur before cell division, due to errors by the DNA polymerase enzymes and uncorrected nucleotide mispairings (Lin). There are two known proteins in bacteria that target and repair these mismatches: MutS and MutL. These two proteins work together; MutS locates the mismatch and MutL helps with the repair (Guarne). This process is done using an endonuclease, an enzyme that cleaves the DNA strand in which a mismatch or other damage has been detected. Most of the research on MMR has been based on bacteria, but in eukaryotes (complex cells that have the DNA enclosed in a nucleus, including all animals, plants, fungi, and many single-celled organisms), a version of MutL is alpha-MutL. It is a heterodimer, which is a macromolecule formed by two non-identical proteins; these proteins are MLH1 and PMS1. This thesis will focus on the importance of PMS1, its three-dimensional structure, and its human genetic variants, including potential medical consequences due to failure of DNA MMR.

MMR Through Evolution

The DNA mismatch repair system is a complex pathway that developed over time through evolution. Evolution is a change in heritable characteristics over successive generations. We can understand the functions of PMS1 by studying previous research on the evolutionary precursors of PMS1. MMR has been studied in the bacterium E. coli, where MutS and MutL interact to activate the endonuclease MutH, which cuts the DNA strand at the mistake in base pairing so that the DNA can be repaired by inserting the correct complementary nucleotides (Lin). This example is specific to E. coli; however, each bacterial species has different versions of these MMR proteins. Lin presents a phylogenetic analysis of the key components of the MMR pathway, MutS and MutL (Lin). Both of these MMR proteins are rarely present in Archaea, but they have been found in Bacteria and Eukarya (Lin). Archaea, Bacteria, and Eukarya are the three domains of life. There are subfamilies of MutS, MutS1 and MutS2, and MutS1 is present in most bacteria. Also rare, MutS3 and MutS4 exist in distantly related bacterial species (Lin). When genes are duplicated accidentally, the new copies often mutate into new functions, mutate into functionless pseudogenes, or maintain a similar function despite the sequence differences. These genes that are the result of duplication followed by mutation are called homologs, meaning they are related by descent from a common ancestor. Mut proteins are the bacterial homologs of PMS1. They have similar function; therefore, we can learn about PMS1 from the study of the Mut proteins.

In eukaryotes, the MutS1 and MSH genes have undergone evolution. These are both found in major eukaryotic lineages. Eukaryotes are organisms whose cells have a nucleus, and they can be multicellular (Lin). MSH genes were found to have undergone multiple gene duplication events, which allowed these genes to evolve specializations in function. Once there were duplicate genes with differing sequences, this allowed for the formation of heterodimers in the MMR system (Lin). Additionally, versions of the MutS gene came about through endosymbiosis, in which one prokaryote engulfed another and the two lived together and eventually formed a eukaryote, with the engulfed cells turning into the mitochondria. The mitochondria are where many biochemical processes occur and where energy metabolism takes place. MutL genes have an interesting evolution as well. MutL genes are only present in species where MutS genes exist (Lin). This is logical because these MMR proteins interact with one another to get the repairs done. PMS1 and PMS2 were found to be present in plant, fungal, and animal genomes (Lin). In conclusion, MutL and MutS genes coexist and coevolve. Without one, the other has no role; if there is a change to one of the genes, evolutionary pressure (natural selection) will favor changes to the other gene as well (Lin).

Figure 7 shows the involvement of PMS1 protein in various forms of DNA repair, and how the DNA repair proteins often function as heterodimers (Kolodner).

Figure 8 shows a representation of the relationships between Eukaryotic, Bacterial, and Archaean MMR proteins, including their common ancestor (Kolodner).

Early Discoveries

The discovery of DNA repair mechanisms dates back to the 1960s. Kolodner has provided an article titled “A Personal Historical View of DNA Mismatch Repair with an Emphasis on Eukaryotic DNA Mismatch Repair.” This follows the researcher's own history of his involvement in the study of DNA repair. In the 1960s, researchers were beginning to find mismatched nucleotide bases and found that MMR would help correct the errors; however, it was later discovered that MMR is involved in mutagenesis, which is the creation of mutations, and that it is important for DNA replication fidelity (Kolodner). The first lab he worked in was where the first endonuclease was studied, an enzyme that cuts out the damaged portion of DNA. He practiced new methods of mutagenesis by making purposeful mutations in the MMR genes such that the endonuclease would be inactive. This allowed him to find which amino acids were important for the structure and function of these proteins (Kolodner). From the 1960s until the 1980s, research continued on E. coli methyl-directed MMR structure and E. coli MMR proteins (Kolodner).

The study of E. coli suggested functions of the MutS, MutL, and even MutH genes, those functions being mispair recognition and strand discrimination, respectively. Although these studies have been important to the overall mismatch repair pathway, the methyl-directed post-replication repair only exists in bacteria. This is when methyl groups (a carbon with three hydrogens) are added to the DNA molecule and can change the activity of a DNA segment without changing the sequence. For example, DNA methylation typically acts to repress gene transcription; this is commonly referred to as gene silencing and is a very important component of epigenetics.

Next, Kolodner wanted to compare eukaryotic MMR and prokaryotic MMR using S. cerevisiae and E. coli. One difference is that eukaryotic MMR does not use DNA methylation, whereas prokaryotic MMR does. Another comparison involves double strand break repair, which is required when both strands of the DNA double helix break. The phenomenon of gene conversion occurs when a heterozygous gene takes on the identity of its allele, and this is the result of double strand break repair (Kolodner). Soon after, endonucleases that could cleave the DNA strand at a site of damage were discovered, including repair in mitochondrial DNA (see the work by Freeland and Deering), which helps explain the small likelihood of mutation (Kolodner).

Kolodner states that many labs were doing research on the relationship between MMR defects and risk of cancer, and his lab was doing work to further their knowledge of Saccharomyces cerevisiae MMR (Kolodner). After these important discoveries, the study of MMR expanded in all directions. So much has been learned about these MMR proteins, but he states that there are still many unanswered questions when it comes to MMR mechanisms (Kolodner).

There are three important MutL heterodimers involved in MMR. These heterodimers are hMLH1–hMLH3, hMLH1–hPMS1, and hMLH1–hPMS2 (Kondo). Heterodimers are proteins composed of two polypeptide chains that differ in amino acid sequence. Kondo presents a study of these specific heterodimers and how they interact, including human PMS1 protein (Kondo). Kondo's results agreed with those of a similar study. Kondo determined that the domain of hMLH1 that interacts with hPMS2 lies at the same residues that interact with hMLH3 and hPMS1 (Kondo). This means hMLH3 and hPMS1 share the same 'interacting target' as hPMS2, so they form very similar bonds and structures and may have to compete to bind. PMS1 is the focus of this thesis; however, all these MMR proteins are closely related. Furthermore, the fact that these three proteins compete for the same target on hMLH1 means they may partake in a process that incorporates three different heterodimers with shared domains (Kondo). This helps determine their importance in relation to function. If a function of MutL needs to be expressed and all three of these heterodimers share a domain, then they must partner to make the function happen (Kondo). They suggested that about 36 amino acids are needed for the interactions of these proteins with each other (Kondo). These 36 amino acids are similar in all the homologs of MutL, including PMS1, suggesting that we should be alert for amino acid changes that may disrupt this important functional part of the protein. Not only is it important to understand the functions and interactions of these proteins, but knowing what can go wrong is important as well. Kondo states that when mutations occur in these genes, the proteins may not bind correctly to each other (Kondo).

Medical Consequences

The job of PMS1 protein is to help fix errors in DNA sequence that have been missed by the DNA synthesis proofreading process. DNA replication is an extremely accurate process; however, some errors do occur. If an error happens in DNA replication, it is fixed by the exonuclease function of the replication enzymes during replication. Sometimes these errors are overlooked, and the MMR pathway uses its proteins to correct them. However, sometimes the MMR pathway itself is defective, which leaves the mistake uncorrected. Lynch syndrome is an inherited disease that causes colon cancer through mutations arising from a defective PMS1/MLH1 MMR heterodimer (Smith; Silva; Zhao). This defect leaves the MMR proteins unable to carry out their function of rectifying errors made during DNA replication. Unrepaired DNA errors are then able to persist as mutations. There are several different versions of this Lynch syndrome mutation, but all may cause health issues (Smith). Missense mutations cause incorrect amino acids to be built into the protein during protein synthesis. This incorrect sequence of amino acids may inhibit the endonuclease from doing its job of fixing the mutation, so more mutations become permanent in the cell (Smith; Silva; Zhao). Overall, this causes MMR and endonuclease defects because the MMR pathway is not able to do its job of correcting the mistake. The mutation occurs in EXO1, which is a major repair protein that is an endonuclease (an enzyme that cuts the damaged strand of DNA so that damaged nucleotides may be removed) and is important to the function of the PMS1/MLH1 heterodimer. When mutations in EXO1 eliminate the function of that protein, the endonuclease function of PMS1 can take over the nicking of the damaged DNA strand (Smith). This article explains that if functional EXO1 protein is not present, the important metal binding sites in the PMS1 endonuclease must be intact (no mutations are tolerated at these sites) so that the PMS1 protein may take over and cut the DNA strand at the damaged site. This shows that there is more than one MMR pathway.

A study done by Yan and Zhao showed that microsatellite instability was detected in colorectal cancer cells. Microsatellite instability is an indicator of numerous strand breaks in DNA, due to defective MMR (Yan; Zhao). This study was done by testing the tumors of patients with this type of cancer. Each sample was tested using the polymerase chain reaction (PCR). PCR is a laboratory technique based on DNA replication. It functions by taking apart the two strands of DNA, making new complementary strands of both, and then repeating this process for 30 or so cycles. This creates a large amount of the target DNA sequence for study. The tumors in this research had deficient MMR protein expression.
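Why 30 or so cycles yield "a large amount" of target DNA follows from the doubling at each cycle. The sketch below assumes an idealized reaction in which every cycle doubles the target perfectly; real reactions are less efficient, and the function name is illustrative only.

```python
def pcr_copies(starting_copies, cycles):
    """Ideal PCR yield, assuming every cycle exactly doubles the target."""
    return starting_copies * 2 ** cycles

# One starting molecule after 30 ideal cycles
print(f"{pcr_copies(1, 30):,}")
```

Under this idealized assumption a single starting molecule becomes more than a billion copies after 30 cycles.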

A mutant mismatch repair protein heterodimer contributes to breast cancer as well. There is a correlation between resistance to endocrine therapy and defective MutL protein, which is the heterodimer of MLH1/PMS1 (Haricharan). Endocrine therapy is a hormone therapy used to treat breast cancer. Essentially, it inhibits the hormones estrogen and progesterone from stimulating growth of breast cancer cells. Removal of the ovaries may become necessary to adequately stop this process. A defective MLH1/PMS1 MMR protein complex causes defects in CHK2-mediated inhibition of CDK4, which is necessary for the therapy to be successful (Haricharan). CHK2 is a tumor suppressor gene that is involved in DNA repair, damage response, and apoptosis. CDK4 is a cell division protein in eukaryotic cells. When the MMR protein complex does not allow CDK4 to be inhibited, the cancer cells continue to grow. The endocrine therapy is supposed to stop this growth; however, the mutation prevents the MMR complex from doing its job.

Research Statement

In this research, Dr. Freeland and I will explore the structure and function of the human

Post Meiotic Segregation protein 1 (PMS1 protein). PMS1 is important in the repair of DNA damage including base mispairs, and small insertions or deletions. Malfunctioning of PMS1 leads to genome instability and more frequent mutations, which can cause genetic defects or cancers. We will use homology modeling to obtain a likely three-dimensional structure for PMS1 using the YASARA program. Also, in YASARA, molecular dynamic simulation will provide validation of the structure and analysis of amino acid mobility in the three-dimensional structure.

Evolutionary analysis using ConSurf can identify the amino acid residues that are highly conserved across species. These residues are therefore important for proper structure or function.

Using available human genome databases, known variants of human PMS1 gene will be analyzed for gene differences that are likely to create a functional difference in the PMS1 protein. This can lead to predictions as to which human genome variants can result in health consequences, such as cancer. Using Catalogue of Somatic Mutations In Cancer (COSMIC) and human protein atlas (HPA) databases, we can predict which tissues are likely to be affected by malfunctions in PMS1 protein.

Research Methods

NCBI

The first step was to obtain the amino acid sequence of PMS1 using an online database called NCBI (National Center for Biotechnology Information). By searching ‘PMS1 Human’ under ‘Proteins’ in the database we were able to obtain the amino acid sequence for PMS1. A search for an amino acid sequence does not always make the best choice obvious, because the database returns many sequences to pick from. We chose the sequence we considered best on the basis of sequence length. Below is the amino acid sequence of the human PMS1 protein that was obtained from NCBI; it is 932 amino acids long. Any sequence saved from NCBI must be saved as a FASTA file so that it can be used in YASARA.

The FASTA format is the ‘greater than’ sign followed by the name of the sequence and a hard return. When we use this format, bioinformatics programs understand that the sequence after the hard return is what needs to be analyzed.

>sp|P54277|PMS1_HUMAN PMS1 protein homolog 1 OS=Homo sapiens GN=PMS1 PE=1 SV=1

MKQLPAATVRLLSSSQIITSVVSVVKELIENSLDAGATSVDVKLENYGFDKIEVRDNGEG

IKAVDAPVMAMKYYTSKINSHEDLENLTTYGFRGEALGSICCIAEVLITTRTAADNFSTQ

YVLDGSGHILSQKPSHLGQGTTVTALRLFKNLPVRKQFYSTAKKCKDEIKKIQDLLMSFG

ILKPDLRIVFVHNKAVIWQKSRVSDHKMALMSVLGTAVMNNMESFQYHSEESQIYLSGFL

PKCDADHSFTSLSTPERSFIFINSRPVHQKDILKLIRHHYNLKCLKESTRLYPVFFLKID

VPTADVDVNLTPDKSQVLLQNKESVLIALENLMTTCYGPLPSTNSYENNKTDVSAADIVL

SKTAETDVLFNKVESSGKNYSNVDTSVIPFQNDMHNDESGKNTDDCLNHQISIGDFGYGH

CSSEISNIDKNTKNAFQDISMSNVSWENSQTEYSKTCFISSVKHTQSENGNKDHIDESGE

NEEEAGLENSSEISADEWSRGNILKNSVGENIEPVKILVPEKSLPCKVSNNNYPIPEQMN

LNEDSCNKKSNVIDNKSGKVTAYDLLSNRVIKKPMSASALFVQDHRPQFLIENPKTSLED

ATLQIEELWKTLSEEEKLKYEEKATKDLERYNSQMKRAIEQESQMSLKDGRKKIKPTSAW

NLAQKHKLKTSLSNQPKLDELLQSQIEKRRSQNIKMVQIPFSMKNLKINFKKQNKVDLEE

KDEPCLIHNLRFPDAWLMTSKTEVMLLNPYRVEEALLFKRLLENHKLPAEPLEKPIMLTE

SLFNGSHYLDVLYKMTADDQRYSGSTYLSDPRLTANGFKIKLIPGVSITENYLEIEGMAN

CLPFYGVADLKEILNAILNRNAKEVYECRPRKVISYLEGEAVRLSRQLPMYLSKEDIQDI

IYRMKHQFGNEIKECVHGRPFFHHLTYLPETT
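As a minimal sketch of how a bioinformatics program reads the FASTA format described above, the Python below splits FASTA text into a header and a sequence and reports the sequence length. The function name read_fasta and the demo header are our own choices for illustration; the demo record contains only the first 20 residues of the sequence shown above.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no sequence data
        if line.startswith(">"):
            # The 'greater than' sign marks the name of a new sequence.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Every line after the header belongs to that sequence.
            records[header] += line
    return records


# Demo record: only the first 20 residues of the PMS1 sequence above.
example = """>demo first 20 residues of PMS1
MKQLPAATVR
LLSSSQIITS
"""

records = read_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

Applied to the complete record above, the same length check should report the 932 residues stated in the text.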

YASARA Homology Model

Yet Another Scientific Artificial Reality Application, YASARA, is a software package for analysis of protein structures. From the human genome we have obtained genomic information about PMS1 protein including amino acid sequence. This sequence is entered into YASARA to produce a likely structure of the protein, based on existing, experimentally-determined structures of homologous proteins (or parts thereof). YASARA also allows manipulation of the structures and sequences and simulation of the behavior of molecules in a cellular environment.

Steps to using YASARA homology modeling

1. Open YASARA
2. Options
3. Experiment
4. Homology modeling
5. Select target (in files- PMS1.fasta)
6. Template-ok

PSI-BLAST is the Basic Local Alignment Search Tool run with a position-specific iteration process. It rescores the search on each iteration to find sequences that match in important parts. It uses a position-specific scoring matrix (PSSM), which is created from sequences that are similar or related to the PMS1 protein sequence. The PSSM shows which amino acid positions are highly conserved. Templates are experimentally determined structures of proteins with similar sequences. The E-value stands for expected value. The E-value is important because it indicates how likely it is that a sequence match happened purely by chance; with 0.3 being the default, we never use anything higher than that.

• Modeling speed (slow = best): Slow
• Number of PSI-BLAST iterations in template search (PsiBLASTs): 3
• Maximum allowed (PSI-)BLAST E-value to consider template (EValue Max): 0.1
• Maximum number of templates to be used (Templates Total): 15
• Maximum number of templates with same sequence (Templates SameSeq): 1
• Maximum oligomerization state (OligoState): 4 (tetrameric)
• Maximum number of alignment variations per template (Alignments): 5
• Maximum number of conformations tried per loop (LoopSamples): 50
• Maximum number of residues added to the termini (TermExtension): 10

The output given by YASARA depends on the parameters that are set. When running homology modeling on a protein for the second or third time, the parameters may be changed to obtain different results. Parameters can affect the length of the model, whether residues are missing, and other protein properties. The E-value estimates how likely it is that a sequence similarity arose at random. The maximum E-value was set at 0.1, which means that the templates we received most likely match because of evolutionary relationships, with very little chance of a random match. The oligomerization state sets the largest number of copies of the protein that may be modeled together. We set the maximum number of templates to 15, which limits how many structures YASARA will compare our sequence to. We like to make this 15 or 20 so that we are likely to find all structures that match our sequence. The number of loop samples was set to 50, which means that YASARA tries up to 50 different conformations for each poorly defined loop. If a section of the sequence matches the templates poorly and cannot be modeled directly, loop sampling helps find the most energetically stable structure. Lastly, the terminal extension controls how many amino acids YASARA can add to the ends of the structure. Later in this research, this option helped me realize that when I first ran homology modeling for PMS1 I was not provided with the full structure.
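To illustrate how two of these parameters act together, the sketch below records the settings in a Python dictionary and applies the E-value cutoff and the template limit to a list of candidate templates. This is only an illustration of the idea: the dictionary is not a YASARA input format, select_templates is a hypothetical helper rather than YASARA's internal procedure, and the template names and E-values are invented.

```python
# Keys follow the parameter labels listed above; record keeping only,
# not a YASARA input format.
hm_parameters = {
    "PsiBLASTs": 3,
    "EValue Max": 0.1,
    "Templates Total": 15,
    "Templates SameSeq": 1,
    "OligoState": 4,
    "Alignments": 5,
    "LoopSamples": 50,
    "TermExtension": 10,
}


def select_templates(hits, params):
    """Keep hits at or below the E-value cutoff, best first, up to the limit."""
    passing = [hit for hit in hits if hit["evalue"] <= params["EValue Max"]]
    passing.sort(key=lambda hit: hit["evalue"])
    return passing[: params["Templates Total"]]


# Hypothetical template hits; the names and E-values are invented.
hits = [
    {"id": "tmplA", "evalue": 1e-40},
    {"id": "tmplB", "evalue": 0.05},
    {"id": "tmplC", "evalue": 0.3},
]

selected = select_templates(hits, hm_parameters)
print([hit["id"] for hit in selected])
```

In this example the third hit is rejected because its E-value of 0.3 is above the 0.1 cutoff, which is the situation the cutoff is meant to catch.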

Molecular Dynamics

We will use our target protein and its three-dimensional structure and run a molecular dynamics program to simulate the behavior of this protein model in a water environment, allowing atoms to interact according to the forces that surround them. The set of rules describing these forces is known as a force field. The force field ensures that every particle acts on and reacts to the others. The simulation will run on the computer for 20 nanoseconds of simulated time (1 nanosecond = 10^-9 seconds), which could take up to a month for a larger protein.

ConSurf

We will load a three-dimensional structure in the form of a PDB file into the web-based ConSurf server. ConSurf finds homologous proteins in protein databases, aligns them, and identifies which amino acid residues are consistent, which indicates evolutionary conservation. Evolutionary conservation in turn indicates the importance of the conserved amino acids for structure and function of the protein. ConSurf gives a conservation score for each amino acid, which can be displayed as color codes for the amino acids in the three-dimensional protein structure.

COSMIC

COSMIC, which stands for Catalogue Of Somatic Mutations In Cancer, is a web-based server. We will search for PMS1 in the COSMIC database, and it will return information on which cancers have abnormal function of PMS1 protein.

Human Protein Atlas

Human Protein Atlas, also known as HPA, is also a web-based server. Entering PMS1 in

HPA will show which human tissues show expression of this protein and how much. Using this method will allow us to determine which tissues may undergo consequences from harmful DNA sequence variations.

GNOMAD; Looking for human genome variants

Thousands of human genome variants found in the human PMS1 gene from the

GNOMAD database will be narrowed down to variants which are in the coding region of the gene, which actually change the amino acid sequence of the protein, and which affect vital amino acids that are shown to be conserved in evolution.
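The narrowing described above can be sketched as three filters applied in order. The Python below is a hedged illustration, not the procedure used with the GNOMAD website: the record fields, the consequence labels, the positions, and the conservation scores are all invented for the example, and the cutoff of 8 is an assumption based on treating ConSurf scores of 8 and 9 as highly conserved.

```python
def filter_variants(variants, conservation, min_score=8):
    """Return coding, protein-changing variants at highly conserved positions."""
    protein_changing = {"missense", "frameshift", "loss_of_function"}
    kept = []
    for variant in variants:
        if not variant["coding"]:
            continue  # filter 1: must lie in the coding region
        if variant["consequence"] not in protein_changing:
            continue  # filter 2: must change the amino acid sequence
        if conservation.get(variant["position"], 0) < min_score:
            continue  # filter 3: must affect a conserved amino acid
        kept.append(variant)
    return kept


# Hypothetical variant records and conservation scores (all invented).
variants = [
    {"position": 10, "consequence": "missense", "coding": True},
    {"position": 20, "consequence": "synonymous", "coding": True},
    {"position": 30, "consequence": "missense", "coding": True},
    {"position": None, "consequence": "intron", "coding": False},
]
conservation = {10: 9, 20: 9, 30: 3}

kept = filter_variants(variants, conservation)
print([variant["position"] for variant in kept])
```

Only the first record passes all three filters; the others fail because they are synonymous, poorly conserved, or outside the coding region.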

Results and Discussion

In short, the PMS1 sequence is entered into YASARA using homology modeling to produce a likely structure of the protein, based on existing, experimentally-determined structures of homologous proteins. Fig. 8 is a screenshot of the homology modeling results. When running homology modeling, YASARA gives a composite model that is normally understood to be the best choice. However, for PMS1 this was not the case. The composite model was not in dimer form, so a better model was chosen. This is an example of how preexisting knowledge about the protein helped us pick the best results. The human PMS1 protein is a homodimer, so we expected it to have two molecules, which every structure did. Because PMS1 is a dimer that works in DNA mismatch repair, we knew the molecules must work together and therefore be attached. The model we chose was the most stable-looking structure with both molecules connected.

Figure 9 is a picture of the PMS1 structure after homology modeling has been run. This is believed to be the best structure of the human PMS1 protein residues 1-352.

Something important to note about the above structure is the coloring of it. When this structure was produced it was multicolored based on the amino acids. Here, I color-coded the structure by conservation numbers of the amino acids. The purple and the darker blue colors are the most highly conserved parts of the sequences in the PMS1 structure. The lighter blues and teal colors are the lower end of the conservation scores. In the results we will discuss how the conservation scores may affect the stability of the structure. Color-coding the structure by conservation is a good way to recognize the most important and relevant points in the structure.

YASARA Molecular Dynamics & ConSurf Dimer Report

Graph 1: Root Mean Square Deviation (RMSD) vs. Time.

In running molecular dynamics, RMSD is a check for stability because it measures how far, on average, the atoms in the molecule have moved from their starting positions. This time was measured on a 20-nanosecond scale (20 billionths of a second). Atomic movements are measured in angstroms (Å); one angstrom is 1/10 of a nanometer, and RMSD is reported in angstroms. In molecular dynamics, if the structure wiggles away from its starting shape and does not return, it may not be a stable structure. Notice above there is a plateau in the graph. This plateau in the RMSD over time means that overall the structure is stable. Although there was some movement and instability in the structure during the first portion of the simulation, the curve leveled out in the end.
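For readers who want to see the calculation, RMSD is the square root of the average squared distance between each atom's current position and its starting position. The Python below is a minimal sketch of that formula using invented coordinates in angstroms; it assumes the two structures are already superimposed, a step that simulation software normally handles and that is omitted here.

```python
import math


def rmsd(reference, current):
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if len(reference) != len(current):
        raise ValueError("both structures must have the same number of atoms")
    total = 0.0
    for (x0, y0, z0), (x1, y1, z1) in zip(reference, current):
        # Squared distance this atom has moved from its starting point.
        total += (x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2
    return math.sqrt(total / len(reference))


# Invented coordinates: every atom is shifted by exactly 1 angstrom along z.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
later = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]

print(rmsd(start, later))  # 1.0
```

Computing this value at each saved time point of a simulation gives the kind of RMSD versus time curve on which a plateau can be seen.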

Graph 2: RMSD vs Amino Acid Sequence. This graph shows the total movement of each amino acid in the sequence over 20ns.

The peaks in the graph indicate instability in that section of the chain, and we hypothesize that these are parts of the sequence that are not highly conserved. The stable parts of the sequence appear as low RMSD values on the graph, and we hypothesize that these amino acids may be highly conserved. The PMS1 protein is a homodimer, so the peaks should occur at the same places for both molecules. At amino acids 109, 289, 343 and 351, the peaks show very high mobility in the structure. Another similarity in stability can be found where serine occurs in the structure at amino acids 253 and 135, with a small peak at 315. Individual amino acids in the structure do not determine mobility; the way they interact with each other, through intramolecular forces, does. PMS1 is found to be a very stable structure overall except at both ends of each molecule in the dimer. We hypothesize that the regions of high mobility on the graph will appear as unstructured loops in the model because these parts of the structure are not involved in the dimer interactions and may not be stable.

Graph 3: RMSD vs Amino Acid Sequence (zoomed in)

This graph also shows the RMSD of every amino acid, but on a smaller scale. The y-axis only goes to 8, whereas in Graph 2 it goes to 27. This smaller scale shows the stability of each amino acid more clearly and gives a closer look at the RMSD along the amino acid sequence. This graph was important to include because it gives us a better idea of which amino acid regions are causing the peaks of mobility in the graph. We compared these high mobility regions in the graph to the corresponding regions of the structure to see what protein shape they have. We hypothesize that areas of high mobility in proteins are loops if they are unstable. However, depending on the function of the protein, a high mobility region could instead be an alpha helix or beta strand. We looked at residues 117-136 and found a hinged beta strand, which could be a part of this protein’s function where it has to be mobile. We also found mobility in residues 270-290 and found an alpha helix in the corresponding region of the structure. We hypothesize that this alpha helix is mobile because it is not a part of the dimer interaction.
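Finding stretches of high mobility such as those discussed above can be sketched as grouping consecutive residues whose RMSD is above a chosen cutoff. The Python below is an illustration under stated assumptions: the per-residue values and the cutoff of 2.0 angstroms are invented, and mobile_regions is a hypothetical helper rather than a YASARA function.

```python
def mobile_regions(per_residue_rmsd, threshold):
    """Group consecutive residues above the threshold into (start, end) ranges."""
    regions = []
    start = None
    previous = None
    for residue in sorted(per_residue_rmsd):
        if per_residue_rmsd[residue] > threshold:
            if start is None:
                start = residue  # a new mobile stretch begins here
            elif residue != previous + 1:
                # A gap in residue numbering also ends the current stretch.
                regions.append((start, previous))
                start = residue
            previous = residue
        elif start is not None:
            regions.append((start, previous))  # the stretch has ended
            start = None
    if start is not None:
        regions.append((start, previous))
    return regions


# Invented per-residue RMSD values in angstroms, keyed by residue number.
example = {1: 0.5, 2: 3.0, 3: 3.5, 4: 0.4, 5: 2.9, 6: 0.3}

regions = mobile_regions(example, threshold=2.0)
print(regions)  # [(2, 3), (5, 5)]
```

Each range returned can then be inspected in the three-dimensional model to see whether it is a loop, a helix, or a strand.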

Figure 10 shows the beta sheet formed by residues 117-136 of the PMS1 homodimer.

Figure 11 shows alpha helix on amino acid residues 270-290 on the PMS1 homodimer.

Graph 4: Running Avg Conservation score and RMSD for each residue

Graph 4 shows the relationship between the running average of the conservation scores of the amino acids in the PMS1 protein sequence provided by ConSurf and the RMSD movement for each residue. A higher conservation score means that these amino acids are important to the structure or function of the protein. Also, a low RMSD indicates which residues are more stable in the structure. This graph shows an inverse relationship between the running average of the conservation score and the RMSD. The inversion occurs in several parts of the graph and is important to understanding the relationship between conservation score and stability. A running average of the conservation score was used to smooth the curve and indicate short regions of higher and lower conservation, rather than the conservation of individual amino acids. Individual residues are unable to move independently of neighboring residues, so RMSD changes gradually along the chain, but each residue can have a conservation score independent of its neighbors. A running average for conservation score allows comparison of the two data lines.
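The smoothing and comparison described above can be sketched in a few lines. The Python below computes a centered running average and a Pearson correlation coefficient, where a negative coefficient corresponds to an inverse relationship. The window size of 3 and all of the numbers are assumptions made for the example; they are not the values behind Graph 4.

```python
import math


def running_average(values, window=5):
    """Centered running average; the window shrinks at both ends of the list."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented data: conserved residues at the ends, a mobile stretch in the middle.
conservation = [9, 8, 9, 7, 3, 2, 1, 2, 8, 9]
rmsd_values = [1.0, 1.2, 1.1, 1.5, 4.0, 5.5, 6.0, 5.0, 1.4, 1.0]

smoothed = running_average(conservation, window=3)
r = pearson(smoothed, rmsd_values)
print(round(r, 2))
```

A coefficient close to -1 would indicate that high conservation and low mobility go together along most of the chain, while regions that break the pattern would need to be examined individually.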

It is a reasonable hypothesis that highly conserved amino acids in the sequence represent those residues that are important for structure and/or function. It was hypothesized that low RMSD scores (minimal movement) would correspond to high conservation scores throughout the sequence, and this is true in general. Instances where this inverse relationship does not hold may indicate regions of the protein in which movement is required for function. This includes the region near His81 and the amino acids between Val191 and Ser201. Because the structure of the protein is critical to its function, this is a reasonable conclusion.

Missing residues

PMS1 is an HMG protein, as stated before. The HMG Box did not show up in the initial structure. The human proteins that the research group works from at this time all include this sequence. The HMG box is part of the structure that includes three alpha helices which make a binding site for DNA, and this is why the HMG Box proteins are transcriptional regulators, and why they are an important set of proteins to understand. Due to this exclusion, we modeled the

PMS1 protein again.

The first time we modeled PMS1, YASARA only modeled amino acid residues 1-352. This portion of the protein forms a homodimer in the model. The native PMS1 is known to form a dimer with a similar protein, MLH1, so the homodimer is a good approximation of the native structure. To model the C-terminal portion of PMS1, we cut the first 352 amino acids off the FASTA file, because YASARA had not given us a structure for that part of the sequence on the first attempt. By modeling the missing residues only, we forced YASARA to find additional templates that allowed modeling of the later portions of the protein, including the HMG box which was previously missing. We were able to use what we previously knew about the structure of this protein and its ability to form a dimer to realize the HMG box was missing.

Additionally, we attempted homology modeling on the PMS1-MLH1 heterodimer and the MLH1-MLH1 homodimer. Based on what we know about proteins, YASARA was unable to provide a structure for the heterodimer that we felt was correct. It gave us two separate structures but had no way to connect them. This is something that will be researched in the future. The same thing happened for the homodimer of MLH1: we were unable to find a correct structure, and we were unable to locate an HMG box on it. We hypothesized it may have one because its function is similar to PMS1, but we could not confirm this. Since PMS1 is a homodimer and PMS1-MLH1 is a heterodimer with similar function, we hope that further research on this topic will deepen our understanding of the function of PMS1. Below are the sequence for the C-terminal end of PMS1 and the sequence for MLH1 that we used in YASARA.

>PMS1 7.31.18

VSAADIVLSKTAETDVLFNKVESSGKNYSNVDTSVIPFQNDMHNDESGKNTDDCLNHQIS

IGDFGYGHCSSEISNIDKNTKNAFQDISMSNVSWENSQTEYSKTCFISSVKHTQSENGNK

DHIDESGENEEEAGLENSSEISADEWSRGNILKNSVGENIEPVKILVPEKSLPCKVSNNN

YPIPEQMNLNEDSCNKKSNVIDNKSGKVTAYDLLSNRVIKKPMSASALFVQDHRPQFLIE

NPKTSLEDATLQIEELWKTLSEEEKLKYEEKATKDLERYNSQMKRAIEQESQMSLKDGRK

KIKPTSAWNLAQKHKLKTSLSNQPKLDELLQSQIEKRRSQNIKMVQIPFSMKNLKINFKK

QNKVDLEEKDEPCLIHNLRFPDAWLMTSKTEVMLLNPYRVEEALLFKRLLENHKLPAEPL

EKPIMLTESLFNGSHYLDVLYKMTADDQRYSGSTYLSDPRLTANGFKIKLIPGVSITENY

LEIEGMANCLPFYGVADLKEILNAILNRNAKEVYECRPRKVISYLEGEAVRLSRQLPMYL

SKEDIQDIIYRMKHQFGNEIKECVHGRPFFHHLTYLPETT

>NP_000240.1 DNA mismatch repair protein Mlh1 isoform 1 [Homo sapiens]

MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIVKEGGLKLIQIQDNGTGIRK

EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPK

PCAGNQGTQITVEDLFYNIATRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNA

STVDNIRSIFGNAVSRELIEIGCEDKTLAFKMNGYISNANYSVKKCIFLLFINHRLVESTSLRKAIETVY

AAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLGSNSSRMYFTQTLLP

GLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSSQPQAIVTEDKTDIS

SGRARQQDEEMLELPAPAEVAAKNQSLEGDTTKGTSEMSEKRGPTSSNPRKRHREDSDVEMVEDDSRKEM

TAACTPRRRIINLTSVLSLQEEINEQGHEVLREMLHNHSFVGCVNPQWALAQHQTKLYLLNTTKLSEELF

YQILIYDFANFGVLRLSEPAPLFDLAMLALDSPESGWTEEDGPKEGLAEYIVEFLKKKAEMLADYFSLEI

DEEGNLIGLPLLIDNYVPPLEGLPIFILRLATEVNWDEEKECFESLSKECAMFYSIRKQYISEESTLSGQ

QSEVPGSIPNSWKWTVEHIVYKALRSHILPPKHFTEDGNILQLANLPDLYKVFERC

Table 2: YASARA homology modeling parameters for missing residues 353-932

• Modeling speed (slow = best): Slow
• Number of PSI-BLAST iterations in template search (PsiBLASTs): 3
• Maximum allowed (PSI-)BLAST E-value to consider template (EValue Max): 0.1
• Maximum number of templates to be used (Templates Total): 15
• Maximum number of templates with same sequence (Templates SameSeq): 1
• Maximum oligomerization state (OligoState): 4 (tetrameric)
• Maximum number of alignment variations per template (Alignments): 5
• Maximum number of conformations tried per loop (LoopSamples): 50
• Maximum number of residues added to the termini (TermExtension): 10

Figure 12 is the structure of PMS1, for residues 353-932, as a homodimer

Figure 13 shows PMS1 residues 353-932 as a monomer. PMS1 protein is part of the group of HMG proteins, which include an HMG box.

Figure 14 shows the HMG box for PMS1. The three helices comprising the HMG box are labeled 1, 2, and 3.

BLOSUM Scoring Matrix

Figure 15 shows the BLOSUM scoring matrix.

The BLOSUM scoring matrix is a tool used in bioinformatics research. BLOSUM stands for BLOcks SUbstitution Matrix. As mentioned before, there are twenty main amino acids that can make protein sequences. In this matrix, each of the twenty amino acids is lined up horizontally and vertically against the others. The BLOSUM scoring matrix is used to score amino acid comparisons in protein sequences. Mistakes happen during DNA synthesis, and when these mistakes are not corrected, they lead to mutations. A mutation in a sequence can cause an incorrect amino acid at that position. The matrix gives a score to each amino acid comparison on a scale of negative four to positive eleven.

In ConSurf, amino acids in the PMS1 protein sequence were given a conservation score based on their conservation across evolutionary lineages. I collected these data and focused on the most highly conserved amino acids, those given a score of either 8 or 9, in residues 353-932 of PMS1. Amino acids scored 8 and 9 are the most important to the structure and function of the protein because they are found to be the same amino acid at the same position across evolutionary comparisons. Graph 5 shows the comparison of the BLOSUM scores for these positions. There were 54 amino acids that had a BLOSUM score of 0 or higher, which means that the change in amino acid was less likely to be harmful. There were 15 amino acids in the sequence with a BLOSUM score of -1, a score that might indicate a harmful change. There were 23 amino acids in the sequence with a low BLOSUM score of -4, -3, or -2. Such a score is suggestive of a change that causes structural instability, functional change, or both.
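The scoring and grouping used here can be sketched as a table lookup followed by sorting each score into one of three groups. The Python below is a hedged illustration: it assumes the BLOSUM62 version of the matrix (the text does not say which version was used), it includes only a handful of entries from that matrix, and the list of substitutions is invented rather than taken from our data.

```python
from collections import Counter

# A few entries from the BLOSUM62 matrix (assumed version); keys are sorted pairs.
BLOSUM62_EXCERPT = {
    ("D", "E"): 2, ("I", "L"): 2, ("K", "R"): 2, ("S", "T"): 1,
    ("A", "R"): -1, ("D", "L"): -4, ("G", "I"): -4,
}


def blosum_score(original, variant):
    """Look up the score for swapping one amino acid for another."""
    return BLOSUM62_EXCERPT[tuple(sorted((original, variant)))]


def score_group(score):
    """Sort a score into the three groups used in the comparison."""
    if score >= 0:
        return "0+"
    if score == -1:
        return "-1"
    return "-4 to -2"


# Invented substitutions (original amino acid, variant amino acid).
substitutions = [("D", "E"), ("L", "I"), ("R", "A"), ("L", "D"), ("I", "G")]

counts = Counter(score_group(blosum_score(a, b)) for a, b in substitutions)
print(dict(counts))
```

Applied to real substitution data, the three group totals are what a bar chart of this comparison would display; the totals reported above, 54, 15, and 23, add up to 92 positions.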

Having the BLOSUM scores of the amino acids in the PMS1 protein indicates that we were successful in the structure we used. YASARA gives the best model based on template structures, which are validated by molecular dynamics and the BLOSUM scoring matrix, so the most important part of this research was to be able to interpret the data. YASARA does not give answers, but it gives options. The researcher’s job is to understand what she is looking for, understand what she receives, and understand what is missing. The large number of positive scores on the BLOSUM matrix gives this research the validation needed because amino acid functionality was the foundation for understanding the results.

The graph below shows a comparison of the BLOSUM scores for the positions with conservation scores of 8 and 9 in residues 353-932 of PMS1. Out of these 92 amino acids, 54 had a score of 0 or higher. The GNOMAD data validates the ConSurf data. Human variants are constrained by evolution to allow mostly conservative changes.

[Bar chart: Comparison of BLOSUM Scores for Conservation Scores of 8 and 9. Horizontal axis categories: -4, -3, -2; -1; 0+. Vertical axis scale: 0 to 60.]

Graph 5 shows the comparison of BLOSUM scores between residues 353-932 of PMS1.

COSMIC Results

Figure 16 shows a graph of the percentage of different types of mutations that occur in PMS1 (Catalogue Of Somatic Mutations In Cancer).

This figure is important to note because mutations in the PMS1 protein can lead to cancer. Missense substitution mutations make up almost three-fourths of the mutations that occur in PMS1. A missense substitution is a base change that puts a different amino acid into the protein. PMS1 is a mismatch repair protein, and it is not able to do its job correctly when it carries such a mutation, so a mistake in base pairing that it fails to correct becomes a mutation. Substitution is the most common type of mutation overall. A synonymous substitution does not change the amino acid sequence, which makes it harmless to the structure. Missense mutations are commonly seen in cancer cells because they disturb the structure. Since missense substitution mutations make up almost 75% of the mutations in PMS1, we can conclude that this gene is involved in cancer. Since PMS1 is involved in MMR, mutations in PMS1 cause genome instability, which is a hallmark of cancer.

Human Protein Atlas Results

Figure 17 shows where PMS1 is located in the cell. PMS1 is an MMR protein, so we hypothesized it would be found in the nucleus, where DNA is replicated (Human Protein Atlas).

Using the Human Protein Atlas, we were able to find where PMS1 is highly expressed. PMS1 shows high RNA expression in the testes, a reproductive organ that undergoes meiosis. Only the male reproductive organ shows high expression for PMS1. This is because the testes are constantly undergoing mitosis and meiosis to produce sperm. The female reproductive organs show lower levels of expression for this protein because females are born with a set number of eggs that they keep into adulthood, and they do not undergo more meiosis. We know that almost every tissue in the body has base pairing that requires correction during DNA replication; therefore MMR proteins like PMS1 are required in tissues where mistakes take place. It is not clear why colon cancer is the main risk when PMS1 is expressed in all tissues that use mitosis. We hypothesize that other cancer risks may not have been found yet.

Graph 6 shows where PMS1 protein is expressed in parts of the body.

Conclusion

To summarize this research on the PMS1 protein, we used the SSFP technique. This stands for sequence, structure, function, and phenotype. The main goal is to understand how a mutation in an amino acid sequence changes how the protein structure is formed, which affects the function of the protein, which can affect the expression of the phenotype. By finding a correct structure for PMS1, we were able to assess the stability of the protein and its conservation as well. In the medical profession today, most patients are treated based on symptoms and disease. However, the field of medicine is slowly changing. In the future, preventative medicine will become a larger part of the medical field and this research will contribute to this. Now that we have completed the steps of SSFP and addressed the medical consequences of a mutation in this protein, future scientists will be able to use this work to locate this sequence in a patient’s genome. Knowing that a patient has a PMS1 mutation in their genome would mean they may be susceptible to colon cancer. Further research will be able to determine other health risks a mutation in the PMS1 protein may cause. Such a patient could be treated preventatively to avoid these future health issues with possible lifestyle changes, and doctors would also be able to monitor this person specifically for colon cancer and related conditions. Medical consequences of PMS1 mutations, including colon cancer, are hereditary, so monitoring family members of a patient with this genome variant will be a future form of genomic medicine. Knowing how to prevent a medical consequence, such as cancer, will be helpful in providing the best medical care possible.

This research allowed us to know what changes in the PMS1 protein mean and what the risks are to a patient with this genomic variation. We modelled and understood the three-dimensional structure so that it is available to future scientists. By analyzing the ConSurf report we were able to determine which amino acids are important to this structure. This allows future medical providers to know what they need to look for in patients that may have this variant.

The PMS1 protein is involved in DNA MMR. Due to its name (post meiotic), we hypothesized it would be present in cells in testes and ovaries. It also is expressed in cells that divide quickly, like cells in the digestive or reproductive systems. When a mutation occurs in the PMS1 protein, it is unable to do its job of mismatch repair, resulting in genomic instability and possibly resulting in cancer. Support from the literature review shows that mutations in the PMS1 protein are associated with colon cancer.

In the future it will be important to know about this topic of genome variants when treating a patient with medical consequences caused by a mutation or defect in the PMS1 protein. A patient’s genome will affect how successful and responsive a medication or treatment will be to the patient’s condition.

Works Cited

1. Addie, Siobhan, et al. Applying an Implementation Science Approach to Genomic Medicine: Workshop Summary. National Academies Press, 2016. EBSCOhost.

2. Benhabiles, Hana, et al. "Optimized Approach for the Identification of Highly Efficient Correctors of Nonsense Mutations in Human Diseases." Plos ONE, vol. 12, no. 11, 13 Nov. 2017, pp. 1-23. EBSCOhost, doi:10.1371/journal.pone.0187930

3. Freeland, T.M., Guyer, R.B., Ling, A.Z., and Deering, R.A. (1996). Apurinic/apyrimidinic (AP) endonuclease from Dictyostelium discoideum: cloning, nucleotide sequence, and induction by sublethal levels of DNA damaging agents. Nucleic Acids Research 24, 1950-1953.

4. Freeland, T.M., and R.A. Deering (1992). Bleomycin-induced single- and double-strand breaks in extrachromosomal, mitochondrial, and nuclear main band DNA in Dictyostelium discoideum. Special meeting of the American Association for Cancer Research, "Chemicals, Mutations, and Cancer," December 1992, Banff, Alberta, Canada.

5. Freeland, T.M., Guyer, R.B., and Deering, R.A. (1994). Exploring the induction of DNA repair in Dictyostelium discoideum by mRNA analysis, cloning, and PCR. International Dictyostelium Conference, August 1994, University of California, San Diego.

6. Freeland, Thomas M. “Pharmacogenetics and Pharmacogenomics,” Ch. 6 in Craig and Stitzel’s Illustrated Pharmacology, 7th Ed., Leah Fink, Karen Woodfork, and Elizabeth Davis, Editors. Jaypee Brothers, 2018.

7. Guarné, Alba, and Jean-Baptiste Charbonnier. “Insights from a Decade of Biophysical Studies on MutL: Roles in Strand Discrimination and Mismatch Removal.” Progress in Biophysics and Molecular Biology, vol. 117, no. 2-3, 2015, pp. 149–156., doi:10.1016/j.pbiomolbio.2015.02.002.

8. Haricharan, Svasti, et al. “Loss of MutL Disrupts CHK2-Dependent Cell-Cycle Control through CDK4/6 to Promote Intrinsic Endocrine Therapy Resistance in Primary Breast Cancer.” Cancer Discovery, vol. 7, no. 10, Nov. 2017, pp. 1168–1183., doi:10.1158/2159-8290.cd-16-1179.

9. Hong, Eun Pyo, and Ji Wan Park. “Sample Size and Statistical Power Calculation in Genetic Association Studies.” Genomics & Informatics, vol. 10, no. 2, 17 May 2012, p. 117., doi:10.5808/gi.2012.10.2.117.

10. Jubb, Harry C., et al. "Mutations at Protein-Protein Interfaces: Small Changes over Big Surfaces Have Large Impacts on Human Health." Progress in Biophysics & Molecular Biology, vol. 128, Sept. 2017, pp. 3-13. EBSCOhost, doi:10.1016/j.pbiomolbio.2016.10.002

11. Kolodner, Richard D. “A Personal Historical View of DNA Mismatch Repair with an Emphasis on Eukaryotic DNA Mismatch Repair.” DNA repair 38 (2016): 3–13. PMC. Web. 3 May 2018.

12. Kondo, E. “The Interacting Domains of Three MutL Heterodimers in Man: hMLH1 Interacts with 36 Homologous Amino Acid Residues within hMLH3, hPMS1 and hPMS2.” Nucleic Acids Research, vol. 29, no. 8, 2001, pp. 1695–1702., doi:10.1093/nar/29.8.1695.

13. Lin, Zhenguo, Masatoshi Nei, and Hong Ma. “The Origins and Early Evolution of DNA Mismatch Repair Genes—multiple Horizontal Gene Transfers and Co-Evolution.” Nucleic Acids Research 35.22 (2007): 7591–7603. PMC. Web. 3 May 2018.

14. Silva, Felipe Carneiro Da, et al. “Clinical and Molecular Characterization of Brazilian Patients Suspected to Have Lynch Syndrome.” Plos One, vol. 10, no. 10, May 2015, doi:10.1371/journal.pone.0139753.

15. Smith, Catherine E., et al. “Dominant Mutations in S. Cerevisiae PMS1 Identify the Mlh1-Pms1 Endonuclease Active Site and an Exonuclease 1-Independent Mismatch Repair Pathway.” PLoS Genetics, vol. 9, no. 10, 2013, doi:10.1371/journal.pgen.1003869.

16. Yan, Wen-Yue, et al. “Prediction of Biological Behavior and Prognosis of Colorectal Cancer Patients by Tumor MSI/MMR in the Chinese Population.” OncoTargets and Therapy, Volume 9, 2016, pp. 7415–7424., doi:10.2147/ott.s117089.

17. Zhao, Lihua. “Mismatch Repair Protein Expression in Patients with Stage II and III Sporadic Colorectal Cancer.” Oncology Letters, 2018, doi:10.3892/ol.2018.8337.