ABSTRACT

Hereditary ataxias are complex, rare autosomal recessive diseases that receive limited funding and public attention. While it is known that the most common types of hereditary ataxia are each caused by mutations in a single gene, the extent to which molecular pathways, like DNA repair, are changed remains largely unknown. In order to learn more about how the five main DNA repair mechanisms are altered in the disease state and what changes are common across hereditary ataxias, four microarray datasets representing Friedreich’s Ataxia, Ataxia Telangiectasia, and Spinocerebellar Ataxia Type 2 were obtained from NCBI. Using R and three Bioconductor annotation packages, the fold change level of each probe was calculated and mapped to corresponding gene symbols. Discriminative motif finding was performed on promoter regions of genes of interest to identify motifs representing possible transcription factor binding sites. In order to understand the interactions of each DNA repair pathway, the STRING database tool was employed, and the connections established here were combined with all other results to produce an informative network image for each DNA repair pathway. Our findings showed that DNA repair mechanisms in each form of ataxia shared three similarities, but that each disease had unique differences that may have implications for the differences in disease presentation. In all ataxias investigated in this study, OGG1 and RAD50 are underexpressed, while PMS1 is overexpressed. Furthermore, each ataxia has at least one form of DNA ligase that is underexpressed, which likely hinders the ability to fully fix DNA breakage. These results have the potential to be used by future researchers as targets for therapy or in the development of diagnostic tests. We conclude that there is shared differential expression of key DNA repair genes among hereditary ataxias, and these similarities may help us understand why the presentation of these diseases is so similar.

AUTHOR SUMMARY

Orphan diseases, including hereditary ataxia, are so rare that they are not often studied. As a result, very little is understood about the underlying changes in the biological mechanism. The rarity of these diseases also makes quick, affordable, and direct diagnosis difficult. Many hereditary ataxias have an age of onset during childhood, and the average time to obtain a correct diagnosis is approximately 18 years. My hope for this study was to shed light on the differences between the ataxic and healthy DNA repair pathways, as well as any similarities between ataxias, in order to uncover any informative changes that could be researched further to develop possible therapies or to simplify the diagnosis of these diseases. If nothing else, my research aims to spread awareness and generate public interest in these diseases so that they may gain more funding to promote future research and to provide hope to the families who are stricken with these diseases.

This research found that the genes in the repair pathways of the disease state were generally underexpressed compared to wildtype repair pathways, and that there are similarities in the dysregulation of certain DNA repair genes across all three forms of ataxia investigated in this study.

ACKNOWLEDGEMENTS

The motivation for my research of ataxia stems from learning about the local McCollister family that has two children affected by a rare hereditary ataxia called Ataxia Oculomotor Apraxia Type 1 (AOA1). Their story, strength, and hope inspired me to learn more about hereditary ataxia, with a goal of uncovering new information about the disease, as well as spreading awareness of orphan diseases, like AOA1. It was my desire to find similarities and differences between ataxias that could be used by researchers as targets for future studies so that families like the McCollisters will have more options to heal their children and improve their quality of life.

I would like to thank the McCollister family for their support and insight; Veronica Liang, Robert Schmidt, and Rami Al-Ouran for their expertise regarding bioinformatic techniques; Lorie LaPierre for her contribution to the biological relevance of my findings; Lonnie Welch for being my thesis mentor; and Soichi Tanda for his constant encouragement and guidance throughout the last four years. Without the help of these men and women, this research would not have been possible.

TABLE OF CONTENTS

Introduction

I. Central Dogma

II. DNA Repair and Disease

III. What is Ataxia?

IV. DNA Repair

V. Hereditary Ataxia

VI. What is Bioinformatics?

VII. Ataxia and Bioinformatics

VIII. Gathering of Data

IX. R Techniques

X. DNA Microarray

XI. Motif Finding

XII. STRING Database

XIII. Methodology Pipeline

XIV. Network Map Generation

Materials and Methods

Results

Discussion

I. Hypothesis One

II. Hypothesis Two

Conclusions

Significance of Work

Future Directions

Limitations

Bibliography

INTRODUCTION

All forms of life are built from blueprints, which come in the form of genetic material. For most organisms, this material is deoxyribonucleic acid, commonly known as DNA. If one thinks of DNA as letters, genes would be considered the words made up of these letters. The genome is the story of a person, strung together with the genetic words. However, these stories are not read from left to right and start to finish, as one normally reads. The reading of our individual genetic stories is controlled by genetic regulatory networks, which dictate when, where, and for how long a gene will be turned on. When these networks do not work correctly, the result can be genetic disease.

I. Central Dogma

A pivotal concept in molecular biology is the central dogma, which details how genetic information is used within a biological system. This idea was proposed by Francis Crick and states that DNA is used as the template for RNA, which is translated into a series of amino acids that fold to become proteins.[1] When studying mutations in proteins, it is most important to examine this process of protein creation starting from the source, DNA.

II. DNA Repair and Disease

Mistakes in DNA replication occur frequently, and every person has them in their genome. Mistakes are made at an approximate rate of 1 incorrect nucleotide in every 100,000 nucleotides synthesized. This may seem like a small number, but there are over 6,000,000,000 base pairs, or roughly 12,000,000,000 nucleotides, making up the genome in a single diploid cell, each of which is synthesized when a cell divides. This means that around 120,000 errors are made in the DNA every time a cell replicates.[2] These errors are called mutations if they go unfixed. Most people never develop a disease due to these mutations, but there are mutations that can be highly deleterious and lead to disease.
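The arithmetic behind this estimate can be sketched in a few lines. The example below is written in Python purely for illustration; the constant and function names are hypothetical, and it assumes that the 6 billion figure counts base pairs, so that about 12 billion nucleotides are synthesized each time a diploid cell divides.

```python
# Rough check of the replication-error estimate quoted above.
# Assumption: the genome size is counted in base pairs, two nucleotides each.
BASE_PAIRS_PER_DIPLOID_CELL = 6_000_000_000
NUCLEOTIDES_PER_BASE_PAIR = 2
NUCLEOTIDES_PER_ERROR = 100_000  # about one wrong nucleotide per 100,000 synthesized


def expected_errors(base_pairs: int, nucleotides_per_error: int) -> int:
    """Expected number of misincorporated nucleotides in one round of replication."""
    nucleotides_synthesized = base_pairs * NUCLEOTIDES_PER_BASE_PAIR
    return nucleotides_synthesized // nucleotides_per_error


errors_per_division = expected_errors(BASE_PAIRS_PER_DIPLOID_CELL, NUCLEOTIDES_PER_ERROR)
print(f"{errors_per_division:,} expected errors per cell division")
```

Under these assumptions the calculation gives 120,000 expected errors per division, which matches the figure cited above.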

Because the life of the organism can depend on the DNA being replicated and maintained properly, several biological mechanisms exist to fix DNA mistakes. These are known as DNA repair pathways, of which there are five.[3] The first of these is the base excision repair (BER) pathway, which removes a single, non-bulky damaged or incorrect base from single-stranded DNA that could hinder the structural integrity of the strand during replication. The nucleotide excision repair (NER) pathway fixes single-stranded DNA by removing bases damaged by ultraviolet light. UV damage causes bulky lesions, and the NER pathway detects these, removes the damaged bases, and replaces them with the correct ones. The third mechanism to repair single-stranded DNA is the mismatch repair (MMR) pathway. The MMR pathway is responsible for detecting and fixing incorrect insertions and deletions along the newly synthesized DNA strand. Homologous recombination (HR) is a pathway that fixes errors in double-stranded DNA. HR fixes the DNA by using the undamaged sister chromatid to correct the mistakes. Non-homologous end-joining (NHEJ) is a pathway that is designed to ligate the damaged ends of double-stranded DNA molecules. These five repair pathways are crucial to maintaining a functioning, healthy organism.

What if the genes involved in these critical repair pathways are mutated themselves? There are two outcomes, depending on the time the mutation occurs. If the gene is mutated later in life, it is possible that the organism will develop cancer, as there are key tumor suppressor genes, like TP53 and BRCA1, in the DNA repair pathways. Should the organism be born with the mutation, it runs the risk not only of cancer but of additional diseases. When an organism is born with a deleterious inheritable mutation, its ailment is known as a hereditary disease, one that is passed from the parents to their offspring.[4] These genetic diseases are congenital and can be dominant or recessive: a dominant disorder will always present itself if the mutation is present, while a recessive one will only occur if two copies of the mutation exist. Additionally, some diseases will not manifest until later in life, while others are apparent at birth; the time of manifestation is known as the age of onset.

III. What is Ataxia?

The literal Greek meaning of ataxia translates to “without coordination”.[5] Simply put, those who suffer from ataxia lose the ability to coordinate their muscle movements. This loss can be widespread or limited to specific regions of the body like the fingers, toes, limbs, or eye muscles. Cerebellar ataxia refers specifically to ataxia arising in the cerebellum, the brain region that coordinates muscle movement and the region most commonly affected in this disease. Ataxia can be a symptom of other diseases, such as multiple sclerosis or a stroke. This makes diagnosing ataxia difficult, as it could be the manifestation of an underlying disease, or the core problem. This can cause patients to be misdiagnosed by the most seasoned of doctors, adding undue hardship to their lives. Ataxia can also be a stand-alone disease, and can be either inherited or sporadic. Sporadic ataxia refers to the occurrence of the disease in a person who has no family history of the disorder. Generally, it occurs later in life, and if the disease is categorized as ‘cerebellar plus’, meaning that the patient experiences cerebellar ataxia along with neuropathy or dementia, the disease will often progress more rapidly. In hereditary ataxias, the common symptoms at onset include a lack of balance and a marked increase in clumsiness. Soon after, the ability to walk becomes hindered, due to poor balance, and patients are often confined to a wheelchair. As the disease progresses, patients can lose their ability to speak coherently and experience issues with respiration and peristalsis, as well as the hallmark symptoms of the particular ataxia they have. Whether the disease is genetically based or occurs sporadically, ataxia results in a marked decrease in the quality of life for those affected by it. For both types of ataxia, there are limited treatment and therapy options for patients, resulting in frustrating medical ventures.

Figure 1. CT Scan of Cerebellar Atrophy. The cerebellum in both pictures is shown inside the red circle. The right picture shows the brain of a non-affected person with a normal cerebellum. The left picture shows the cerebellum of a patient who is affected by Friedreich’s ataxia. Note the shrunken cerebellum and reduction of white matter, an indication of reduced neuronal communication. (Figure from [6])

IV. DNA Repair

As previously mentioned, organisms have developed mechanisms to detect and fix mutations in the DNA, known as DNA repair pathways. Each pathway consists of several interacting genes that rectify the errors in genetic material. DNA repair pathways can be split into two categories: single strand break repair (SSBR) and double strand break repair (DSBR). The single strand repair pathways are BER, NER, and MMR, while the double strand repair pathways are HR and NHEJ (see Table 1).

Table 1. Essential Genes of the 5 Major DNA Repair Mechanisms. (Table from [7])

DNA repair pathways work to identify the specific kinds of damage that they are meant to respond to, and in this sense, they are self-contained. Each pathway can identify errors, bind to the DNA, excise the inappropriate bases, insert the correct DNA sequences, and ligate, or reseal, the DNA strand. The cooperation and behavior of all proteins in these pathways are essential for proper DNA repair.[7] A loss of function in a single protein can result in the loss of function for the pathway as a whole. In fact, dysregulation of and mutations in these pathways are often associated with disease, including cancer and genetic disorders.

V. Hereditary Ataxias

In the world of genetic disorders, there are some that occur so infrequently that those affected by these severe diseases have extremely limited access to treatment and receive very little public awareness. This type of disease is referred to as an orphan disease. The exact parameters of what constitutes an orphan disease are disputed, but the Rare Diseases Act of 2002[8] qualifies an orphan disease as one that affects 1 in every 1,000 people, while other health plans only consider those with a low prevalence of 1 in 200,000 people. However, all agree that many of these orphan diseases are severe and lead to a lower quality of life for those affected. Ataxia is a rare disorder, affecting only 18.5 people in every 100,000.[9] An even smaller subset consists of those affected by hereditary ataxia. While there are several different kinds of hereditary ataxia, the three that are most frequently studied are Friedreich Ataxia (FA), Ataxia Telangiectasia (A-T), and Spinocerebellar Ataxia (SCA). While inherited ataxia is rare, these three are the most prevalent, thus receiving the majority of the research focus. As the rarity of a disease increases, less research is conducted on it. Such is the case with Ataxia Oculomotor Apraxia Type 1 (AOA1), another hereditary ataxia that reportedly afflicts fewer than two dozen families worldwide. Due to its minuscule presence in the population, the only research that has been done on AOA1 establishes the phenotype and analyzes the biochemistry of the disease; none addresses treatment. Currently, the only available way to help patients manage the disease is physical therapy, which provides few benefits, as the disease worsens over time, making continued therapy and improvement impossible. While both phenotypic and biochemical studies are vital to create base knowledge in the field, they do very little for those who are diagnosed with AOA1. As such, the public repositories contain no genomic sequences of AOA1 patients, due to the lack of research conducted. However, genomic data from those with Friedreich Ataxia, Ataxia Telangiectasia, or Spinocerebellar Ataxia do reside in these repositories and are available to be studied. Because these ataxias are hereditary diseases, each with a single mutated gene, there could be valuable information on disease presentation in the rest of their genomes that could be used to create treatments or therapies for ataxia.

Even though these four ataxias are separate entities, there are many overlapping symptoms between the diseases. For AOA1, FA, and A-T, the age of onset is in early childhood, between 5 and 15 years of age. SCA’s typical onset is at around 30 years of age, but it has been documented in children under 15.[10] The earliest symptoms appear as severe clumsiness in children who had no prior issues with motor coordination. They develop an unsteady gait, followed by uncoordinated movements of the arms and torso.[11] The most obvious of these shared symptoms is the loss of motor control, known by the same name as the disease: ataxia. Ataxia is defined as the hindrance of voluntary muscle movement, and it is often difficult to diagnose because young children are often clumsy and lack full control of their movements. However, the ataxia becomes readily identifiable when a child who used to run around can no longer walk short distances without trouble. Most patients diagnosed with ataxia become wheelchair bound less than 10 years after the ataxia becomes apparent. The ataxic symptoms found in these four types of ataxia all result from neuropathy, which is essentially the disintegration of neurons. In normal individuals, the peripheral axons relay information between the brain and the body, allowing the brain to direct movements in response to various sensory inputs. In axonal peripheral neuropathy, these long axons die off, severing the communicative ties between the brain and the body. This is the reason why multiple sclerosis patients can also exhibit ataxia as a symptom of their disease. In the case of SCA, the sensory neurons are also affected, leading to feelings of tingling or pain in the extremities.[12]

In AOA1 patients, this neuropathy begins with the absence of reflexes, known as areflexia, and then rapidly and steadily erodes motor control further, additionally causing a lack of sensation in the extremities.[13] Before neuropathy becomes apparent, AOA1 patients can usually use crutches to move around, but are later confined to a wheelchair. Due to the immobilization caused by axonal peripheral neuropathy, muscles swiftly begin to weaken and waste away.[11] This progression is mirrored in FA, A-T, and SCA. The average life span for an FA patient is 37.5 years, though patients have been known to die much earlier or to live into their 70s, and they can have severe cardiac complications, which are another possible symptom of FA.[14] Those affected with A-T generally survive to 25 years of age and into middle adulthood, and they frequently experience immunodeficiency and abnormal endocrine functions.[15, 16] SCA patients have the most variable life expectancy, with those who are diagnosed young dying before they reach their 30s and others living well into their late 60s.[17] Patients with AOA1 can survive anywhere from 15 to 60 years following onset.[11] Regardless of the type of hereditary ataxia a patient is diagnosed with, the age at death runs the gamut from young to old, depending on the rapidity and severity of an individual’s progression. Those afflicted with hereditary ataxia all suffer from a greatly diminished quality of life.

While the symptoms of these diseases all appear very similar, their root causes share no true overlap. FA, A-T, and AOA1 are autosomal recessive, meaning that both parents must be carriers of the disorder and that, for the child to be afflicted, they must inherit the mutation from both mother and father. SCA is autosomal dominant, meaning that only one parent needs to have the mutation for their children to inherit it. FA is caused by a GAA expansion mutation in the frataxin (FXN) gene on chromosome 9, primarily causing mitochondrial dysfunction because the mutation disrupts the regulation of iron transport and respiration in the mitochondria. FXN is a key protein-coding gene in the mitochondrial iron-sulfur cluster biogenesis pathway, the mitochondrial protein import pathway, the HIF-2-alpha transcription factor network, and metabolism.[18] A-T is the result of a mutation in the ataxia telangiectasia mutated (ATM) gene, affecting the double-stranded DNA break repair pathway.[16] SCA has over 60 different forms, with different mutated alleles in the ATXN genes. For example, SCA Type 1 patients have between 6 and 44 CAG repeats in the ataxin 1 (ATXN1) gene, causing issues in the single strand break repair pathway,[19] while SCA Type 2 patients have 33 or more CAG repeats in the ataxin 2 (ATXN2) gene.[17] There are over 30 different ataxin proteins, and mutations in these or other genes are what lead to the various forms of SCA.[20] AOA1 is caused by mutations in the aprataxin (APTX) gene, which is key in orchestrating the base excision repair and single strand break repair pathways.[11] Each of the four hereditary ataxias is caused by pathological mutations in a single gene.

No two ataxic diseases share the same mutations, but there is some overlap between diseases in terms of affected pathways and disease presentation. In some manner, the mutated gene of each disease affects the DNA repair pathways. The mutated APTX of AOA1 is directly involved in the BER pathway, and this mutation has profound effects on the ability of the pathway to perform, causing the disease. The function of the ATXN genes, which are mutated in SCA, is completely unknown.[20] However, there is a form of SCA, known as spinocerebellar ataxia with neuropathy type 1, that exhibits a mutation in tyrosyl-DNA phosphodiesterase 1 (TDP1). In the BER pathway, TDP1 hydrolyzes the 3’-phosphodiester bonds of several different forms of covalent adducts.[21] Because of the close relation of the different types of SCA, other forms of SCA may have BER deficiencies as well. The ATM gene affected in A-T acts as a checkpoint gene for several genes that are known to be in both the BER and HR repair pathways, including TP53, BRCA1, and NBN.[22] The causative agent of FA, the FXN gene, is not directly involved with DNA repair and its function is not fully understood, but it is believed to be involved with the formation of iron-sulfur clusters.[23] Cells of FA patients experience higher levels of oxidative stress, which may impact both the genes of the DNA repair pathway and cerebellar atrophy. It is thought that the mutation causes mitochondrial dysfunction, which I believe will reduce the availability of adenosine triphosphate (ATP) for any DNA repair proteins that require energy to function. While the cause of the cerebellar atrophy seen in ataxia patients has not been discovered yet, it is possible that this region of the brain is more sensitive to DNA repair mutations. Without correctly working DNA repair mechanisms, mutations can accumulate. When a cell accrues a massive number of mutations, it can no longer function properly and will signal apoptosis, or cell death, in the cerebellar neurons. The death of these neurons will result in the wasting away of the cerebellar tissue, the atrophy.

It was my hope to find differential expression levels of genes that were shared across the board, and to determine whether the ataxias that did not have mutations in DNA repair pathways still had an effect on the genes in those pathways. Due to AOA1’s rarity, no relevant microarray experiments have been conducted. Microarray experiments determine a gene’s level of expression, which indicates how frequently it has been transcribed to create the protein it codes for. In order to shed more light on this ataxia, I had to look at commonalities between ataxias to gain insight. This approach provides a way to find new genes that could be targets for creating diagnostic criteria or even developing possible therapies for ataxia patients. It was not feasible for me to conduct my own microarray experiments because of limitations in funds, machinery, time, and available human samples. In order to obtain truly relevant data on AOA1, brain tissue would have been required, jeopardizing the safety and wellbeing of participants.

VI. What is Bioinformatics?

At this point, the bulk of the research on hereditary ataxia has been done through observation of disease progression and by analyzing the protein deficiencies in patients or mouse models of the disease. Biological researchers have gathered the data from these experiments, but the vast majority of their results cannot be interpreted visually immediately after the data has been obtained. The datasets, like long sequences of the genome, are overwhelming in their size and detail, and it would take months for a researcher to make sense of all the information by hand. In order to glean as much knowledge from the data as possible, many researchers are turning to a budding field known as bioinformatics. Bioinformatics is a field that utilizes the techniques and technology of computer programming to answer biological questions. A computer program does the majority of the grunt work efficiently and accurately, while researchers are tasked with analyzing the output of these tools. The main goal of bioinformatics is to provide more insight into the underlying biological processes using computational techniques. These approaches include recognizing patterns in DNA or amino acid sequences, data mining to discover patterns in datasets, machine learning algorithms, and visualization of results.[24] Using these techniques allows researchers to find previously unidentified genes, predict protein structure, align both gene and protein sequences to allow for easy comparison, predict gene expression levels, model evolution and population dynamics, and infer protein-protein interactions.
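As a small illustration of what recognizing patterns in DNA sequences can mean in practice, the sketch below counts every substring of length k (a k-mer) across a set of sequences and reports the most frequent ones. It is written in Python for illustration only; the sequences are made up, the function name is hypothetical, and real motif-finding tools rely on far more sophisticated statistical models than simple counting.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping substring of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence shorter than k contributes nothing (the range is empty).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Made-up sequences for illustration only; these are not real promoter regions.
toy_sequences = ["ACGTGACGT", "TTACGTGA", "GGACGTCC"]
kmer_counts = count_kmers(toy_sequences, k=4)
print(kmer_counts.most_common(3))
```

In this toy input the 4-mer ACGT appears in every sequence, which is the kind of recurring pattern a researcher might then examine more closely.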

Using bioinformatic techniques as opposed to traditional biological experiments provides a host of benefits. One of the largest is the ability to analyze enormous datasets easily and efficiently. Much of the data gathered through microarray analysis is not interpretable by a human researcher. The data gathered in these studies are simply probes mapped to a level of expression. Matching the probe identification number to its corresponding gene by hand would take months, and the risk of mismatching due to human error is high. By using computer programs to normalize, match probes, and explore the data, researchers can avoid the issue of human error, maximize efficiency, and significantly reduce data processing time from weeks or months to a matter of minutes. One important aspect of the world of computational biology is that bioinformaticians create tools to analyze experimental data. There are a slew of tools, most of which are freely available to the public, designed to answer specific scientific questions or to perform a certain function. These tools can find patterns between DNA sequences, determine repeating elements known as motifs in a string of characters, and generate informative visualizations, among many other capabilities. Bioinformatic researchers often string together tools and processing methods to form a pipeline. Pipelines allow for easy repeatability and understanding of how the data is digested in order to produce the researcher’s end results. The use of tools and scripts is the heart of bioinformatics, as they enable scientists to mine data and glean meaningful information from strings of letters or numbers.
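The probe-matching step described above can be sketched as follows. This is not the R code used in this study; it is a minimal Python illustration in which the probe identifiers, gene symbols, and expression values are all made-up placeholders, and in which fold change is assumed to mean the log2 ratio of mean disease expression to mean control expression. In practice the probe-to-gene lookup would come from an annotation package for the specific microarray platform, and several probes may map to the same gene.

```python
import math
from statistics import mean

# Hypothetical probe-to-gene annotation (placeholder IDs and symbols).
probe_to_gene = {
    "probe_001": "GENE_A",
    "probe_002": "GENE_B",
    "probe_003": "GENE_C",
}

# Made-up expression intensities for two control and two disease samples.
control = {
    "probe_001": [200.0, 200.0],
    "probe_002": [400.0, 400.0],
    "probe_003": [100.0, 100.0],
    "probe_999": [50.0, 50.0],  # probe with no annotation; it is ignored
}
disease = {
    "probe_001": [100.0, 100.0],
    "probe_002": [100.0, 100.0],
    "probe_003": [200.0, 200.0],
    "probe_999": [50.0, 50.0],
}


def log2_fold_changes(control, disease, annotation):
    """Return {gene symbol: log2(mean disease / mean control)} for annotated probes."""
    fold_changes = {}
    for probe, gene in annotation.items():
        if probe not in control or probe not in disease:
            continue  # skip probes that were not measured in both groups
        ratio = mean(disease[probe]) / mean(control[probe])
        fold_changes[gene] = math.log2(ratio)
    return fold_changes


fold_changes = log2_fold_changes(control, disease, probe_to_gene)
for gene, value in sorted(fold_changes.items()):
    if value > 0:
        direction = "overexpressed"
    elif value < 0:
        direction = "underexpressed"
    else:
        direction = "unchanged"
    print(f"{gene}: log2 fold change = {value:+.1f} ({direction})")
```

A negative value marks a gene as underexpressed in the disease samples and a positive value as overexpressed, which is the distinction the pathway comparisons rely on.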

VII. Ataxia and Bioinformatics

It is becoming a common practice for complex and difficult-to-diagnose disorders to be identified using next-generation sequencing.[25] Next-generation sequencing is a method of high-throughput DNA sequencing that allows for millions of sequences to be generated simultaneously, as opposed to sequencing the genome as a single sequence. This method greatly reduces the amount of time needed to obtain a subject’s full genome. Next-generation sequencing is becoming useful because the process of identifying an uncommon disease using conventional methods is often lengthy and convoluted, requiring patients to undergo numerous costly and sometimes invasive tests. Even after this extensive testing, patients may still be left without a diagnosis or even misdiagnosed. The delay in diagnosis presents a serious problem, because patients who are diagnosed promptly can benefit from physical therapy and have the option to participate in experimental studies if they so choose. Using bioinformatic techniques to diagnose hereditary ataxias is becoming more common because of this method’s ability to detect specific mutated genes in the DNA, removing the guesswork in diagnosing a patient. Nemeth and her colleagues found that the amount of time between the onset of ataxia and the correct diagnosis ranged from three to thirty-five years, with an average time of eighteen years.[25] Of the fifty patients in their study, they were able to correctly diagnose nine. While this may seem small, their results were still significant, and many of their patients had novel pathogenic variants that limited the ability to compare their mutations to the established causes of ataxia. Nonetheless, known mutations resulting in hereditary ataxias can become targets for screening, allowing for a diagnostic strategy that is quick, relatively inexpensive, efficient, and accurate. Nemeth alludes to the fact that there could be gene mutations that result in ataxia that we are currently unaware of. My research could be used in tandem with this sort of study to add more genes as possible screening targets for disease identification. In doing so, patients can receive a fast and accurate diagnosis. For example, an AOA1 patient and their doctors would know the exact mutations in their genome. This would allow researchers and health care providers to work together to determine if there are any therapies available to the patient to alleviate some of their ataxia symptoms. Without the bioinformatic technique of next-generation sequencing, this would not be possible.

VIII. Gathering of Data

During the study of hereditary ataxias, several key tools and scripts were employed to interpret microarray data. Before any investigation can begin, it is necessary to procure the data. The National Center for Biotechnology Information, or NCBI, serves as a resource for scientists to store and obtain data in order to study molecular and genomic subjects. It is their mission to “develop, distribute, support, and coordinate access to a variety of databases and software for the scientific and medical communities”[26] with the hope of fostering and increasing the knowledge base for the benefit of society. Many researchers do not have the equipment or funds needed to carry out experiments in their own labs, and NCBI provides an avenue to relevant datasets that researchers can analyze to test their hypotheses. Additionally, NCBI can be used to gather test data to ensure that a tool someone has created is functioning as it should. The datasets used in Unraveling the Nexus were all obtained through NCBI by searching the database for the relevant microarray experiments. After four datasets were obtained to represent the three most common hereditary ataxias, the data needed to be processed. There were two human A-T datasets, one human FA dataset, and one mouse SCA dataset. The mouse dataset for SCA was one of a very limited number of datasets related to what I wanted to study, and mice and humans share extremely similar versions of the genes, proteins, and organ systems, known as homologs,[27] making it possible to use this dataset. My research methods generate genetic maps for each DNA repair pathway for each dataset, which illuminate the relationships between genes in each pathway at the molecular level in the diseases I studied. The maps needed to incorporate the protein-protein interactions between elements in the pathways, transcription factors, motifs representing possible binding sites, and the level of gene expression. These genetic maps provide a succinct way to compare differences between ataxias, any key transcription factors, and any commonalities among all ataxias. In order to create these images, many vital tools, packages, and scripts were necessary.
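One way to picture the information such a map has to hold is as a small graph in which each gene carries its expression change and motifs, and edges record interactions. The sketch below is a hypothetical Python illustration, not the representation actually used to draw the network images in this work; the class names, gene symbols, transcription factor, and values are all placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class GeneNode:
    """One gene in a pathway map, with its expression change and promoter motifs."""
    symbol: str
    log2_fold_change: float = 0.0
    motifs: list = field(default_factory=list)

    def status(self) -> str:
        if self.log2_fold_change > 0:
            return "overexpressed"
        if self.log2_fold_change < 0:
            return "underexpressed"
        return "unchanged"


@dataclass
class PathwayMap:
    """Genes, protein-protein interactions, and regulators for one repair pathway."""
    name: str
    genes: dict = field(default_factory=dict)          # symbol -> GeneNode
    interactions: set = field(default_factory=set)     # unordered pairs of symbols
    regulators: dict = field(default_factory=dict)     # factor -> set of target symbols

    def add_gene(self, node: GeneNode) -> None:
        self.genes[node.symbol] = node

    def add_interaction(self, a: str, b: str) -> None:
        if a not in self.genes or b not in self.genes:
            raise KeyError("both genes must be added before linking them")
        self.interactions.add(frozenset((a, b)))

    def add_regulator(self, factor: str, target: str) -> None:
        self.regulators.setdefault(factor, set()).add(target)

    def neighbors(self, symbol: str) -> list:
        """Return the sorted symbols of genes that interact with the given gene."""
        return sorted(
            other
            for pair in self.interactions
            if symbol in pair
            for other in pair
            if other != symbol
        )


# Placeholder content for illustration only.
pathway = PathwayMap(name="example pathway")
pathway.add_gene(GeneNode("GENE_A", -1.0, motifs=["ACGTGA"]))
pathway.add_gene(GeneNode("GENE_B", 1.5))
pathway.add_gene(GeneNode("GENE_C", 0.0))
pathway.add_interaction("GENE_A", "GENE_B")
pathway.add_interaction("GENE_A", "GENE_C")
pathway.add_regulator("TF_X", "GENE_A")

for symbol, node in pathway.genes.items():
    print(symbol, node.status(), pathway.neighbors(symbol))
```

Keeping expression, motifs, and interactions on the same structure means one map per pathway and dataset can be compared side by side, which is the purpose the maps serve here.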

IX. R Techniques

In order to begin processing the data, a highly versatile computer language was needed. The R programming language and software environment was selected for its statistical computing abilities and the plethora of packages that can be installed to analyze certain types of data. R provides “an integrated suite of software facilities for data manipulation, calculation, and graphical display, … [providing an] environment within which statistical techniques are implemented”.[28] In order to maximize R’s capabilities, packages from the Bioconductor project were installed. Bioconductor is an open-source, collaborative effort meant to lessen the learning curve for those interested in using computational biology and bioinformatic techniques in their research.[29] Bioconductor practices several software engineering strategies in order to create a solid software infrastructure that can support the research of others.

The first strategy is known as designing by contract. This uses the metaphor of a business contract to create a program, first outlining what the program is expected to do, as well as defining all variables, data types and conditions necessary for the program to function.[30] The next approach is to employ object-oriented programming to create reusable software that can be applied to a variety of types of data, which divides the programming into two main areas: objects and methods. Objects hold information related to a piece of data (like data type, size, and other parameters), while methods manipulate the objects as described by the functions. Using object-oriented programming allows for easier communication between program developers and users, simplifies maintenance of programs, and enables the code to evolve as the requirements set by the user change over time.[31] A key strategy of Bioconductor is modularization, where a software module is specifically designed to work on its own and in tandem with other modules so that users can reap the benefits of multiple tools simultaneously, as opposed to having to repeat certain processing steps. Data structures, R functions and R packages are all modularized to ensure seamless interactions and operation.

In order to make sense of all the code, Bioconductor requires in-depth and precise documentation, known as multiscale and executable documentation. Multiscale documentation means that both small- and large-scale documentation are implemented. Small-scale documentation focuses on the specific behavior of a single function or a cluster of related functions. In order to make interacting modules easier to understand, Bioconductor uses a novel concept known as a vignette for large-scale documentation, which discusses the steps in which the data must be processed to carry out a specific operation. A vignette can describe the cooperation between functions or entire packages to perform a task. Keeping careful documentation maintains the reproducibility of any research conducted using Bioconductor. The final key strategy is automated software distribution. This refers to Bioconductor’s ability to update any packages as changes are made, ensuring that the user is granted access to the newest version of programs, including fixed bugs and added functionalities. These five main concepts are the crux of Bioconductor software creation, and they maintain the standards set by researchers to carry out reliable, repeatable, informative exploration of datasets.

X. DNA Microarray

In order to create a DNA microarray, a machine deposits thousands of short DNA sequences onto a single slide, known as the chip.[32] When a gene is turned on, corresponding mRNA of that gene is generated as the gene segments are transcribed. Because mRNA is the blueprint for protein creation, it gives an estimate of how much gene product is being created in the cell. To measure expression levels, the mRNA molecules are extracted and copied by reverse transcriptase. This creates cDNA that corresponds to the specific mRNA molecule it is derived from. Fluorescent nucleotides are affixed to the cDNA, which will hybridize with the short DNA sequences on the microarray chip and fluoresce when bound. Highly overexpressed genes will light up very brightly, while underexpressed genes will only fluoresce dimly, if at all. The expression levels are quantified and recorded with their specific probe ID so that the gene can be distinguished later on. For my research, I used basic R scripts to normalize data, retrieve the expression levels, and partner those expression levels with probe IDs, along with three Bioconductor annotation packages. Annotation packages connect manufacturer probe IDs on the microarray chip to the real gene names and symbols. Once the genes were identified and organized with their fold change levels, the first stage of my pipeline was complete.
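The fold change and annotation step described above can be sketched as follows. The study itself used R scripts and Bioconductor annotation packages; this Python sketch is only an illustration and is not the code used in the study. It assumes that fold change is the ratio of mean disease signal to mean wildtype signal, and the probe IDs, expression values, and probe-to-symbol table are all hypothetical.

```python
from statistics import mean


def fold_change(disease_values, control_values):
    """Assumed definition: mean disease signal divided by mean wildtype signal."""
    return mean(disease_values) / mean(control_values)


def annotate(probe_fold_changes, probe_to_symbol):
    """Group per-probe fold changes under gene symbols.

    Probes without a symbol in the lookup table are dropped, and a gene
    measured by several probes keeps one fold change per probe.
    """
    by_gene = {}
    for probe_id, fc in probe_fold_changes.items():
        symbol = probe_to_symbol.get(probe_id)
        if symbol is not None:
            by_gene.setdefault(symbol, []).append(fc)
    return by_gene


# Hypothetical normalized expression values, keyed by made-up probe IDs.
expression = {
    "probe_a": {"disease": [90.0, 110.0], "control": [200.0, 200.0]},
    "probe_b": {"disease": [300.0, 300.0], "control": [150.0, 150.0]},
    "probe_c": {"disease": [50.0, 50.0], "control": [50.0, 50.0]},
}
# Hypothetical annotation table; probe_c has no mapped gene symbol.
probe_to_symbol = {"probe_a": "RAD50", "probe_b": "PMS1"}

probe_fc = {p: fold_change(v["disease"], v["control"]) for p, v in expression.items()}
gene_fc = annotate(probe_fc, probe_to_symbol)
```

Keeping one value per probe, rather than averaging probes of the same gene, mirrors the per-probe rows reported in the results tables.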

The lists of genes in each DNA repair pathway had to be compiled next. The gene list of BER included thirty-two genes, the gene list of NER included twenty-eight genes, the gene list of MMR included thirteen genes, the gene list of HR included twenty-one genes, and the gene list of NHEJ included eleven genes. In total, there were one hundred and five genes incorporated into my study of ataxia.

To compile these lists, I extensively reviewed the human DNA repair literature and included genes that other researchers had identified. In addition to genes found in articles, genes found in the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database were also incorporated.[33] The KEGG pathway database is an online resource that contains information on hundreds of biological pathways in a variety of organisms, including the DNA repair pathways of Homo sapiens. KEGG has schematic diagrams for each of the five DNA repair pathways in my study, illustrating all involved proteins, as shown below. It includes the pathway for both prokaryotes and eukaryotes (like mammals) so that they may be compared side-by-side. These diagrams contain established protein interactions for each DNA repair pathway, and any genes that were found in these diagrams were added to the gene lists created from the literature. Illustrations like Figure 2 can provide biological insight, but do not show all the intricate relationships that exist between the genes in the pathways.

Figure 2. Base Excision Repair Pathways Illustration from KEGG Pathway Database. This schematic representation was useful for understanding the genes involved in the pathway, along with protein complexes that are formed during the DNA repair process. (Figure from [33]).

XI. Motif Finding

The gene lists were extremely important, as they determined which expression values I would use in my research and which promoter sequences would be searched for motifs. Motifs are short patterns found repeated in DNA sequences. Motifs can be indicative of binding sites for transcription factors that modulate the expression of genes and the functionality of proteins. Once I had established a list of genes that comprised each pathway, I used Ensembl to gather the promoter regions. Ensembl is a genome browser created through a collaboration between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. The entire human genome, along with the genomes of several other species, is freely available to the public. For my purposes, I chose a length of 1000 base pairs upstream of the start site for each gene. While transcription factors are known to bind throughout the length of a DNA sequence, many of those that alter the expression of a gene do so by binding in the promoter region. I collected the start site, the end site, and the chromosome on which that gene is located for later use. With this data, I now had input for motif discovery tools.

There are several different varieties of motif discovery tools, but I chose the Discriminating Matrix Enumerator (DME) tool to generate a collection of significant motifs. DME discovers the most discriminative motifs in a set of foreground genomic sequences. DME is similar to other motif discovery tools: it produces several possible motifs and reports a concise list of interesting or meaningful motifs. Multiple EM for Motif Elicitation (MEME) was a second motif finding tool that I used to visualize the locations of the motifs found in the promoter regions. I implemented a set of sequence coverage algorithms that choose the most significant motifs based on their sequence coverage across the foreground sequences. Sequence coverage refers to how much of a DNA sequence is covered by the motifs, in this case the promoter regions of my genes of interest. In this project the Greedy algorithm was used to select the most significant motifs.[34] These motifs can later be compared to the ChIP-Seq database. The ChIP-Seq database collects the results of chromatin immunoprecipitation (ChIP) experiments, which analyze how proteins like transcription factors interact with DNA. The ChIP-Seq database was explored to see if the motifs matched the binding sites of any known transcription factors. If a motif was associated with a protein, that protein was included in the network map, connected to the gene whose promoter it interacted with.
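The coverage-based selection described above can be illustrated with a generic greedy set-cover sketch in Python. This is not the algorithm of [34] or the code used in the study; it only assumes that each motif is scored by the promoter positions it covers and that, at each step, the motif adding the most uncovered positions is chosen. The motif strings, gene labels, positions, and the `max_motifs` parameter are hypothetical.

```python
def greedy_select(motif_coverage, max_motifs):
    """Pick motifs one at a time, each time taking the motif that covers
    the most positions not yet covered by earlier picks.

    motif_coverage maps a motif to a set of (sequence_id, position) pairs.
    Ties are broken alphabetically so the result is deterministic.
    """
    covered = set()
    chosen = []
    remaining = dict(motif_coverage)
    while remaining and len(chosen) < max_motifs:
        best = max(sorted(remaining), key=lambda m: len(remaining[m] - covered))
        if not remaining[best] - covered:
            break  # no remaining motif adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical motif hits in three promoter sequences.
coverage = {
    "ACGT": {("g1", 0), ("g1", 1), ("g2", 5)},
    "TTGA": {("g1", 1), ("g3", 2)},
    "GGCC": {("g2", 5)},
}
chosen, covered = greedy_select(coverage, max_motifs=3)
```

In this toy input, "GGCC" is never selected because the position it covers is already covered by "ACGT", which is the behavior that keeps the reported motif list concise.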

XII. STRING Database

Arguably, the most important tool that I used is the STRING database. STRING is an acronym for ‘Search Tool for the Retrieval of Interacting Genes/Proteins’. STRING generates an interconnected web, where proteins are nodes and the connections between nodes are edges.[35] STRING can incorporate information from numerous types of sources to give extremely comprehensive networks. Sources include previous knowledge from research papers (where information about interactions is collected through text mining), the KEGG database, high-throughput experiments, conserved coexpression, and genomic context. After obtaining the interaction network for each DNA repair pathway, I downloaded each network as an XML file that can be used by Perl scripts that draw the networks. While previous studies used a similar technique with information from the KEGG pathway database, the KEGG database did not have usable DNA repair pathway images from which I could extract the connections to be used in my research. STRING allowed for a work-around, as it includes all the interactions in the KEGG database and also includes other interactions that are not noted in KEGG but are established via research or experimentation. The STRING connections are the basis for my maps, and motifs, transcription factors, and expression level coloring are later incorporated.

Figure 3. Homologous Recombination Gene Network Generated by STRING. Shows all proteins of HR gene list and their interactions. Blue edges represent physical binding between two proteins; black edges represent interactions that result in some type of reaction between proteins (kinase activity, catalysis, etc).

Figure 4. Non-Homologous End-Joining Gene Network Generated by STRING. Shows all proteins from NHEJ gene list and their interactions. Blue edges represent physical binding between two proteins; black edges represent interactions that result in some type of reaction between proteins (kinase activity, catalysis, etc).

Figure 5. Mismatch Repair Gene Network Generated by STRING. Shows all proteins from MMR gene list and their interactions. Blue edges represent physical binding between two proteins; black edges represent interactions that result in some type of reaction between proteins (kinase activity, catalysis, etc).

Figure 6. Base Excision Repair Gene Network Generated by STRING. Shows all proteins from BER gene list and their interactions. Blue edges represent physical binding between two proteins; black edges represent interactions that result in some type of reaction between proteins (kinase activity, catalysis, etc).

Figure 7. Nucleotide Excision Repair Gene Network Generated by STRING. Shows all proteins from NER gene list and their interactions. Blue edges represent physical binding between two proteins; black edges represent interactions that result in some type of reaction between proteins (kinase activity, catalysis, etc).

A STRING interaction network was generated for each one of the five DNA repair pathways, using the gene lists for each pathway as input, shown in figures 3 through 7. Regardless of the form of ataxia I was investigating, these DNA repair pathways remained constant. The only changes were the levels of expression of the genes in those pathways. The connections from these gene networks were extracted using the DrawNetwork.pl Perl script.

XIII. Methodology Pipeline

Figure 8 – Methodology Flowchart for Generation of Images. Shows how the data flows and is processed in order to get the end results.

XIV. Network Map Generation

The STRING interactions are extracted from the XML file using a script designed to make images of protein networks. That data is then combined with the binding sites of motifs, the transcription factors associated with those motifs, and whether a gene is upregulated, downregulated, or at the same expression level when compared to the wildtype. These maps provide an easy way to compare and contrast the same DNA repair pathways between FA, A-T and SCA. The underlying intricacies of a complex pathway can be quickly evaluated by interpreting its genetic map. If the edge between two genes is red, it represents a STRING connection that was extracted. If the edge is black, it represents the interaction between a gene and a transcription factor. When a node in the map is a pentagon, it represents a gene. When the node is a rectangle, it represents a transcription factor. Overexpressed elements are shown in red, while underexpressed elements are shown in green. Finding differences in a single pathway may have implications for the differential presentations of these ataxias, while discovering similarities may suggest genes that are responsible for the large number of similarities between the different forms of hereditary ataxia. Identifying informative genes, whether unique to one ataxia or shared among them, provides targets for future studies on hereditary ataxias. Using the maps generated from my pipeline, I found a few key genes that are altered in the same manner between the diseases. The first gene is RAD50, which was consistently underexpressed in all of the ataxia datasets. The second is DCLRE1C, which was underexpressed in all of the diseases. The third and fourth are OGG1 and MSH4, both of which were overexpressed.
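The drawing conventions described in this section can be summarized in a short Python sketch. It is not the DrawNetwork.pl script; it only encodes the stated rules (red gene-gene edges, black transcription factor edges, pentagon and rectangle nodes, red and green expression coloring). The example edges, the transcription factor name `TF_X`, and the gray color for unchanged or unmeasured nodes are assumptions made for illustration; the fold change values are FA values for one probe of each gene.

```python
def expression_color(fold_change):
    """Red for overexpressed, green for underexpressed.

    Gray for unchanged or unmeasured nodes is an assumption; the text does
    not state which color those nodes receive.
    """
    if fold_change is None or fold_change == 1:
        return "gray"
    return "red" if fold_change > 1 else "green"


def build_map(string_edges, tf_links, fold_changes):
    """Combine gene-gene edges, transcription factor links, and expression."""
    nodes, edges = {}, []
    for gene_a, gene_b in string_edges:
        for gene in (gene_a, gene_b):
            nodes[gene] = {"shape": "pentagon",
                           "color": expression_color(fold_changes.get(gene))}
        edges.append((gene_a, gene_b, "red"))      # STRING connection
    for tf, gene in tf_links:
        nodes[tf] = {"shape": "rectangle",
                     "color": expression_color(fold_changes.get(tf))}
        nodes.setdefault(gene, {"shape": "pentagon",
                                "color": expression_color(fold_changes.get(gene))})
        edges.append((tf, gene, "black"))          # gene / transcription factor link
    return nodes, edges


# Illustrative inputs: edges and the TF name are hypothetical.
string_edges = [("RAD50", "MRE11A"), ("RAD50", "NBN")]
tf_links = [("TF_X", "RAD50")]
fold_changes = {"RAD50": 0.995, "MRE11A": 1.053, "NBN": 0.996}
nodes, edges = build_map(string_edges, tf_links, fold_changes)
```

Because the pathway edges are the same for every dataset, only the `fold_changes` argument would change from one disease map to the next.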

Gene Name  AT1   AT2   FA    SCA
RAD50      down  down  down  down
OGG1       up    up    up    up
DCLRE1C    NA    down  down  down
MSH4       up    same  up    up

Table 2. Gene Expression Similarities Between Ataxias.

MATERIALS AND METHODS

My research was conducted using a variety of bioinformatic tools and techniques. Datasets were obtained through NCBI. In order to select the datasets used in this study, it was vital to look at all available Gene Expression Omnibus (GEO) DataSets in all organisms that might have been pertinent to my research, and this query returned 36 different datasets. From this list of 36, I narrowed the datasets to be used down to four. Each of the four datasets needed to be a microarray analysis, provide both raw data (CEL files) and processed data, and have a wildtype condition in its experimental design. Originally, I wanted all of the datasets to be from humans, but this was not possible, as no human SCA experiments fit my criteria. I also wanted to have two datasets for each disease, but there was no second dataset for FA that fit all criteria. As a result, I wound up with two human A-T datasets, one human FA dataset, and one mouse SCA dataset. Accession numbers for those experiments are as follows: E-TABM-321 for the first A-T dataset, GSE45030 for the second A-T dataset, GSE5040 for the FA dataset, and GSE39640 for the SCA dataset. After the data were gathered, they were normalized in R. R scripts were also used to determine fold change and p values. Corresponding Bioconductor annotation packages mapped probe IDs to gene symbols.

Promoter sequences for the genes in each list were extracted using Ensembl and were taken as the 1000 base pairs upstream of the transcriptional start site. The discriminative motif discovery tool DME was used to find repeated patterns in these promoter sequences, and sequence coverage analysis determined which of the motifs found were the most interesting. The STRING database tool was used to map the networks as a web of nodes and edges. These connections were later extracted by the DrawNetwork.pl Perl script written by Veronica Liang, which incorporates the expression levels, transcription factors, and motifs into the final network images.
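As an illustration of the promoter window described above, the following Python sketch computes the coordinates of the 1000 base pairs upstream of a gene's start site. It is not the procedure used in the study, which retrieved the regions through Ensembl. The strand handling is an assumption (on the reverse strand, "upstream" lies at higher coordinates and the transcription start is the gene's end coordinate), coordinates are assumed to be 1-based and inclusive, and the example gene coordinates are hypothetical.

```python
def promoter_window(chrom, start, end, strand, length=1000):
    """Return (chrom, window_start, window_end) for the upstream window.

    Assumes 1-based inclusive coordinates with start <= end. No clipping
    against the chromosome length is done for reverse-strand genes.
    """
    if strand == "+":
        tss = start                      # transcription begins at the lower coordinate
        return chrom, max(1, tss - length), tss - 1
    if strand == "-":
        tss = end                        # transcription begins at the higher coordinate
        return chrom, tss + 1, tss + length
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene coordinates, used only to show the arithmetic.
forward = promoter_window("chr5", 5000, 9000, "+")
reverse = promoter_window("chr5", 5000, 9000, "-")
```

Both windows are exactly 1000 bases long and stop one base short of the start site, so the promoter sequence never overlaps the transcribed region.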

RESULTS

This study found that the five DNA repair pathways present differential expression in each form of ataxia when compared to the wildtype. The general trend was that genes were underexpressed compared to the wildtype pathways, with some exceptions. I felt that underexpression had more potential to cause disease symptoms in the case of DNA repair pathway genes compared to overexpressed genes. Lacking a sufficient amount of a protein involved in fixing DNA may impact the number of mutations left in the DNA, while having too much of a protein that helps to fix those mistakes would most likely not impact repair pathway function. Table 2 shows the list of DNA repair genes that share similar expression levels across all ataxias. OGG1, MSH4 and PMS1 are overexpressed in all ataxias, while RAD50, LIG1, LIG3, LIG4, DCLRE1C, XRCC4 and TP53 are all underexpressed.

Gene Symbol (Pathway)  AT1             AT2             FA              SCA
RAD50 (HR)             0.987 (0.0671)  0.952 (0.0528)  0.995 (0.6921)  0.996 (0.7202)
OGG1 (BER)             1.021 (0.8495)  1.002 (0.7497)  1.044 (0.8391)  1.02 (0.2655)
PMS1 (MMR)             1.054 (0.3809)  1.003 (0.2348)  3.94 (0.4049)   1.015 (0.4439)
LIG4 (NHEJ)            0.992 (0.4835)  0.925 (0.5249)  1.147 (0.234)   1 (0.3618)
LIG3 (BER)             NA              1.061 (0.156)   0.256 (0.1886)  1.015 (0.1677)
LIG1 (BER, NER, MMR)   0.965 (0.0305)  1.101 (0.1564)  0.92 (0.0879)   0.986 (0.1095)
DCLRE1C (NHEJ)         NA              0.929 (0.2028)  0.921 (0.5891)  0.973 (0.936)
MSH4 (MMR)             1 (1.000)       1.062 (0.0863)  1.084 (0.4226)  1.073 (0.3266)
XRCC4 (BER, NHEJ)      NA              0.907 (0.3415)  0.918 (0.8004)  NA
TP53 (BER)             0.989 (0.0121)  0.951 (0.0536)  0.919 (0.5564)  NA

Table 3. Shared Expression. List of genes that display similar expression levels. A fold change less than 1 indicates that the gene is underexpressed, while a fold change greater than 1 indicates that the gene is overexpressed. ‘NA’ signifies that there were no probes on the microarray chip that mapped to that gene, so no expression values could be determined. Corresponding p values are in parentheses next to the fold change values.
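The reading rule for Table 3 can be expressed as a small Python sketch that labels each fold change and checks whether the datasets with a measurement agree. This is an illustration rather than the analysis code of the study; ignoring ‘NA’ entries when testing for agreement is an assumption. The fold change values are copied from three rows of Table 3.

```python
def direction(fold_change):
    """Label a fold change as 'down' (< 1), 'up' (> 1), or 'same' (= 1)."""
    if fold_change > 1:
        return "up"
    if fold_change < 1:
        return "down"
    return "same"


def shared_direction(row):
    """Return the common direction if every measured dataset agrees, else None.

    Datasets with no probe for the gene (None) are skipped, which is an
    assumption about how 'NA' entries should be treated.
    """
    labels = {direction(fc) for fc in row.values() if fc is not None}
    return labels.pop() if len(labels) == 1 else None


# Fold changes taken from Table 3 (None stands for NA).
table3 = {
    "RAD50":   {"AT1": 0.987, "AT2": 0.952, "FA": 0.995, "SCA": 0.996},
    "OGG1":    {"AT1": 1.021, "AT2": 1.002, "FA": 1.044, "SCA": 1.02},
    "DCLRE1C": {"AT1": None,  "AT2": 0.929, "FA": 0.921, "SCA": 0.973},
}
shared = {gene: shared_direction(row) for gene, row in table3.items()}
```

A gene whose row mixes values above and below 1 would return `None` under this strict rule, so the rule is more conservative than a summary based on the general trend.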

The fact that these ataxias have common alterations to their DNA repair pathways may be part of the reason for their similar disease presentation and symptoms. Involved in the BER pathway, OGG1 is needed to cleave the glycosidic bond at the site of the error and to break the sugar backbone of the DNA strand so that other machinery can repair the damage.[36] Increased expression of this gene should not cause any issues to the DNA repair mechanisms, as the breakage does not occur unless signaled to do so by other proteins that detect damage. PMS1 is a part of the MMR pathway, and while not much is known about its function, it is believed to form heterodimers with MLH1.[37] The overexpression of this gene is also not likely to have much impact on the functionality of the MMR pathway, as increased PMS1 expression will still permit the formation of MLH1 heterodimers.

Gene Symbol  AT1 Fold Change  AT2 Fold Change  FA Fold Change   SCA Fold Change
PNKP         1.024 (0.4982)   1.045 (0.1816)   0.924 (0.8855)   0.992 (0.8461)
APEX1        1.006 (0.3098)   1.016 (0.6441)   0.996 (0.9442)   1.009 (0.0205)
NEIL1        NA               NA               NA               0.992 (0.7818)
APEX2        NA               1.058 (0.1736)   0.921 (0.3867)   0.994 (0.6769)
APEX2        0.998 (0.1994)   1.021 (0.2341)   1.001 (0.7758)   1.013 (0.8609)
APTX         NA               1.032 (0.1385)   1 (1.000)        1.014 (0.7535)
APTX         NA               1.003 (0.1959)   0.92 (0.6206)    NA
APTX         NA               0.947 (0.4651)   1.002 (0.6191)   NA
DNA2         NA               1.078 (0.9306)   0.918 (0.4809)   0.923 (0.0189)
FEN1         1.021 (0.6843)   1.076 (0.0488)   0.921 (0.0612)   0.988 (0.1264)
FEN1         NA               1.022 (0.515)    0.921 (0.0816)   0.99 (0.1461)
LIG1         0.965 (0.0305)   1.101 (0.1564)   0.92 (0.0879)    0.986 (0.1095)
LIG3         NA               1.061 (0.156)    0.256 (0.1886)   1.015 (0.1677)
LIG3         NA               1.022 (0.0806)   0.999 (0.3721)   1.014 (0.2716)
LIG3         1.017 (0.2003)   1.021 (0.1305)   0.922 (0.4226)   NA
MBD4         NA               0.992 (0.0225)   0.921 (0.4272)   0.998 (0.1688)
MBD4         1.05 (0.9683)    0.986 (0.1269)   0.919 (0.8330)   1.013 (0.2117)
MBD4         NA               0.966 (0.3028)   0.92 (0.9187)    NA
MBD4         NA               0.896 (0.6988)   0.838 (0.9255)   NA
MPG          0.999 (0.7892)   0.947 (0.870)    0.92 (0.7341)    1.001 (0.0976)
MUTYH        1.012 (0.4325)   1.065 (0.8692)   1 (0.8714)       0.993 (0.1080)
NEIL2        NA               1.052 (0.1151)   1.001 (0.7118)   NA
NEIL3        NA               1.059 (0.0046)   1.348 (0.8131)   0.978 (0.1585)
NTHL1        0.994 (0.2351)   1.011 (0.0112)   0.999 (0.5725)   1.036 (0.0902)
NUDT1        0.96 (0.6786)    1.091 (0.4016)   0.919 (0.3178)   NA
OGG1         NA               1.068 (0.0077)   0.998 (0.4263)   0.983 (0.1294)
OGG1         1.021 (0.8495)   1.002 (0.7497)   1.044 (0.8391)   1.02 (0.2655)
OGG1         NA               0.973 (0.7682)   0.884 (1.000)    NA
PARP1        1.009 (0.7385)   1.066 (0.4215)   0.996 (0.2872)   0.981 (0.4787)
PARP2        NA               1.058 (0.0362)   0.998 (0.3341)   0.992 (0.3948)
PARP2        0.985 (0.2463)   1.052 (0.1551)   0.919 (0.4318)   NA
PARP2        NA               1.049 (0.1761)   0.999 (0.6780)   NA
POLB         NA               0.967 (0.1963)   1.001 (0.7911)   0.996 (0.1529)
POLB         0.983 (0.5702)   1.048 (0.8378)   0.996 (1.000)    0.995 (0.1557)
SMUG1        0.959 (0.3879)   1.011 (0.1149)   0.992 (0.4227)   0.962 (0.1969)
SMUG1        NA               0.963 (0.2939)   1 (0.9011)       NA
SMUG1        NA               0.959 (0.3947)   1 (1.000)        NA
TDG          0.955 (0.2576)   1.025 (0.9562)   0.92 (0.4378)    NA
TDP1         NA               1.068 (0.6671)   0.999 (0.2663)   0.985 (0.4415)
TDP1         NA               1.086 (0.0693)   0.919 (0.6844)   NA
UNG          1.06 (0.6639)    1.045 (0.5451)   0.918 (0.8707)   1.009 (0.3723)
XRCC1        1.003 (0.4523)   1.053 (0.6787)   0.995 (0.7094)   1.007 (0.3087)
XRCC4        NA               1.014 (0.2861)   0.915 (0.4327)   1.002 (0.1141)
XRCC4        NA               0.933 (0.3174)   0.917 (0.513)    1.025 (0.9526)
XRCC4        NA               0.907 (0.3415)   0.918 (0.8004)   NA
XRCC4        0.984 (0.4226)   0.868 (0.4676)   0.996 (0.8132)   NA
APLF         NA               0.907 (0.5831)   1.087 (0.4111)   1.012 (0.1246)
TOP1         1.02 (0.6037)    1.066 (0.3958)   0.999 (0.5268)   0.996 (0.0782)
TP53         0.989 (0.0121)   0.951 (0.0536)   0.919 (0.5564)   NA
UBTF         0.98 (0.3018)    0.957 (0.0364)   1 (0.2849)       1.003 (0.3496)
UBTF         NA               0.912 (0.0525)   1 (0.3942)       NA
UBTF         NA               0.85 (0.0896)    0.922 (0.4143)   NA

Table 4. All Base Excision Repair Pathway Genes. Shows the expression levels (fold change) for each gene in BER. The associated p value is shown next to the fold change in italics.

Figure 9A – AT1

Figure 9B – AT2

Figure 9C – FA

Figure 9D – SCA

Figure 9. Base Excision Repair pathway images for all datasets. Upregulated elements are shown in red, downregulated elements are shown in green. 9A is the first of two datasets for Ataxia Telangiectasia. 9B is the second of two datasets for Ataxia Telangiectasia, and shows only underexpression. 9C is the network image for Friedreich’s Ataxia. 9D is the network image for Spinocerebellar Ataxia (type 2). It should be noted that there were multiple probes for some genes, and occasionally these genes were underexpressed in one probe and overexpressed in another.
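The multiple-probe caveat in the caption above can be checked mechanically. The following Python sketch flags genes whose probes disagree in direction within one dataset; it is an illustration and not part of the study's pipeline. Treating a fold change of exactly 1 as neither up nor down is an assumption. The rows are FA fold changes for three genes copied from Table 4.

```python
def conflicting_genes(probe_rows):
    """Return genes that are overexpressed in one probe and underexpressed in another.

    probe_rows is a list of (gene, fold_change) pairs, one per probe.
    Missing values (None) and fold changes of exactly 1 are skipped.
    """
    seen = {}
    for gene, fc in probe_rows:
        if fc is None or fc == 1:
            continue
        seen.setdefault(gene, set()).add("up" if fc > 1 else "down")
    return sorted(gene for gene, labels in seen.items() if len(labels) == 2)


# FA fold changes for selected BER probes, taken from Table 4.
fa_rows = [
    ("OGG1", 0.998), ("OGG1", 1.044), ("OGG1", 0.884),
    ("LIG1", 0.92),
    ("MBD4", 0.921), ("MBD4", 0.919), ("MBD4", 0.92), ("MBD4", 0.838),
]
conflicts = conflicting_genes(fa_rows)
```

Genes flagged this way are the ones for which a single node color in the network image cannot represent every probe.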

Gene Symbol  AT1 Fold Change  AT2 Fold Change  FA Fold Change   SCA Fold Change
XPC          0.924 (0.2352)   1.025 (0.0248)   0.918 (0.4308)   1.002 (0.4651)
RAD23B       1.009 (0.4461)   0.961 (0.1308)   0.918 (0.4231)   1.062 (0.1083)
RAD23A       1.015 (0.5236)   0.966 (0.2837)   0.997 (0.0972)   1.032 (0.0111)
CETN2        0.979 (0.7044)   0.967 (0.4425)   0.919 (0.6293)   1.004 (0.3303)
XPA          1.032 (0.3699)   0.943 (0.1602)   0.997 (0.8170)   1.043 (0.2943)
DDB1         0.993 (0.8241)   1.015 (0.2152)   0.998 (0.2738)   0.999 (0.6661)
DDB2         0.932 (0.1888)   1.07 (0.1123)    0.922 (0.6021)   0.997 (0.2031)
RPA1         1.008 (0.2717)   0.974 (0.2316)   0.92 (0.8280)    0.984 (0.2498)
RPA1         NA               0.99 (0.62304)   1 (0.8323)       0.993 (0.2749)
RPA2         1.024 (0.8251)   0.982 (0.6302)   0.918 (0.9601)   0.991 (0.4009)
ERCC3        1.007 (0.4924)   1.011 (0.8308)   0.92 (0.4214)    0.997 (0.4612)
ERCC2        NA               0.984 (0.0921)   1.004 (0.4166)   0.955 (0.3438)
GTF2H1       0.944 (0.5694)   0.932 (0.0033)   0.999 (0.6519)   1 (0.3767)
GTF2H1       NA               0.847 (0.2138)   0.996 (0.7802)   NA
GTF2H2       NA               0.919 (0.06601)  0.996 (0.7551)   1.016 (0.6508)
GTF2H3       NA               0.872 (0.0707)   0.922 (0.6166)   NA
GTF2H3       0.981 (0.8999)   1.044 (0.2315)   0.999 (0.7021)   1.002 (0.9851)
GTF2H4       0.997 (0.8365)   1.022 (0.8686)   1 (0.5392)       0.983 (0.2451)
GTF2H5       0.986 (0.9983)   0.959 (0.2214)   1.897 (0.4219)   1.004 (0.2069)
CDK7         0.968 (0.5068)   0.948 (0.3252)   0.919 (0.3785)   0.997 (0.1393)
CCNH         0.989 (0.2608)   0.956 (0.2429)   0.921 (0.7418)   0.999 (0.1459)
MNAT1        0.993 (0.3425)   0.963 (0.2968)   0.998 (0.7234)   0.997 (0.3569)
ERCC5        0.994 (0.1609)   0.976 (0.7145)   0.998 (0.2231)   0.986 (0.2497)
ERCC1        1 (0.4671)       0.961 (0.1321)   0.999 (0.3752)   0.99 (0.3470)
ERCC4        1.005 (0.5727)   0.921 (0.2236)   3.864 (0.8776)   0.995 (0.1833)
LIG1         0.965 (0.0305)   1.101 (0.1564)   0.92 (0.0879)    0.986 (0.1095)
ERCC8        0.985 (0.3992)   0.999 (0.0666)   1.061 (0.3899)   0.918 (0.3383)
ERCC8        NA               0.954 (0.1158)   0.839 (0.4226)   NA
ERCC8        NA               0.944 (0.4689)   1.192 (0.8746)   NA
ERCC6        1.005 (0.4226)   1.011 (0.8381)   0.927 (0.3229)   1.054 (0.2783)
ERCC6        NA               0.964 (0.1938)   1 (0.8725)       NA
UVSSA        NA               1.07 (0.2638)    1.089 (0.7221)   1.013 (0.0351)
XAB2         NA               1.054 (0.4305)   0.999 (0.3583)   0.976 (0.2642)
MMS19        0.992 (0.2938)   1.03 (0.1373)    0.999 (0.3144)   0.997 (0.9336)

Table 5. All Nucleotide Excision Repair Pathway Genes. Shows the expression levels (fold change) for each gene in NER. The associated p value is shown next to the fold change in italics.

Figure 10A – AT1

Figure 10B – AT2

Figure 10C – FA

Figure 10D – SCA

Figure 10. Nucleotide Excision Repair pathway images for all datasets. Upregulated elements are shown in red, downregulated elements are shown in green. 10A is the first of two datasets for Ataxia Telangiectasia. 10B is the second of two datasets for Ataxia Telangiectasia, and shows only underexpression. 10C is the network image for Friedreich’s Ataxia. 10D is the network image for Spinocerebellar Ataxia (type 2). It should be noted that there were multiple probes for some genes, and occasionally these genes were underexpressed in one probe and overexpressed in another.

Gene Symbol  AT1 Fold Change  AT2 Fold Change  FA Fold Change   SCA Fold Change
MSH2         1.032 (0.8881)   0.984 (0.8623)   0.921 (0.9319)   0.937 (0.1878)
MSH3         0.99 (0.1771)    1.004 (0.0602)   0.999 (0.2996)   1.055 (0.0944)
MSH3         NA               1 (0.1731)       1 (0.4226)       1.044 (0.5155)
MSH6         NA               1 (0.8824)       0.919 (0.5787)   0.988 (0.1757)
MSH6         1.056 (0.3998)   0.993 (0.0719)   1.007 (0.6855)   NA
MSH6         NA               0.851 (0.1821)   0.917 (0.6935)   NA
MLH1         0.991 (0.2079)   1.05 (0.9991)    0.92 (0.4609)    0.988 (0.4344)
PMS2         NA               0.965 (0.0333)   0.92 (0.6104)    0.972 (0.1629)
MSH4         1 (1.000)        1.062 (0.0863)   1.084 (0.4226)   1.073 (0.3266)
MSH5         NA               NA               NA               1.027 (0.0451)
MLH3         1.013 (0.284)    0.994 (0.0003)   0.996 (0.2345)   1.002 (0.7565)
MLH3         NA               0.983 (0.7140)   0.916 (0.6672)   NA
PMS1         1.054 (0.3809)   1.003 (0.2348)   3.94 (0.4049)    1.015 (0.4439)
RPA1         1.008 (0.2717)   0.99 (0.2316)    0.919 (0.8280)   0.984 (0.2498)
RPA1         NA               0.974 (0.6231)   0.92 (0.832)     0.993 (0.2749)
POLD1        1.014 (0.4993)   1.057 (0.8383)   0.918 (0.2482)   0.97 (0.1837)
LIG1         0.965 (0.0305)   1.101 (0.1564)   0.92 (0.0879)    0.986 (0.1095)

Table 6. All Mismatch Repair Pathway Genes. Shows the expression levels (fold change) for each gene in MMR. The associated p value is shown next to the fold change in italics.

Figure 11A – AT1

Figure 11B – AT2

Figure 11C – FA

Figure 11D – SCA

Figure 11. Mismatch Repair pathway images for all datasets. Upregulated elements are shown in red, downregulated elements are shown in green. 11A is the first of two datasets for Ataxia Telangiectasia. 11B is the second of two datasets for Ataxia Telangiectasia, and shows only underexpression. 11C is the network image for Friedreich’s Ataxia. 11D is the network image for Spinocerebellar Ataxia (type 2). It should be noted that there were multiple probes for some genes, and occasionally these genes were underexpressed in one probe and overexpressed in another.

Gene Symbol  AT1 Fold Change  AT2 Fold Change  FA Fold Change   SCA Fold Change
RAD51        1.043 (0.1275)   1.066 (0.3051)   1.265 (0.7217)   1.002 (0.2590)
RAD51B       1.018 (0.9276)   0.914 (0.0304)   3.904 (0.4213)   0.983 (0.4910)
RAD51D       1 (0.8901)       1.026 (0.8111)   2.474 (0.3491)   1.019 (0.2657)
DMC1         0.989 (1.000)    1.006 (0.8131)   1.097 (0.7907)   1.03 (0.3807)
DMC1         NA               0.973 (0.9252)   1 (0.8415)       0.967 (0.9545)
XRCC2        1.039 (0.6428)   1.059 (0.5942)   0.998 (0.4407)   1.035 (0.2006)
XRCC3        0.998 (0.16404)  1.071 (0.5268)   1 (0.7355)       0.954 (0.0868)
RAD52        0.997 (0.4226)   1.076 (0.3382)   1.121 (0.3383)   0.967 (0.1173)
RAD52        NA               1.062 (0.5773)   1.043 (0.4208)   0.986 (0.3401)
RAD54L       1.01 (0.7346)    1.124 (0.0027)   0.998 (0.5263)   0.979 (0.0961)
RAD54B       NA               NA               NA               0.999 (0.3379)
BRCA1        NA               1.031 (0.0744)   0.999 (0.8841)   NA
BRCA1        1.017 (0.6253)   1.067 (0.0806)   0.995 (0.9653)   0.905 (0.0505)
SHFM1        0.989 (0.999)    1.021 (0.6129)   0.997 (0.9210)   0.999 (0.1121)
RAD50        NA               0.853 (0.008)    0.919 (0.0706)   0.991 (0.1527)
RAD50        0.987 (0.0671)   0.952 (0.0528)   0.995 (0.6921)   0.996 (0.7202)
MRE11A       0.999 (0.4226)   0.997 (0.5111)   1.053 (0.2367)   0.995 (0.1361)
NBN          NA               0.924 (0.0067)   0.996 (0.8638)   1.009 (0.1544)
NBN          0.985 (0.1787)   0.976 (0.3501)   0.919 (0.8816)   NA
NBN          NA               0.878 (0.4457)   0.995 (0.9491)   NA
RBBP8        0.986 (0.0499)   1.01 (0.8515)    0.92 (0.9388)    1 (0.1666)
MUS81        NA               1.034 (0.6883)   0.92 (0.1854)    0.986 (0.1724)
EME1         NA               1.072 (0.1948)   0.92 (0.5787)    1.035 (0.2581)
EME2         NA               1.058 (0.0486)   0.983 (0.4231)   0.934 (0.3056)
SLX1A        NA               NA               NA               NA
SLX1B        NA               NA               NA               0.939 (0.0741)
GEN1         NA               1.038 (0.0905)   0.916 (0.4749)   1.006 (0.0838)

Table 7. All Homologous Recombination Pathway Genes. Shows the expression levels (fold change) for each gene in HR. The associated p value is shown in parentheses next to the fold change.

Figure 12A – AT1

Figure 12B – AT2

Figure 12C – FA

Figure 12D – SCA

Figure 12. Homologous recombination pathway images for all datasets. Pentagons represent proteins, ovals represent transcription factors, and rectangles represent motifs. Upregulated elements are shown in red, downregulated elements are shown in green. 12A is the first of two datasets for Ataxia Telangiectasia, showing only underexpression. 12B is the second of two datasets for Ataxia Telangiectasia, and shows both under- and overexpression. 12C is the network image for Friedreich’s Ataxia. 12D is the network image for Spinocerebellar Ataxia (type 2). It should be noted that there were multiple probes for some genes, and occasionally these genes were underexpressed in one probe and overexpressed in another.

Gene Symbol  AT1 Fold Change  AT2 Fold Change  FA Fold Change   SCA Fold Change
XRCC6        1 (0.5097)       1.039 (0.0682)   0.997 (0.4739)   0.993 (0.6816)
XRCC6        NA               1.034 (0.4477)   0.992 (0.9782)   1.035 (0.7240)
XRCC5        0.996 (0.7695)   1.034 (0.0156)   0.998 (0.3021)   1.009 (0.1306)
XRCC5        NA               0.955 (0.0387)   1.059 (0.3831)   NA
PRKDC        0.978 (0.6684)   1.03 (0.0053)    0.919 (0.4589)   1.019 (0.2907)
PRKDC        NA               0.957 (0.12004)  0.919 (0.5965)   NA
PRKDC        NA               1.113 (0.1247)   3.331 (1.000)    NA
LIG4         0.992 (0.4835)   0.925 (0.5249)   1.147 (0.234)    1 (0.3618)
LIG4         NA               0.882 (0.6077)   0.92 (0.0609)    NA
XRCC4        NA               1.014 (0.2861)   0.915 (0.4327)   1.002 (0.1141)
XRCC4        NA               0.933 (0.3174)   0.917 (0.513)    1.025 (0.9526)
XRCC4        NA               0.907 (0.3415)   0.918 (0.8004)   NA
XRCC4        0.984 (0.4226)   0.868 (0.4676)   0.996 (0.8132)   NA
DCLRE1C      NA               1.014 (0.8997)   0.92 (0.2993)    1 (0.7821)
DCLRE1C      NA               0.929 (0.2028)   0.921 (0.5891)   0.973 (0.936)
DCLRE1C      NA               0.908 (0.1882)   1.003 (0.8967)   NA
NHEJ1        NA               NA               NA               1.044 (0.0108)
RAD50        0.987 (0.0671)   0.952 (0.00803)  0.919 (0.1706)   0.991 (0.6527)
RAD50        NA               0.853 (0.0528)   0.995 (0.9621)   0.996 (0.7202)
POLG         1.004 (0.2718)   1.042 (0.0105)   0.973 (0.1194)   0.993 (0.0746)
POLG         NA               1.013 (0.2288)   1.001 (0.8081)   1.016 (0.2995)
POLM         0.985 (0.9909)   0.957 (0.62502)  1.001 (0.8451)   0.972 (0.3141)
DNTT         NA               1.046 (0.2326)   1.01 (0.4226)    0.969 (0.0963)
DNTT         NA               1.022 (0.3186)   1 (1.000)        1.029 (0.1663)
DNTT         1.004 (1.000)    1.012 (0.77502)  1.005 (1.000)    1.047 (0.2193)

Table 8. All Non-Homologous End-Joining Pathway Genes. Shows the expression levels (fold change) for each gene in NHEJ. The associated p value is shown in parentheses next to the fold change.

Figure 13A – AT1

Figure 13B – AT2

Figure 13C – FA

Figure 13D – SCA

Figure 13. Non-homologous end-joining pathway images for all datasets. Pentagons represent proteins, ovals represent transcription factors, and rectangles represent motifs. Upregulated elements are shown in red; downregulated elements are shown in green. 13A is the first of two datasets for Ataxia Telangiectasia, showing only underexpression. 13B is the second of two datasets for Ataxia Telangiectasia, and shows both under- and overexpression. 13C is the network image for Friedreich’s Ataxia. 13D is the network image for Spinocerebellar Ataxia (type 2). Note that some genes had multiple probes, and occasionally a gene was underexpressed in one probe and overexpressed in another.

The network images generated provide easy and quick comparison between forms of ataxia. Note that Ataxia-Telangiectasia is represented by two separate datasets, as both were found to be relevant to my research. While there was a lot of overlap between the two datasets, they studied different cell types and used different experimental designs.

Figure 9 shows the network maps for the base excision repair pathway, which is responsible for single-strand break repair. Figure 10 shows the network maps for the nucleotide excision repair pathway, which repairs single-strand damage caused by UV radiation. Figure 11 shows the network maps for the mismatch repair pathway, which corrects single-strand errors in newly synthesized DNA.

Figure 12 shows the network maps for the homologous recombination pathway, which is responsible for double-strand break repair. Figure 13 shows the network maps for the non-homologous end-joining pathway, which is also responsible for double-strand break repair. If a gene node shares the same color as a gene node in another pathway, the two genes demonstrate a similar change in expression. A table listing all the genes found in each pathway, with their expression values and p values, precedes each collection of network maps.

DISCUSSION

My research set out to answer two main questions.

I. Hypothesis One

The first question was whether or not the genes in the DNA repair pathways of hereditary ataxias would be differentially expressed when compared to the wildtype.

My hypothesis was that the diseased repair pathways would demonstrate distinct differences in gene expression levels, most frequently in the form of underexpression. When the fold change value of a gene is greater than or less than 1, it indicates differential expression relative to the wildtype expression level. In each pathway and each form of ataxia, there are fold change values that are not equal to 1, indicating variance from the healthy pathway gene expression levels. Just by looking at the coloring of the nodes in Figures 9-13, it is clear that the genes display differential expression and are indeed frequently underexpressed. Thus, the first hypothesis is supported by my data.

II. Hypothesis Two

The second hypothesis was that there would be common differences from the wildtype pathways that would be shared amongst the different forms of ataxia studied.
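One way to check for such shared differences is to classify each gene's fold change by direction in every dataset and keep only the genes whose direction agrees everywhere. The sketch below is a minimal Python illustration, not the R pipeline used in this study. The function names are hypothetical, the values are the first probe row for three genes in Table 8, and p values and additional probes are ignored for simplicity.

```python
# First-probe fold changes for three NHEJ genes, copied from Table 8.
# Using only the first probe row per gene is a simplifying assumption.
FOLD_CHANGES = {
    "AT1": {"RAD50": 0.987, "POLG": 1.004, "XRCC5": 0.996},
    "AT2": {"RAD50": 0.952, "POLG": 1.042, "XRCC5": 1.034},
    "FA":  {"RAD50": 0.919, "POLG": 0.973, "XRCC5": 0.998},
    "SCA": {"RAD50": 0.991, "POLG": 0.993, "XRCC5": 1.009},
}


def direction(fold_change):
    """Classify a fold change relative to the wildtype level of 1."""
    if fold_change > 1:
        return "over"
    if fold_change < 1:
        return "under"
    return "unchanged"


def shared_changes(fold_changes_by_dataset):
    """Return genes that change in the same direction in every dataset."""
    datasets = list(fold_changes_by_dataset.values())
    # Only genes measured in all datasets can be compared.
    common_genes = set(datasets[0]).intersection(*datasets[1:])
    shared = {}
    for gene in sorted(common_genes):
        directions = {direction(dataset[gene]) for dataset in datasets}
        if len(directions) == 1 and "unchanged" not in directions:
            shared[gene] = directions.pop()
    return shared


print(shared_changes(FOLD_CHANGES))
```

With these values, only RAD50 is reported, because it is the only one of the three genes below 1 in all four datasets.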

Table 1 shows the similarities in expression shared among the ataxias. OGG1, PMS1 and MSH4 are overexpressed, while DCLRE1C and RAD50 are underexpressed. These shared traits may point to the reason why hereditary ataxias share many symptoms. Underexpression means that less of the protein is being created in the cells. Because DNA repair is constantly occurring in the body, it can be assumed that the DNA repair proteins must be present in high quantities. Dipping below the necessary protein levels, even slightly, could have a profound effect on the organism’s ability to fix its DNA, due to the large number of mutations generated during DNA replication. The fact that these ataxias have common alterations to their DNA repair pathways may be part of the reason for their similar disease presentation and symptoms. Involved in the BER pathway, OGG1 is needed to cleave the glycosidic bond at the site of the error and to break the sugar backbone of the DNA strand so that other machinery can repair the damage.[36] Increased expression of this gene should not cause any issues for the DNA repair mechanisms, as the breakage does not occur unless signaled by other proteins that detect damage. PMS1 is a part of the MMR pathway, and while not much is known about its function, it is believed to form heterodimers with MLH1.[37] The overexpression of this gene is also unlikely to have much impact on the functionality of the MMR pathway, as increased PMS1 expression will still permit the formation of MLH1 heterodimers. MSH4 is also found in the MMR pathway and is primarily responsible for orchestrating reciprocal recombination during meiosis.[38] An increase in the amount of protein product likely has little effect on ataxia symptoms.

The DCLRE1C gene codes for the protein Artemis. When cells were found to have lower levels of Artemis, they were more sensitive to irradiation by X-rays, demonstrating greater amounts of DNA strand breakage than normal cells.[39] If ataxia patients are Artemis deficient, it can be assumed that they will have problems fixing their broken DNA strands. Without the repair of DNA breakages, cells will undergo apoptosis and die. Massive amounts of axonal and neuronal apoptosis cause the cerebellar atrophy in ataxia patients. This result is supported by similar findings by Morio and Kim.[40] The RAD50 gene is also underexpressed. RAD50 binds to MRE11A and NBS1 to create the MRN complex. This complex binds to breakages on double stranded DNA before homologous recombination begins and is necessary for signaling further response to the damage.[41] With low levels of RAD50, the functionality of the entire MRN complex is jeopardized, which could result in errors in the DNA being sustained because the MRN complex cannot bind to the DNA to signal that damage is present.

Another finding was that all of the diseases had at least one DNA ligase protein (LIG1, LIG3, or LIG4) that is underexpressed. This is one of the most interesting findings. DNA ligase is responsible for sealing the nicks in the repaired DNA. DNA ligation is the final step in DNA repair, and if each disorder is experiencing ligation deficiencies, then regardless of the performance of the rest of the repair machinery, the DNA breakages will never be fixed, resulting in unstable DNA strands that may fall apart. This finding also provides a link between FA and DNA repair: mutated frataxin in the disease results in mitochondrial dysfunction. The mitochondria are largely responsible for the generation of adenosine triphosphate (ATP) to be used by the body. When mitochondria do not create enough ATP, the result is an energy shortage. DNA ligase proteins are ATP-dependent[42], and without enough ATP present, ligase cannot seal the nicks in the DNA strand. The FXN mutation acts in a synergistic manner with the low level of ligase expression (specifically LIG3 in FA), meaning that the cells cannot re-establish the phosphodiester bonds following the replacement of the correct base pairs. Therefore, while FXN is not directly involved in the DNA repair pathways, the mutation of FXN does have an effect on an FA patient’s DNA repair pathways.

Because FA patients have mitochondrial dysfunction, their DNA is exposed to more oxidative stress, which damages the DNA and signals for repair. This causes an increase in activity for the genes in the repair pathways, but the reduced DNA ligase activity (due to a limited amount of ATP from the mitochondrial dysfunction) results in a large number of breakages that cannot be patched together by DNA ligase. The mutations and DNA breaks will cause cell death, specifically in the cerebellum of FA patients, leading to the cerebellar atrophy that causes the ataxia. The underexpression of all these proteins demonstrates that these changes can have dire repercussions for the ability to fix damaged DNA, and all of them contribute to the similar presentation found in each hereditary ataxia.

CONCLUSIONS

When I first set out to conduct my research, I wanted to show that there were differences between the normal and diseased DNA repair pathways, as well as similarities in the gene expression levels of multiple types of hereditary ataxia. The outcome of my research shows exactly this. Genes like OGG1, RAD50, MSH4, and DCLRE1C have expression levels that differ from the wildtype pathways, but share the change in expression amongst all the diseases. Another interesting finding was that all of the diseases had at least one DNA ligase protein (LIG1, LIG3, or LIG4) that was underexpressed. This is one of the most significant findings. DNA ligase is responsible for making sure that the repaired DNA segments are joined to the rest of the DNA. DNA ligation is the final step in DNA repair, and if each disorder is experiencing ligation deficiencies, then regardless of the performance of the rest of the repair machinery, the DNA breakages will never be fixed. Additionally, each disease has at least one interesting difference that it does not share with the other diseases. A-T shows underexpression of aprataxin and PNKP-like factor (APLF, involved in BER) and of mutS homolog 6 (MSH6, involved in MMR), along with overexpression of polynucleotide kinase 3’-phosphatase and apurinic/apyrimidinic endonuclease 2 (PNKP and APEX2, both in BER). The underexpression of APLF and MSH6 may explain the early onset of A-T, as there are multiple pathways that appear to be affected. FA shows extremely low levels of expression for ligase 3 (LIG3, in BER) and extremely high levels of expression for X-ray repair cross-complementing 4 (XRCC4, in NHEJ). SCA demonstrated overexpression of both RAD23A and RAD23B (both involved in NER), with underexpression of the BRCA1 gene (involved in HR). This reduced expression of BRCA1, indicative of reduced functioning of the BRCA1 protein product, may be part of the reason why those affected by SCA also have an increased cancer risk.

SIGNIFICANCE OF WORK

The research that I conducted has the potential to fill several gaps in hereditary ataxia research. Due to the small number of people affected by these diseases, there is very little funding available to conduct thorough research, as it is not a profitable venture for pharmaceutical companies, which is often the case with orphan diseases. While the National Ataxia Foundation works hard to raise both awareness and money for the research of inherited ataxias, it focuses on specific forms of ataxia, like FA, SCA, and A-T, because they are the most prevalent, and does not generally look at hereditary ataxias in conjunction with one another. By studying these three types of ataxia simultaneously, I have provided a venue for easy comparison of their underlying molecular mechanisms. To the best of my knowledge, no other researchers have studied ataxia in the manner that I have: by investigating multiple types of hereditary ataxia via microarray data, using the combination of tools that make up my pipeline. I think that my novel approach has led to several interesting findings, including information on the similarities between different diseases, which may eventually be used to provide relief to those who are affected by extremely rare forms of ataxia, like AOA1, that do not receive much funding. While the numbers of those affected by hereditary ataxia are small, they are not insignificant. The world of orphan diseases is a lonely one. Without a definitive prognosis, parents are flying blind with their children, unsure of what to expect around the corner.[43] If my research can inspire just one other person to study rare ataxia, it will make a difference to those whose lives have been turned upside-down by this ailment. Ultimately, knowledge is the most powerful weapon against disease. By building up a metaphorical armory of information, treatments and preventative measures can be developed. These are genetic diseases, and in order to combat them, it is necessary to fully understand what has been changed at the molecular level. I believe that my research findings can act as a solid foundation upon which others can build.

FUTURE DIRECTIONS

As high-throughput sequencing and microarray analysis become more affordable, it is my hope that a dataset on AOA1 becomes freely available. Should I decide to pursue a career in the study of hereditary ataxia, I would find it worthwhile to invest in a mouse model of AOA1 and compare the levels of expression of all DNA repair genes in the cerebellum, using the same methodology pipeline I employed for this research. It would be interesting to see if AOA1 shares the same expression levels found in the FA, SCA, and A-T datasets. Additionally, it would be interesting to use the same pipeline to study the levels of gene expression of other biological mechanisms in ataxia disease models, such as metabolic pathways, or cell motility pathways to study changes in muscle activity. This research was conducted using a pipeline made up of bioinformatic tools and techniques. The pipeline designed for this research project can be used with any microarray data to study any pathways of interest in any disease. When relevant microarray datasets for forms of rare ataxia, like AOA1, become available, the DNA repair pathways can be quickly analyzed using my process. Additionally, if any other biological pathways seem as though they would provide information on disease presentation, those pathways can be investigated in the same manner used in this paper. It is my hope that this research will pique interest in rare hereditary ataxias and that my results can be used as a jumping-off point for future research projects. By researching hereditary ataxias, it is possible to find a collection of target genes that can be used to help effectively, accurately, and quickly diagnose these diseases, as well as to find possible drug targets for therapeutic or treatment purposes to improve the lives of those affected by AOA1, A-T, FA, or SCA.

LIMITATIONS

This research does have various limitations. Because I did not conduct the microarray experiments myself, the sample sizes, tissue types, microarray chips, and testing conditions are not shared between the different datasets. The limited availability of relevant datasets resulted in my use of these datasets. I assume that the researchers from whom I obtained the data did their best to provide accurate and reliable data. However, monetary barriers and a lack of suitable mouse models for each disease meant that options for relevant datasets were slim.

Given the lack of lab equipment, funding, and suitable mouse models available to me, a truly perfect experimental setup is not achievable at this point in time. Otherwise, each ataxia would have been studied in mice so that cerebellar tissue could be extracted for study, and all of the samples would have been run on the same type of Affymetrix chip. I do not believe that these limitations refute the findings of my study; instead, I believe that they provide a way for my study to be better repeated in the future.

Another limitation of my study is that I reported expression values that do not meet the gold standard for differential expression (a fold change greater than 2.0 for overexpression or less than 0.5 for underexpression), because under that standard the majority of the pathways would have displayed no differential expression. With DNA repair genes, I believe that even small changes can have an impact on how well the molecular machinery can function, as these pathways are employed so frequently that deficiencies of any nature would have a profound effect.
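The effect of the two cut-off choices can be illustrated with a short Python sketch. It is an illustration only and not part of the R pipeline used in this study; the function name is hypothetical, and the values are the AT2 fold changes for the non-homologous end-joining probes in Table 8 (the NHEJ1 probe, reported as NA, is omitted).

```python
# AT2 fold changes for the NHEJ probes, copied from Table 8 (NA omitted).
AT2_NHEJ_FOLD_CHANGES = [
    1.039, 1.034, 1.034, 0.955, 1.03, 0.957, 1.113, 0.925,
    0.882, 1.014, 0.933, 0.907, 0.868, 1.014, 0.929, 0.908,
    0.952, 0.853, 1.042, 1.013, 0.957, 1.046, 1.022, 1.012,
]


def count_differential(fold_changes, over=2.0, under=0.5):
    """Count fold changes above `over` or below `under`."""
    return sum(1 for fc in fold_changes if fc > over or fc < under)


# Strict cut-offs: greater than 2.0 or less than 0.5.
strict = count_differential(AT2_NHEJ_FOLD_CHANGES)
# Relaxed cut-offs: any deviation from the wildtype level of 1.
relaxed = count_differential(AT2_NHEJ_FOLD_CHANGES, over=1.0, under=1.0)

total = len(AT2_NHEJ_FOLD_CHANGES)
print(f"strict cut-offs: {strict} of {total} probes")
print(f"any deviation from 1: {relaxed} of {total} probes")
```

For this column, none of the 24 probe values passes the strict cut-offs, while all 24 differ from 1, which is why the relaxed criterion was needed to see any differential expression in the pathway.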

BIBLIOGRAPHY

[1] Crick, F. (1970). Central dogma of molecular biology. Nature, 227(5258), 561-563.

[2] Pray, L. (2008). DNA replication and causes of mutation. Nature Education, 1(1), 214.

[3] Dexheimer, T. S. (2013). DNA Repair Pathways and Mechanisms. In DNA Repair of Cancer Stem Cells (pp. 19-31). Springer Netherlands.

[4] World Health Organization (2014). Genes and Human Disease. Retrieved from http://www.who.int/genomics/public/geneticdiseases/en/

[5] National Ataxia Foundation (2014). What is Ataxia? Retrieved from http://www.ataxia.org/learn/ataxia-diagnosis.aspx#what-is-ataxia

[6] Egreter, W. (2002, July 28). Journal Data. Retrieved from http://www.geocities.ws/libill8962/medical/journaldata.htm

[7] Wood, R. D., Mitchell, M., Sgouros, J., & Lindahl, T. (2001). Human DNA repair genes. Science, 291(5507), 1284-1289.

[8] “Public Law 107-280: Rare Diseases Act of 2002.” (116 Stat. 1988; Date: 10/6/02). Text from: United States Public Laws. Retrieved from bulk.resource.org

[9] Rana, A. Q., Khan, O. A., & Akthar, R. (2013). Progressive ataxia associated with ocular apraxia type 1 (AOA1) with a presence of a novel mutation on the aprataxin gene. Annals of Indian Academy of Neurology, 16(2), 269.

[10] Zoghbi, H. Y., Pollack, M. S., Lyons, L. A., Ferrell, R. E., Daiger, S. P., & Beaudet, A. L. (1988). Spinocerebellar ataxia: variable age of onset and linkage to human leukocyte antigen in a large kindred. Annals of Neurology, 23(6), 580-584.

[11] Coutinho P., Barbot C. (2010, June 22). Ataxia with Oculomotor Apraxia Type 1. GeneReviews™. Retrieved from http://www.ncbi.nlm.nih.gov/books/NBK1456/

[12] Matilla-Dueñas, A., Goold, R., & Giunti, P. (2008). Clinical, genetic, molecular, and pathophysiological insights into spinocerebellar ataxia type 1. The Cerebellum, 7(2), 106-114.

[13] Richardson, J. K., & Hurvitz, E. A. (1995). Peripheral neuropathy: a true risk factor for falls. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, 50(4), M211-M215.

[14] Delatycki, M. B., Williamson, R., & Forrest, S. M. (2000). Friedreich ataxia: an overview. Journal of Medical Genetics, 37(1), 1-8.

[15] Genetics Home Reference (2013). Ataxia-Telangiectasia. Retrieved from http://ghr.nlm.nih.gov/condition/ataxia-telangiectasia

[16] Gatti, R. (2010). Ataxia-Telangiectasia. GeneReviews™. Retrieved from http://www.ncbi.nlm.nih.gov/books/NBK26468/

[17] Pulst, S-M. (2013). Spinocerebellar Ataxia Type 2. GeneReviews™. Retrieved from http://www.ncbi.nlm.nih.gov/books/NBK1275/

[18] Bidichandani S. I., Delatycki M.B. (2012). Friedreich Ataxia. GeneReviews™. Retrieved from http://www.ncbi.nlm.nih.gov/books/NBK1281/

[19] Subramony S. H., Ashizawa T. (2011). Spinocerebellar Ataxia Type 1. GeneReviews™. Retrieved from http://www.ncbi.nlm.nih.gov/books/NBK1184/

[20] Weizmann Institute of Science (2014). Spinocerebellar Ataxia. MalaCards. Retrieved from http://www.malacards.org/card/spinocerebellar_ataxia

[21] Weizmann Institute of Science (2014). Tyrosyl-DNA Phosphodiesterase 1. GeneCards. Retrieved from http://www.genecards.org/cgi-bin/carddisp.pl?gene=TDP1

[22] Weizmann Institute of Science (2014). Ataxia-Telangiectasia Mutated. GeneCards. Retrieved from http://www.genecards.org/cgi-bin/carddisp.pl?gene=ATM

[23] Campuzano, V., Montermini, L., Moltò, M. D., Pianese, L., Cossée, M., Cavalcanti, F., ... & Pandolfo, M. (1996). Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion. Science, 271(5254), 1423-1427.

[24] Nair, A. S. (2007). Computational biology & bioinformatics: a gentle overview. Communications of the Computer Society of India, 2.

[25] Németh, A. H., Kwasniewska, A. C., Lise, S., Schnekenberg, R. P., Becker, E. B., Bera, K. D., ... & Ragoussis, J. (2013). Next generation sequencing for molecular diagnosis of neurological disorders using ataxias as a model. Brain, 136(10), 3106-3118.

[26] National Center for Biotechnology Information (2013). Our Mission. Retrieved from http://www.ncbi.nlm.nih.gov/About/glance/ourmission.html

[27] DeBry, R. W., & Seldin, M. F. (1996). Human/mouse homology relationships. Genomics, 33(3), 337-351.

[28] R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from http://r-project.org

[29] Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., ... & Zhang, J. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10), R80.

[30] Meyer, B. (1992). Applying 'design by contract'. Computer, 25(10), 40-51.

[31] Jacobson, I. (1992). Object-oriented software engineering: a use case driven approach. Pearson Education India.

[32] National Human Genome Research Institute (2011). DNA Microarray Technology. Retrieved from http://www.genome.gov/10000533

[33] Kyoto Encyclopedia of Genes and Genomes (2013). KEGG Pathway Database. Base Excision Repair. Retrieved from http://www.genome.jp/kegg-bin/show_pathway?map=ko03410

[34] Al-Ouran, R. (2014, April 10). Personal Communication.

[35] Franceschini, A., Szklarczyk, D., Frankild, S., Kuhn, M., Simonovic, M., Roth, A., ... & Jensen, L. J. (2013). STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Research, 41(D1), D808-D815.

[36] Elahi, A., Zheng, Z., Park, J., Eyring, K., McCaffrey, T., & Lazarus, P. (2002). The human OGG1 DNA repair enzyme and its association with orolaryngeal cancer risk. Carcinogenesis, 23(7), 1229-1234.

[37] Weizmann Institute of Science (2014). Postmeiotic Segregation Increased 1. GeneCards. Retrieved from http://www.genecards.org/cgi-bin/carddisp.pl?gene=PMS1

[38] Weizmann Institute of Science (2014). MutS Homolog 4. GeneCards. Retrieved from http://www.genecards.org/cgi-bin/carddisp.pl?gene=MSH4

[39] Moshous, D., Callebaut, I., de Chasseval, R., Corneo, B., Cavazzana-Calvo, M., Le Deist, F., ... & de Villartay, J. P. (2001). Artemis, a novel DNA double-strand break repair/V(D)J recombination protein, is mutated in human severe combined immune deficiency. Cell, 105(2), 177-186.

[40] Morio, T., & Kim, H. (2008). Ku, Artemis, and ataxia-telangiectasia-mutated: signalling networks in DNA damage. The International Journal of Biochemistry & Cell Biology, 40(4), 598-603.

[41] Lee, J. H., & Paull, T. T. (2005). ATM activation by DNA double-strand breaks through the Mre11-Rad50-Nbs1 complex. Science, 308(5721), 551-554.

[42] Martin, I. V., & MacNeill, S. A. (2002). ATP-dependent DNA ligases. Genome Biol, 3(4), REVIEWS3005.

[43] McCollister, B. (2013, April 8). Personal Interview.