UNIVERSITY OF CALGARY

A Novel Neurodevelopmental Disorder within the Hutterite Population Maps to 16p and a

Possible Causative Mutation in THOC6 was Identified

by

Chandree Lynn Beaulieu

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF MEDICAL SCIENCES

CALGARY, ALBERTA

JANUARY, 2010

© Chandree Lynn Beaulieu

Library and Archives Canada
Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada

ISBN: 978-0-494-62049-6

NOTICE:

The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author’s permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.

While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

ABSTRACT

The Hutterites are a genetically isolated group; their population numbers over 40,000, the majority of whom are descendants of 89 founders. An autosomal recessive neurodevelopmental disorder was identified in four patients from two Hutterite families. The patients have distinctive facial features, congenital malformations of the heart and genitourinary system, borderline microcephaly, and learning disability.

An identity-by-descent mapping approach was used to identify the locus for this disorder. A 50K-SNP microarray identified a single homozygous region on 16p13.3 shared between the patients. Microsatellite markers genotyped for all available family members refined the locus to a region of 5.1 Mb containing 173 genes. No other recessive disorders with similar clinical features are currently mapped to this region. Genes within the region were prioritized using data mining and expression microarrays. Ninety-seven genes were sequenced. A homozygous variant in THOC6 (p.G46R) was identified, and evidence suggests that this variant is disease-causing.


ACKNOWLEDGEMENTS

Thank you to everyone for making my Master’s studies a wonderful experience.

Thank you to my supervisor Dr. Jillian Parboosingh. You took me into the lab as a quiet summer student with very little previous lab experience and provided me with the support and environment I needed. Your support in all aspects throughout this process has been invaluable. Thank you for the effort you put in to provide me with exposure to a variety of techniques and concepts making this a good learning experience for me. Thank you to the members of my supervisory committee Dr. Micheil Innes and Dr. Paul Mains.

Dr. Micheil Innes thanks for your discussions on the patients’ characteristics, the Hutterite population, and prioritization of genes as well as your encouragement. Dr. Paul Mains thanks for your questions guiding me to look further into concepts and mechanisms. Thank you Dr. Kym Boycott for your support and suggestions in all aspects of the project and for your mentorship. Thank you Dr. Fred Biddle for your role in getting me into this lab, our conversations, your confidence in my abilities, and always pushing me to a higher level.

Thank you to all the students and fellows whom I have had the opportunity to work with in the lab. Thank you Catrina Loucks and David Redl; you have been great to work with and thanks for all our discussions and your help. Thank you Dr. Patrick Frosk for suggestions on using genomic databases, mapping, and prioritization. Thank you Dr. Ryan Lamont for advice and information. Thank you Stephanie Desmarais, Dr. Tanya Gillian, Megan Black, Danielle Lynch, and Jack Novovic. Thank you to all the medical genetics residents for discussions and opportunities to learn and be involved in other projects. Dr. David Dyment thanks for showing me statistical linkage analysis programs and Dr. Oana Caluseriu thanks for the mentorship. Thank you to all the staff at the Alberta Children’s Hospital Molecular Diagnostic Laboratory for everything you have taught me and for your company. Dana Forrest thanks for training me on experimental techniques. Thanks to everyone in the Molecular and Medical Genetics, Genes and Development Research Group for the opportunity to participate in Journal Club, Research in Progress, and Medical Genetics Rounds and thanks for all the fun events.

Thank you to the Southern Alberta Microarray Facility where I performed the expression microarrays (and Dr. Xiuling Wang for her help). Thank you to Dr. Erik Puffenberger for performing the SNP microarrays. Thank you to Dr. Carol Ober’s lab for genotyping the Hutterite controls. Thank you to those who had clinical involvement with the patients: Dr. Kym Boycott, Dr. Mike Innes, Dr. Ross McLeod, Dr. Brian Lowry, Dr. Bea Fowlow, Jackie Morris, Carol Farr, and Rachelle Bistretzan. Thank you to the patients and their families for their enthusiastic participation in this study.

This study was supported by an Alberta Children’s Hospital Foundation Grant and I was supported by the UofC CIHR Training Program in Genetics, Child Development, and Health. Thanks also to the sponsors of the William H. Davies Scholarship, Queen Elizabeth II Scholarship, Medical Science Graduate Program Award, Graduate Conference Travel Grant, and Alberta Graduate Student Scholarship.

Thanks to everyone at Canon US Life Sciences who welcomed me as part of their team for an internship this past summer. Thank you to all of my professors and teachers.

Thank you to my friends; you are all wonderful. And, a very special thanks to my parents and family for their constant support throughout all of my studies.


TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

INTRODUCTION
    Importance of recessive disorders
    The Hutterite population
    Novel neurodevelopmental disorder
    Available patient resources
    Clinical features
    Comparison to other disorders
    Identity-by-descent mapping
    Gene prioritization strategies
    SNP arrays
    A homozygous region of interest was identified by genome-wide SNP analysis
    Copy number from SNP arrays

OBJECTIVES AND RATIONALE

METHODS
    DNA extraction
    Fine mapping with the use of microsatellite markers
    Identification of microsatellite markers
    PCR amplification of microsatellite markers
    Real-time PCR to analyze suspected copy number variation regions
    Prioritization of Genes
    Data mining
    cDNA expression studies
    Microarray
    RNA extractions
    Microarray hybridization procedure
    Microarray analysis
    Confirmation by semi-quantitative PCR
    Sequencing
    Primer design
    PCR amplification
    Sequencing
    Analysis of sequence

MAPPING RESULTS
    Haplotype and minimal region
    Copy number variation in the region

PRIORITIZATION RESULTS
    Prioritization using functional and expression information
    Prioritization based on downregulated genes within the candidate region
    Prioritization using genome-wide microarray data
    Genes not chosen for sequencing

SEQUENCING RESULTS
    Variants identified
    THOC6 Variant

DISCUSSION
    Technical challenges
    Mapping and prioritization
    Copy number variation detection
    Genotyping and sequencing
    THOC6

FUTURE DIRECTIONS

CONCLUSION

ONLINE RESOURCES

REFERENCES


LIST OF TABLES

Table 1. Summary of clinical features in the Hutterite patients

Table 2. Location score calculation

Table 3. Microsatellite markers used for genotyping

Table 4. The extent of sequencing for each of the downregulated genes

Table 5. Genes chosen with functional and expression information

Table 6. Genes chosen for sequencing based on genome-wide microarray pathway analysis

Table 7. Genes that were not chosen for sequencing

Table 8. Variants that were present in dbSNP

Table 9. Variants that were not listed in dbSNP

Table 10. The possible consequences of allele dropout

LIST OF FIGURES

Figure 1. A graph of the Hutterite population size in relation to the year

Figure 2. Partial pedigree illustrating the closest relationships among the parents of the Hutterite patients

Figure 3. Characteristic facial features identified in the Hutterite patients

Figure 4. Diagram illustrating identity-by-descent

Figure 5. SNP homozygosity plots of the four patients

Figure 6. Copy number variants within the Database of Genomic Variants surrounding the possible four copy SNP rs10500322 seen in the patients

Figure 7. Location of the SNP that showed increased copy number on the Affymetrix 50K microarray and the primer pairs designed on both sides

Figure 8. Identifying regions of conservation upstream of a gene

Figure 9. Sample electropherogram for microsatellite marker genotyping

Figure 10. Haplotype analysis based on microsatellite and SNP markers

Figure 11. Quality control of Real-Time PCR

Figure 12. Gene expression plots comparing intensity levels in the patients to unaffected controls

Figure 13. Gene Wanderer software prioritization of genes in the region based on comparison to primary microcephaly genes

Figure 14. Genes within the region with significantly changed expression in the patients

Figure 15. Semi-quantitative PCR of genes with altered expression in the region

Figure 16. Genome-wide analysis of downregulated genes in the patients

Figure 17. Genome-wide analysis of upregulated genes in the patients

Figure 18. Software prioritization of genes in the region based on a training set of genes that showed a 5-fold or greater change in the patients on the microarray

Figure 19. Summary of all genes chosen for sequencing

Figure 20. Variant identified within THOC6

Figure 21. Determining the effect of the variant on splicing

Figure 22. Components of the THO and TREX complex in yeast, Drosophila, and human

Figure 23. Diagram illustrating components involved in mRNA export from the nucleus to the cytoplasm

Figure 24. Apoptosis pathway indicating genes that were upregulated or downregulated in the patients on the microarray

Figure 25. Alignment of human THOC6 with zebrafish THOC6 in the region surrounding the missense variant

LIST OF ABBREVIATIONS

16p         short arm of chromosome 16
A           absorbance
bp          base pair
°C          degree Celsius
cDNA        complementary deoxyribonucleic acid
CNV         copy number variation
dbSNP       database of single nucleotide polymorphisms
DNA         deoxyribonucleic acid
dNTP        2’deoxyribonucleotide 5’triphosphate
DTT         dithiothreitol
EDTA        ethylenediaminetetraacetic acid
ex          exon
EST         expressed sequence tag
FISH        fluorescent in situ hybridization
GCOS        Gene Chip Operating Software
HeLa        immortal cervical cancer cell line from Henrietta Lacks
IVS         intervening sequence
IVT         in vitro transcription
LOD         logarithm of the odds
low TE      Tris with low EDTA
Mb          megabase
min         minute
mM          millimolar
mRNA        messenger ribonucleic acid
NCBI        National Center for Biotechnology Information
ng          nanogram
Oligo       oligonucleotide
OMIM        Online Mendelian Inheritance in Man
PCR         polymerase chain reaction
QH2O        purified and deionized water
rpm         revolutions per minute
RT-PCR      real time polymerase chain reaction
s           second
siRNA       small interfering ribonucleic acid
SNP         single nucleotide polymorphism
STS         sequence tagged sites
THO/TREX    transcription export complex
Tris        trishydroxymethylaminomethane
UCSC        University of California Santa Cruz Genome Browser
ug          microgram
ul          microliter
uM          micromolar

INTRODUCTION

A novel neurodevelopmental disorder with autosomal recessive inheritance was identified in the Hutterite population. The objective of this project was to map the locus and identify the causative gene for this disorder.

Importance of recessive disorders

This study involved looking for the gene responsible for a rare autosomal recessive disorder identified in an isolated population. Although there has been a shift in recent years towards determining the cause of complex disorders, the importance of identifying the molecular basis for Mendelian phenotypes remains (Antonarakis and Beckmann, 2006). There are estimated to be approximately 25,000 protein-coding genes in the genome, but currently less than 10% of these genes have been found to cause monogenic Mendelian diseases (Antonarakis and Beckmann, 2006). Although not all genes are expected to be associated with a disorder, there still exist over 1700 Mendelian phenotypes and over 2000 suspected Mendelian disorders without a known cause (OMIM Statistics, 2009). The actual number is likely larger than this, as OMIM is enriched for disorders that have been observed in more than one family, and very rare unreported Mendelian disorders, such as the one in this study, are not represented.

Identifying the causative genes of Mendelian phenotypes provides a direct means to determine gene function (Antonarakis and Beckmann, 2006). Linking genes with functions, pathways, and phenotypes is an important step in understanding our genome.

Knowledge gained from monogenic disorders may also provide insight into complex disorders as milder allelic variants or interacting partners may contribute to these disorders. For example, information on BBS4 (Bardet-Biedl Syndrome type 4) and its interacting partners led to the discovery of additional susceptibility genes for schizophrenia (Kamiya et al., 2008). Drugs can be developed targeting genes or pathways with obvious monogenic phenotypes that have related complex phenotypes (Brinkman et al., 2006).

The identification of the molecular basis of monogenic disorders also has an immediate clinical impact for the care of patients and their families. Although many of the recessive disorders are rare, they can have a large impact on the families or communities affected. Identification of a molecular cause can allow for carrier testing and early diagnosis, reducing the seemingly never-ending tests in search of an answer. A diagnosis may lead to a better prediction of the disease progression in the patients (important for families and clinicians) and, in some cases, to treatment options. In addition, as genetic treatments develop that target specific mutation types such as triplet repeats (Wheeler et al., 2009) or nonsense mutations (Hirawat et al., 2007), rare, currently untreatable disorders may benefit from these new treatment strategies.

The Hutterite population

Population isolates offer a number of advantages for genetic studies that have been utilized extensively in the past. Two key features of an isolate are its formation by the genetic separation of a limited number of individuals, resulting in a founder effect, and its continued isolation, leading to high levels of consanguinity (Peltonen, 2000). Since there is a limited number of founding individuals, there is a limited genetic pool that can, and often does, lead to differences in gene frequency between the isolate and the original population from which the isolate arose. Some alleles, which were relatively common among founding members, will tend to have much higher frequencies in the isolate than in the original population, while other alleles may have lower frequencies or may not be seen at all within the isolate. Continued isolation and consanguinity can contribute to an increase in the frequency of rare alleles and can lead to increases in the incidence of rare recessive disorders (Peltonen, 2000). Disorders within an isolate often arise from a single mutation that has been introduced into the population by a common ancestor; thus, individuals with autosomal recessive conditions will be homozygous for the disease locus, due to identity-by-descent. As will be described later, this greatly facilitates the mapping of a new disease locus. This study utilized the characteristics of a genetic isolate, the Hutterite population. Over 30 autosomal recessive conditions have been reported in the Hutterite population and approximately half of these disorders do not yet have their causative gene identified (Boycott et al., 2008).

The Hutterite history and lifestyle are ideal for identifying and mapping Mendelian disorders. The Hutterites are a German-speaking Anabaptist group that arose during the Protestant Reformation (1528) (Hostetler, 1985). Since this time, they have faced frequent persecution and been forced to flee many times. They were expelled from Moravia in 1621 by the Hapsburg Empire, causing a reduction in their numbers from approximately 20,000 individuals to 19. However, the organization was revived by a group of Carinthian Lutherans exiled to Romania. In 1770, the entire group of only 116 individuals fled to, and settled in, Ukraine. They grew from 116 to 1265 individuals. Four hundred and forty-three members (including 101 couples) of this group moved to North America in 1874 (Hostetler, 1985). The current, communal-living Hutterites of North America have as their ancestors these 443 members along with three other large families that joined later (Arcos-Burgos and Muenke, 2002). These drastic reductions in population size at points in their history (Figure 1) resulted in a substantial founder effect. Taking the population’s entire history into account, the majority of individuals are thought to be descendants of 89 founders (Nimgaonkar et al., 2000). When the Hutterites migrated to North America they settled into three separate colonies: the Schmiedeleut, Dariusleut, and Lehrerleut. These leuts have since each grown into numerous colonies, but each of the three leuts has maintained a separate identity to this day, without intermarriage between them (Hostetler, 1985). Less than 2% of Hutterites have permanently left colony life and there have been very few converts. These few converts can be easily identified by their names (Hostetler, 1985). The average relatedness between any two Hutterites within the same leut is approximately that of second cousins, who are expected to share 1/32 of their genetic material (Hostetler, 1985).
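The 1/32 figure for second cousins follows from the standard path-counting rule for the coefficient of relationship: each path of n meioses through a common ancestor contributes (1/2)^n. The short Python sketch below is illustrative only and is not part of the thesis analysis; the function name is hypothetical, and the common ancestors are assumed not to be inbred themselves.

```python
from fractions import Fraction

def coefficient_of_relationship(path_lengths):
    """Sum (1/2)**n over every path of n meioses that links two
    relatives through a common ancestor.
    Assumption: the common ancestors are not themselves inbred."""
    return sum(Fraction(1, 2) ** n for n in path_lengths)

# Second cousins share two great-grandparents. Each connecting path
# runs 3 meioses up to the ancestor and 3 back down, 6 in total.
second_cousins = coefficient_of_relationship([6, 6])
print(second_cousins)  # 1/32
```

Because Hutterite individuals are related through many ancestral paths at once, the true sum would contain many more terms than the two used here.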

[Figure 1: graph of Hutterite population size (y-axis, 0 to 30,000) against year (x-axis); the population minimum is annotated "Less than 100 individuals".]

Figure 1. A graph of the Hutterite population size in relation to the year. The drastic fluctuations in their numbers have resulted in a substantial founder effect.

Except for a few brief points in their history, the Hutterites have been a communal living group with common ownership of property. They live in colonies of 60-100 individuals (Hostetler, 1985). Over forty thousand Hutterites live on 460 colonies in North America (Boycott et al., 2008). There are over 12,000 Hutterites in Alberta (2001 Census). Although the Hutterites are careful to avoid possessions, which they feel may lead to materialism, they do accept modern technological advancements and this includes seeking modern medical care. Hutterite couples are monogamous, are encouraged to have large families, and believe in the value of life, thus increasing the likelihood of finding families with multiple affected children. The colony preacher keeps extensive genealogical records allowing accurate relationships to be determined (Hostetler, 1985).

Novel neurodevelopmental disorder

Available patient resources

There has been a clinical effort in the Department of Medical Genetics at the University of Calgary and other genetic centres that service the Hutterite population to phenotype Hutterite patients with neurological disorders. Through this undertaking, a novel neurodevelopmental syndrome within the Hutterite population of Alberta (Dariusleut) was identified. Four patients (two sets of sisters) were identified as sharing a unique set of characteristics. Figure 2 shows a seven-generation pedigree demonstrating the most direct relationships between the patients. However, the average relatedness between any two Hutterite individuals is approximately that of second cousins (Hostetler, 1985); thus, the overall relatedness will be higher than shown here, as they will be related in multiple ways. The relatedness of the parents and the nature of the pedigree suggest autosomal recessive inheritance for this disorder.


Figure 2. Partial pedigree illustrating the closest relationships among the parents of the Hutterite patients. Patient identification numbers are given below the symbols. The affected female patients are indicated by solid circles. The closest relationship between the parents, VI-1 and VI-2, of affected siblings VII-2 and VII-6 is third cousins once removed. The closest relationship between the parents, V-3 and V-4, of affected siblings VI-6 and VI-10 is second cousins. However, the overall relatedness will be higher than shown as they are related in multiple ways.

The four patients gave informed consent to participate in this study and blood was drawn to obtain their DNA. In addition, the four parents, two unaffected siblings from family 1 (VII-1 and VII-5), and four unaffected siblings from family 2 (VI-5, VI-8, VI-9, and VI-11) agreed to participate in the study and provided blood samples.

Additional patients with a similar phenotype have been sought. One Hutterite individual with some similarities was identified, but this patient did not have the same homozygous region as the patients, suggesting the cause of the clinical presentation is distinct. Despite presenting the phenotype to the genetics centres in North America that see Hutterite patients no additional patients have been identified.

Clinical features

The four patients presented with a novel neurodevelopmental disorder. They were examined by a medical geneticist and a summary of the major clinical features seen in each of the patients is provided in Table 1 (Boycott et al., submitted). The patients were found to have borderline microcephaly, with head circumferences at the 2nd percentile, indicating a mild defect in brain growth leading to a decreased brain size. A CT scan and brain MRI performed on one patient did not reveal any abnormalities. All the patients exhibited delays in early childhood language development. They progressed to reading at an approximately grade 4 level, yet were far below this level in other skills including the ability to do basic arithmetic, tell time, remember sequences of events, or follow directions (Boycott et al., submitted).

In addition to the neurological features, the patients had structural defects of other organs. Two patients were identified as having congenital heart defects including ventricular septal defect and patent ductus arteriosus. Ventricular septal defect is due to a hole in the wall separating the left and right ventricles, thus resulting in the mixing of oxygenated and unoxygenated blood. Patent ductus arteriosus refers to a condition resulting from the failure of the ductus arteriosus, a blood vessel used in the fetus to allow blood to bypass the lungs, to close after birth. A horseshoe kidney was observed in one of the patients and an absent left kidney in another patient. Individuals with a horseshoe kidney have one horseshoe-shaped, but adequately functioning, kidney, resulting from the two kidneys fusing during fetal migration of the kidney to the appropriate spot. One patient had velopharyngeal insufficiency: failure of the soft palate muscle in the back of the throat to close during speech, resulting in a nasal quality to the voice. All the patients had a shorter than average stature and dental problems including dental malocclusion (improper alignment of the teeth) and being prone to caries (decay of the tooth surface). Other symptoms that were seen in the patients included myopia, urinary tract infection, premature ovarian failure, and endometriosis (Boycott et al., submitted).

The patients had a number of distinctive facial features, illustrated by the photos of the patients (Figure 3). They had a tall forehead with a high hairline, deep-set eyes with short palpebral fissures, a prominent nose with an overhanging columella, and full lips (Boycott et al., submitted).

Table 1. Summary of clinical features in the Hutterite patients (Boycott et al., submitted)

                          Family 1                                Family 2
Feature                   VII-2              VII-6                VI-6        VI-10
Age                       18                 8                    29          22
Gender                    F                  F                    F           F
Growth
  Height (centile)        5th                10th                 10-25th     10-25th
  OFC (centile)           2nd                2nd                  2nd         2nd
Characteristic facies     +                  +                    +           +
Congenital anomalies
  Cardiac                 PDA, VSD           VSD                  -           -
  Renal/GU                Horseshoe kidney   Left renal agenesis  rUTI        rUTI
Health issues
  Gynecological           -                  -                    POF         Endometriosis
  Dental caries           +                  +                    +           +
Cognitive
  Developmental delay     Language           Language             Language    Language
  School difficulties     IEP                IEP                  IEP         IEP
Other                     VPI                enuresis             myopia      myopia

Micrognathia is also listed under Other.

OFC: occipitofrontal circumference; PDA: patent ductus arteriosus; VSD: ventricular septal defect; GU: genitourinary tract; rUTI: recurrent urinary tract infections; POF: premature ovarian failure; VPI: velopharyngeal insufficiency; IEP: individual education program


Figure 3. Characteristic facial features identified in the Hutterite patients. Recognizable facial features include tall forehead, high anterior hairline, short and up-slanting palpebral fissures, long nose with an overhanging columella, and full lips. A) The two patients in family 1. B) The two patients in family 2.

The features described above were identified by medical geneticists as likely attributable to a genetic defect. Neither the distinctive facial features nor the described symptoms were seen in the unaffected siblings examined. However, with only four patients available, the complete phenotypic spectrum of this disorder is not yet fully understood. Additionally, for those symptoms seen in only one or two patients, there is the possibility that these are coincidentally present in the patients and not due to the shared mutation. For example, a horseshoe kidney, which was seen in one patient, has a high frequency in the general population, being seen in 1 per 300-400 births (Stevenson and Hall, 2005). Ventricular septal defect is the most common congenital heart defect, with about 1-2 cases per 1000 births, and patent ductus arteriosus has a frequency of about 1 per 2000-2500 births (www.emedicine.com). The frequencies seen in the general population for other symptoms present in just one patient are as follows: velopharyngeal insufficiency 1 per 750 births, endometriosis 6-8% of women, and premature ovarian failure 1% of women (www.emedicine.com).

Each of the patients underwent metabolic tests, which revealed normal levels of creatine kinase, ammonia, organic acids in the urine, sugar in the urine, and plasma amino acids for all patients. A full karyotype done on each of the patients did not reveal any chromosomal abnormalities. An assessment of subtelomeric deletions was done as these are often associated with neurological disorders, but none were found. Fluorescent in situ hybridization (FISH) testing for a 22q11 deletion was done to rule out 22q11 deletion syndrome, which has an overlapping phenotype with this disorder, but the patients did not have a 22q11 deletion (Boycott et al., submitted).

Comparison to other disorders

The four patients demonstrated a unique combination of symptoms, not seen within any previously described disorder. A differential diagnosis of microcephaly, neurodevelopmental delay, and distinctive facial features in Hutterite newborns includes Bowen-Conradi syndrome. One of the four patients was given this diagnosis for the first year of life (Boycott et al., submitted). However, Bowen-Conradi syndrome is a much more severe disorder and, as the children develop, the distinction becomes clear. Bowen-Conradi syndrome is a lethal autosomal recessive disorder characterized by pre- and postnatal growth retardation, microcephaly, a prominent nose, micrognathia, and severe psychomotor delay and is caused by mutation of the EMG1 gene on 12p13.3 (Armstead et al., 2009). The disorder being studied can also be distinguished from other inherited developmental disorders associated with microcephaly such as the primary microcephalies, Seckel syndrome, Dubowitz syndrome, Floating Harbor syndrome, Feingold syndrome, and Rubinstein-Taybi syndrome based on the degree of microcephaly, facial features, other malformations present, or inheritance pattern (Boycott et al., submitted).

Identity-by-descent mapping

A homozygosity mapping approach to localize the disease-causing gene was

undertaken. Homozygosity mapping is a technique that can allow mapping with very few

individuals, but does have the requirement that the disease alleles are identical-by-

descent. The concept of homozygosity mapping is illustrated in Figure 4. The children of

consanguineous marriages are expected to have a fraction of their genome identical-by-

descent. The parents are related and thus will have regions of their genome in common

with each other due to inheritance from a common ancestor. The children can be

expected to inherit two identical copies of a number of these regions, one from each

parent. The closer the relationship between the parents, the more genetic material the

parents share, so the more homozygous regions are expected (Sheffield et al., 1995).

Identification of these homozygous regions in affected individuals becomes a powerful

tool in the mapping of autosomal recessive conditions (Lander and Botstein, 1987). The

affected offspring are all expected to be homozygous for the disease gene, but unaffected

offspring are not expected to share homozygosity at the disease gene (Sheffield et al.,

1995). In the same way, the markers close to the disease gene are also expected to be

homozygous in the affected, but not the unaffected offspring. As the disease

chromosome is passed down to each new generation, meiosis is taking place. Each

meiotic event involves crossing over, thus with each round of meiosis the extent of

identity with the original disease chromosome may be reduced. The more meioses that separate the individuals from the common ancestor, the smaller the identity-by-descent region is expected to be (Sheffield et al., 1995). The comparison of

homozygous regions (through the use of genetic markers) between affected individuals, parents, and unaffected siblings should allow for localization of the disease locus within the genome.
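The comparison of homozygous regions between affected individuals can be illustrated with a minimal sketch. It assumes biallelic genotype calls coded as "AA", "AB", or "BB" and ordered by position; the function name, the `min_markers` threshold, and the example genotypes are hypothetical and are not taken from the analysis software used in this study.

```python
from typing import Dict, List, Tuple


def shared_homozygous_runs(
    genotypes: Dict[str, List[str]],
    positions: List[int],
    min_markers: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (start_bp, end_bp, n_markers) for each run of consecutive markers
    at which every individual is homozygous for the same allele."""
    shared = []
    for i in range(len(positions)):
        calls = {calls_for_person[i] for calls_for_person in genotypes.values()}
        only = next(iter(calls))
        # Shared homozygosity: a single distinct call, and that call is homozygous.
        shared.append(len(calls) == 1 and len(only) == 2 and only[0] == only[1])

    runs: List[Tuple[int, int, int]] = []
    start = None
    for i, flag in enumerate(shared + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_markers:
                runs.append((positions[start], positions[i - 1], i - start))
            start = None
    return runs


# Hypothetical example: two affected individuals, six markers.
example_positions = [100, 200, 300, 400, 500, 600]
example_genotypes = {
    "P1": ["AB", "AA", "BB", "AA", "AB", "AA"],
    "P2": ["AA", "AA", "BB", "AA", "BB", "AB"],
}
example_runs = shared_homozygous_runs(example_genotypes, example_positions)
```

In this toy example only markers two to four are homozygous for the same allele in both individuals, so a single run is reported. A real analysis would also need to handle missing calls and genotyping error.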

Figure 4. Diagram illustrating identity-by-descent. The consanguineous marriage between individuals both of whom inherited a mutation from a common ancestor resulted in children who were identical-by-descent at the mutation locus. The area around the mutation, inherited from the common ancestor (shaded area), is decreased in size with each successive round of meiosis as crossing over occurs. This means children that are identical-by-descent for the mutation are also identical-by-descent for a limited region surrounding the mutation.

Homozygosity mapping facilitates mapping of rare recessive disorders that would

not be possible through traditional linkage analysis. Traditional linkage analysis can only

detect crossovers between the generations genotyped, thus requires multiple families,

each with multiple generations available and multiple affected individuals in order to

obtain significant evidence for linkage (LOD score >3). Homozygosity mapping detects

the closest crossover events that occurred since the common ancestor that passed down

the disease allele, thus mapping can be performed with only a few key individuals

(Lander and Botstein, 1987). Numerous studies have successfully utilized the power of homozygosity mapping to localize disease-causing genes using a limited number of individuals within consanguineous families or isolated populations. For example, the locus for Leucodysplasia, microcephaly, cerebral malformation was linked to 2p16 with only three affected individuals within a consanguineous Asian family (Chandler et al., 2006). Likewise, GLIS2 was identified as a gene causing nephronophthisis within a consanguineous Oji-Cree family using two siblings and a third-degree cousin (Attanasio et al., 2007). Within the Hutterite population, this approach was used to map the disease locus for Usher Syndrome Type 1F using only two affected individuals, and the causative gene was subsequently identified (Alagramam et al., 2001). Also within the Hutterite population, homozygosity mapping was the first step towards identifying a mutation in VLDLR as the cause of cerebellar hypoplasia (Boycott et al., 2005), DNAJC19 as the cause of dilated cardiomyopathy with ataxia (Davey et al., 2006), TRIM32 as the cause of limb-girdle muscular dystrophy type 2H (Frosk et al., 2002), and EMG1 as the cause of Bowen-Conradi syndrome (Armistead et al., 2009).

Polymorphic markers are used to identify extended regions of homozygosity. Both single nucleotide polymorphisms (SNPs) and microsatellite markers have been used, although SNP microarrays have become the method of choice in recent years, followed in some instances by fine mapping with microsatellite markers. The variation in length of microsatellite markers provides greater polymorphism than SNPs, which are biallelic, but SNP genotyping has the advantage of being more easily automated and performed in a single step. This means that very high density scans may be performed with relative ease (Kruglyak, 1997). Theoretically, maps of biallelic markers at about 2.25-2.5 times the density of microsatellite maps will provide

approximately the same amount of information as the microsatellite map (Kruglyak,

1997). There have been cases where SNP genotyping has allowed for mapping when

microsatellite markers did not, due to the higher density of SNPs that can more easily be

achieved; for example, the use of a SNP array identified the locus for Bardet-Biedl

syndrome type 11 when genotyping with microsatellite markers failed (Chiang et al.,

2006).
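The difference in polymorphism between the two marker types can be made concrete with expected heterozygosity, one minus the sum of squared allele frequencies. The sketch below is illustrative only; the allele frequencies are hypothetical, and the calculation is not the information-content analysis from which the 2.25-2.5 density figure of Kruglyak (1997) was derived.

```python
from typing import Sequence


def expected_heterozygosity(allele_freqs: Sequence[float]) -> float:
    """Expected heterozygosity H = 1 - sum(p_i^2) for one marker."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# A biallelic SNP is most informative when both alleles are equally common.
snp_h = expected_heterozygosity([0.5, 0.5])

# Hypothetical microsatellite with eight equally frequent alleles.
microsatellite_h = expected_heterozygosity([0.125] * 8)
```

A biallelic marker cannot exceed a heterozygosity of 0.5, whereas a multi-allelic microsatellite can approach 1, which is why more SNPs than microsatellites are needed to extract comparable information.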

Homozygosity mapping is typically performed by genotyping only affected

individuals and identifying shared regions of homozygosity. However, genotyping all

available family members can aid in refining a region. When more than one unlinked

homozygous region is obtained, identifying regions also shared in the unaffected siblings

may enable some regions to be ruled out (Strauss et al., 2008). Boundaries of a large

region may also be refined if a portion of the region is also homozygous in the unaffected

siblings. To take advantage of this power, all available family members could either be

included in the genome-wide scan or the genome-wide scan could be performed only

with the patients and fine mapping performed afterwards in those shared regions of

homozygosity with the additional family members.
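Refining a candidate region with unaffected siblings amounts to interval subtraction, as in the following minimal sketch. It assumes that regions are half-open base-pair intervals and that any segment passed in `excluded` is homozygous for the same haplotype in an unaffected sibling as in the patients; the function name and coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def subtract_intervals(candidate: Interval, excluded: List[Interval]) -> List[Interval]:
    """Remove every excluded (start, end) span from one candidate (start, end) span."""
    remaining: List[Interval] = []
    cursor, end = candidate
    for ex_start, ex_end in sorted(excluded):
        if ex_end <= cursor or ex_start >= end:
            continue  # no overlap with what is left of the candidate
        if ex_start > cursor:
            remaining.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
        if cursor >= end:
            break
    if cursor < end:
        remaining.append((cursor, end))
    return remaining


# Hypothetical example: the distal fifth of a candidate region is also
# homozygous for the patient haplotype in an unaffected sibling.
refined = subtract_intervals((0, 100), [(80, 120)])
```

A region that is entirely covered by an excluded span returns an empty list, which corresponds to ruling that region out.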

Gene prioritization strategies

Identification of the disease locus is the first step to determining a molecular

cause, but the ultimate goal is to identify the causative mutation in the responsible gene.

Known and predicted genes within the shared region of homozygosity can be identified

using a genomic database. However, depending on the number of available individuals

and degree of relatedness, homozygous regions identified can be large and contain

hundreds of genes. If traditional sequencing methods are to be used, these genes need to be prioritized according to which are most likely to cause the disorder.

Numerous prioritization software programs exist. GeneWanderer prioritizes genes based on the idea that genes causing phenotypically similar or identical disorders form a highly interconnected protein subnetwork with each other (Kohler et al., 2008).

Endeavour prioritizes genes based on interactions, expression patterns, and other similarities to genes causing phenotypically similar or identical disorders (Aerts et al.,

2006). However, the use of these programs requires defining similar genes or a disease

family to which the novel disorder is expected to belong. This association does not

always exist or is not always obvious from the phenotypic presentation.

Manual data-mining of functional information available for genes in the region

may aid in prioritization. Genes that are associated with dissimilar disorders in humans

or dissimilar phenotypes in model organisms can be considered less likely to cause the

disorder. Genes shown to have roles in the development of the affected systems can be

considered more likely. When doing this type of prioritization, an in-depth understanding

of gene types, families, complexes, and pathways is important, but is not always available

or evident upon first examination. Functional information is not available for the

majority of genes, thus expression in the affected systems can be used as an indirect

indicator of a function in those systems (Buchanan et al., 2009).

Large locus regions present difficulties in prioritization for novel disorders;

looking at expression differences within the patients using lymphocytes (the tissue most

easily obtained) may be a worthwhile prioritization strategy. If the causative gene

mutation affects mRNA levels, such as altering regulation or causing nonsense mediated

decay, then this would be detected as a reduction of RNA for this gene in the patients.

RNA levels can be detected by microarray in a technique referred to as “blood genome and transcriptome profiling”. This technique was used to examine expression of

candidate genes genome-wide to identify the gene causing a mild form of Xeroderma

Pigmentosum in a family (Vahteristo et al., 2007). It was also used successfully to identify the gene, within a mapped region, causing action myoclonus-renal failure syndrome (Berkovic et al., 2008). This technique has the potential to identify a gene quickly and to possibly detect regulatory or splicing mutations, which may be missed by sequencing only the coding regions. However, there are some limitations to this technique. The causative mutation has to affect mRNA levels; a missense mutation, for example, may not. Also, the causative gene needs to be expressed in lymphocytes and the gene needs to be represented on the microarray. If these conditions are not met and no gene within the region is identified, genome-wide analysis may nonetheless point to altered pathways in the patients. By expanding the analysis to the whole genome instead of focusing on changes only in genes within the critical region, genes within the critical region that are related to changed pathways can be selected as good candidates even though their transcript expression levels are not altered. This pathway information may be useful in prioritization for novel disorders where the phenotype does not point to a specific pathway. It should be kept in mind that only pathways expressed in lymphocytes will be detectable.
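The core of expression-based prioritization, flagging genes in the critical region whose transcript level is reduced in patients, can be sketched as follows. This is an illustrative filter under simple assumptions, not the analysis pipeline of the cited studies: expression values are taken to be positive, already normalized intensities, and the gene names and the `min_log2_drop` threshold are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def reduced_in_patients(
    expr_patient: Dict[str, float],
    expr_control: Dict[str, float],
    region_genes: List[str],
    min_log2_drop: float = 1.0,
) -> List[Tuple[str, float]]:
    """Return (gene, log2 fold change) for region genes whose expression in the
    patient is reduced by at least min_log2_drop, strongest reduction first."""
    hits = []
    for gene in region_genes:
        if gene not in expr_patient or gene not in expr_control:
            continue  # gene not represented on the array
        log2_fc = math.log2(expr_patient[gene] / expr_control[gene])
        if log2_fc <= -min_log2_drop:
            hits.append((gene, log2_fc))
    return sorted(hits, key=lambda item: item[1])


# Hypothetical intensities for three genes in a critical region.
patient = {"GENE_A": 50.0, "GENE_B": 100.0}
control = {"GENE_A": 200.0, "GENE_B": 110.0, "GENE_C": 80.0}
candidates = reduced_in_patients(patient, control, ["GENE_A", "GENE_B", "GENE_C"])
```

Genes missing from either data set are skipped, not scored, which mirrors the limitation that a gene must be represented on the array and expressed in lymphocytes to be detected.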

Next-generation massively parallel sequencing technology enables large regions to be sequenced and is beginning to eliminate the need for prioritization and sequencing of single genes. Variants identified will still need to be prioritized, based on the type of change and the potential biological relevance of the genes in which they are located, to identify the disease-causing variant amongst the many rare, non-pathogenic variants.

SNP arrays

A homozygous region of interest was identified by genome-wide SNP analysis

For this study, DNA samples from the four affected patients were genotyped using an Affymetrix 10K SNP microarray, later followed by an Affymetrix 50K SNP microarray. The 10K SNP microarray contained approximately 10,000 SNPs embedded on a single chip and the 50K SNP microarray contained approximately 50,000 SNPs.

DNA samples from the four affected patients were sent to Dr. Erik Puffenberger (The

Clinic for Special Children) for genotyping. SNP microarrays were performed following standard Affymetrix protocols, by enriching for fragments containing the SNPs followed by hybridization to oligonucleotide probe sets containing probes for each of the two possible alleles of each SNP. Since the arrays served as an initial screen to identify homozygous regions in the patients, and additional genotyping of the homozygous regions with microsatellite markers would follow, it was deemed unnecessary to also genotype the parents and unaffected siblings using these arrays.

The homozygosity plots for the 10K and 50K SNP microarrays are shown in Figure 5. Each plot shows the number of homozygous SNPs at each location and gives location scores. The location scores are calculated by adding up the autozygosity (identical-by-descent) LOD scores calculated for each SNP. The autozygosity LOD score is the logarithm of the ratio of the probability that a SNP is homozygous due to identity-by-descent to the probability that it is not identical-by-descent. This calculation takes into account the frequency of the SNP’s different alleles within the population (Table 2). Genotype data from 80 healthy Old Order Amish samples were used to estimate population-specific SNP allele frequencies for the 10K data, while Affymetrix European control data were used for the 50K analyses. The location score is not equivalent to a traditional LOD score, but does provide a relative measure across the patients that indicates the probability that a region is identical-by-descent compared to all other regions genotyped (www.silicongenetics.com Autozygosity Analysis; Broman and Weber, 1999; Puffenberger et al., 2004). Only one strong homozygous region was identified that was shared between all of the patients; the same region was identified in both the 10K and 50K SNP microarrays. The region identified in the 10K SNP microarray, indicated by 18 homozygous SNPs, was 7.7 Mb in size and located on 16p13.3. The 50K SNP microarray indicated the same region by 65 homozygous SNPs and reduced this region to 5.5 Mb.

Table 2. Location score calculation (Broman and Weber, 1999).

Genotype from array | Probability that SNP is identical-by-descent | Probability that SNP is not identical-by-descent | Ratio
AA | $(1-E)p_A + E p_A^2$ | $p_A^2$ | $(1-E)/p_A + E$
AB | $2E p_A p_B$ | $2 p_A p_B$ | $E$

LOD = log(probability identical-by-descent / probability not identical-by-descent)
Location score: sum of the LOD scores of the SNPs in that region
$E$: combination of the rate of genotyping errors and mutations (1%)
$p_A$: the frequency of allele A within the isolated population
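The calculation in Table 2 can be written out as a short sketch. It assumes a base-10 logarithm, as is conventional for LOD scores (the table states only "log"), and an error rate E of 1%; the function names and the example allele frequencies are hypothetical and do not reproduce the software used for the analysis.

```python
import math
from typing import Iterable, Tuple


def autozygosity_lod(genotype: str, p_a: float, error: float = 0.01) -> float:
    """LOD score for one SNP using the ratios in Table 2.

    genotype: "AA" (homozygous) or "AB" (heterozygous)
    p_a: population frequency of the allele that is homozygous in an "AA" call
    error: combined genotyping error and mutation rate (E)
    """
    if not 0.0 < p_a < 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if genotype == "AA":
        ratio = (1.0 - error) / p_a + error
    elif genotype == "AB":
        ratio = error
    else:
        raise ValueError("genotype must be 'AA' or 'AB'")
    return math.log10(ratio)  # base 10 is an assumption


def location_score(calls: Iterable[Tuple[str, float]], error: float = 0.01) -> float:
    """Sum of the per-SNP LOD scores across a region."""
    return sum(autozygosity_lod(genotype, p_a, error) for genotype, p_a in calls)


# Hypothetical region: three homozygous SNPs, one of them for a rare allele.
score = location_score([("AA", 0.5), ("AA", 0.5), ("AA", 0.1)])
```

A homozygous call for a rare allele contributes more to the score than one for a common allele, while a single heterozygous call contributes log10(E) = -2 and so strongly penalizes the region.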

Figure 5. SNP homozygosity plots of the four patients. The number of contiguous homozygous SNPs is indicated in yellow and the location scores are shown in purple (for location score explanation see Table 2). A) 10K SNP homozygosity plot indicating one strong homozygous region at 16p. The homozygous region indicated by 18 contiguous homozygous SNPs is 7.7 Mb. B) 50K SNP homozygosity plot indicating one strong homozygous region at 16p. The homozygous region indicated by 65 contiguous homozygous SNPs is 5.5 Mb.

Once the region was identified, a search was done to see if any previously described autosomal recessive disorders with symptoms similar to this novel disorder also mapped to this region. None were found, supporting the conclusion that the syndrome under investigation was a novel disorder.

Copy number from SNP arrays

SNP microarrays can be used to determine copy number variation by looking at

changes in the signal-to-noise ratio between the arrays of interest and a large quantity of

control arrays (Nannya et al., 2005). This analysis was performed by Erik Puffenberger for these patients’ 10K and 50K SNP arrays looking for possible pathogenic copy number changes within the region. The 10K array showed no changes in copy number, but the

50K array indicated one SNP that may have four copies in the patients. This SNP was rs10500322 located at 1,844,240 bp. The flanking SNPs on the 50K array were located at

1,543,577 bp and 2,747,264 bp and no markers within this region were present on the

10K array. The Database of Genomic Variants indicates a number of known copy number variants (CNV) within this region (Figure 6). Two of these encompass a region

that would be covered by rs10500322 and not the two flanking SNPs on the SNP array; variation53272 was seen as a gain in 2/1854 samples and variation4916 was seen as a loss in 4/95 samples. These data suggest that the possible copy number variant in the patients is not the causative mutation. However, further studies were performed to determine whether this was a real copy number change or a miscall by the array.

Figure 6. Copy number variants within the Database of Genomic Variants surrounding the possible four copy SNP rs10500322 seen in the patients.

OBJECTIVES AND RATIONALE

The overall goal of this project was to identify the genetic cause of this novel autosomal recessive disorder. Four objectives were identified that were required for this to be completed successfully.

Objective 1: To narrow down the candidate region. Genome-wide genotyping of DNA from the four patients identified a large region on 16p13.3 that was identical-by-descent in all four patients. Microsatellite markers were used to confirm the region and refine the boundaries by genotyping all available family members.

Objective 2: To prioritize the genes within the candidate region. Gene expression and functional information was sought from human databases and other species-specific databases. High priority genes were expected to have a function in or be expressed in heart, brain, and kidney tissues, which are abnormal in the patients. The expression patterns of genes that did not have this information available were determined by cDNA analysis. The large number of genes within the candidate region and the lack of obvious candidate genes made this type of prioritization difficult, thus an expression

microarray experiment comparing patient samples to an unaffected non-carrier sibling was also performed. It was hoped that the mutation might lead to a reduction in the mRNA levels for the affected gene or affect the expression of other genes in related pathways in the patients.

Objective 3: To identify variants possibly responsible for causing the disorder. To accomplish this objective, the coding regions, intron/exon boundaries, and in some cases the UTRs and conserved regulatory regions of candidate genes were

sequenced. Variants that followed an autosomal recessive transmission pattern and were

not already listed in dbSNP were analyzed for possible effects using software prediction

programs.

Objective 4: To confirm the mutation as causative and delineate its role in

development. Once a potential disease-causing variant was identified, the correct

transmission pattern was confirmed in all family members and a large number of general

population and Hutterite controls were genotyped to ensure it was not a common variant

(not seen within the general population and not seen to be homozygous in unaffected

individuals within the Hutterite population). Additional studies outside the scope of this

thesis will be required to confirm the pathogenicity of this variant.

METHODS

DNA extraction

Informed consent was obtained from the patients and other participating family members and each provided a blood sample. Genomic DNA was extracted from blood samples at the Alberta Children’s Hospital Molecular Diagnostic Laboratory using the

Gentra Puregene Blood purification system. Erythrocytes were first lysed in hypotonic lysis solution and removed, as they do not contain DNA. Leukocytes were then lysed using a cell lysis solution containing an anionic detergent and DNA stabilizers. This was followed by the addition of a protein precipitation solution to remove proteins. The DNA was precipitated in the presence of isopropanol, allowing it to be collected in a pellet. The DNA pellet was then washed with ethanol and re-suspended in DNA hydration solution.

The concentration and purity of the DNA were determined from A260 and A280 readings measured using a spectrophotometer (Pharmacia GeneQuant). Nucleic acid absorbs at 260 nm and protein absorbs at 280 nm; an A260:A280 ratio of 1.8-2.0 indicates a pure DNA sample. The concentration of each sample was calculated from the A260 value, whereby 1 absorbance unit equals 50 ng/ul.
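The concentration and purity calculations described above can be expressed as a minimal sketch. The conversion factor of 50 ng/ul per absorbance unit and the 1.8-2.0 purity window come from the text; the `dilution_factor` parameter and the example readings are assumptions added for illustration.

```python
def dna_concentration_ng_per_ul(a260: float, dilution_factor: float = 1.0) -> float:
    """Double-stranded DNA concentration: 1 A260 unit corresponds to 50 ng/ul."""
    return a260 * 50.0 * dilution_factor


def is_pure_dna(a260: float, a280: float, low: float = 1.8, high: float = 2.0) -> bool:
    """True when the A260:A280 ratio falls within the range expected for pure DNA."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return low <= a260 / a280 <= high


# Hypothetical reading from a sample diluted 1:10 before measurement.
concentration = dna_concentration_ng_per_ul(0.5, dilution_factor=10.0)
pure = is_pure_dna(0.95, 0.5)
```

A ratio below the window would suggest protein carry-over from the precipitation step.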

Fine mapping with the use of microsatellite markers

Identification of microsatellite markers

Microsatellite markers were used to confirm and narrow down the region identified through the SNP genotyping. Four microsatellite markers within or near the region of interest were available in the linkage kit ABI PRISM-MD5 (Applied

Biosystems). Additional markers were chosen from the UniSTS database to obtain dense

coverage, but there still remained a gap greater than 1 Mb at the distal end of the region with no published markers available. For this portion of the region, primers were designed around microsatellite markers identified through UCSC repeat finder and

Tandem repeat finder (Benson, 1999). Microsatellite repeats that were long di-nucleotide or tri-nucleotide repeats were preferred due to their greater polymorphic potential, provided they were not surrounded by numerous other repeats preventing primer design.

The microsatellite marker locations are listed in Table 3. All the primer pairs were synthesized commercially with either a HEX or FAM fluorescent label on the 5’end of one of the primers.

Table 3. Microsatellite markers used for genotyping are listed along with their location, primer sequences, and source obtained.

Marker | Location (bp) | Approximate Expected Size (bp) | Forward Primer | Reverse Primer | Source of Marker
D16S521 | 34,246 | 162-182 | FAM-GAGCGAGACTCCGTCTAAA | CAGCAGCCTCAGGGTT | linkage kit ABI PRISM-MD5
A | 85,349 | ~181 | HEX-TCAGCCAATCACAACGAATA | CACAGGTAACCTAGATCCCTC | designed using UCSC repeat finder
B | 160,831 | ~246 | FAM-GCAAAACTCCGTCTCAA | GGCAGTAGTTCTAGATGTAGC | designed using UCSC repeat finder
E | 890,649 | ~285 | FAM-TACAGGCGCCCGCCAGCATA | CGCAGGCCCACGGGAGGATAAA | designed using UCSC repeat finder
F | 971,260 | ~241 | FAM-CGAGACCTCAGGCGCGAGACG | GGGGTCTCCGGCGCTTC | designed using UCSC repeat finder
H | 1,236,868 | ~207 | HEX-GCCAAGGCTGCAGTGAGTAA | GGGCTGACTGGCTATACG | designed using UCSC repeat finder
J | 1,381,242 | ~211 | FAM-ACTGCAGGCTGGGTGATAG | AAACCAAAAATAGCATTCGTC | designed using UCSC repeat finder
D16S3024 | 1,594,204 | 208-248 | FAM-ACATGCTGTGCCACCT | AGCTGCCAGTATATGGAGGA | UniSTS database
D16S3395 | 1,941,689 | 124-137 | HEX-CTAACCCTCAGCAGAGTTCTG | CCTGGCAGTAAGTCCTGAAA | UniSTS database
D16S3124 | 2,387,586 | 91-103 | HEX-CTGGNTGACAGAGTGAGACC | CCCATTTTCTATTAATTTTTGTG | UniSTS database
D16S3070 | 3,033,868 | 153-173 | HEX-CACGGGAGGTGGAGGT | TGAAAGTGGTTTAAGAGAGCA | UniSTS database
D16S475 | 3,410,510 | 160-189 | FAM-GGTTGACAGAGTGAGACTC | GGAACAGAAAATACTGCACG | UniSTS database
D16S2622 | 3,649,724 | 71-91 | FAM-ACTGCATCCCTTTAAACACTT | TAGCTTGGGTGAAGGAGTGA | UniSTS database
D16S3027 | 3,990,873 | 214-238 | FAM-ATATTTGGCATCTGGGG | CCAGCATGAGTTGCTTT | linkage kit ABI PRISM-MD5
D16S3388 | 4,407,497 | 143-171 | HEX-ACAACCCTGCTTACACCCTG | GGGAAATTCCATCTCCACAA | UniSTS database
D16S3134 | 5,164,472 | 161-174 | FAM-CTGGGAAATTCTGGGA | GGCCAAGGTGTTTGTT | UniSTS database
D16S423 | 5,983,322 | 140-166 | HEX-AACAGGCTTGAAAGTCTCTGTC | GCCTATTTGATAATGCTGTACG | linkage kit ABI PRISM-MD5
D16S3042 | 6,665,972 | 233-243 | FAM-AGCTTTACGTGGACACCAAG | CTACCTATCTGATCCTAGTTGACC | UniSTS database
D16S3088 | 7,155,784 | 203-223 | HEX-CTCTGAATAGGGTGGGGATG | AAGGAAATCTGGGGTGTACG | UniSTS database
D16S418 | 7,579,843 | 174-196 | FAM-TGTNAGGTATGAGACACTGC | CACCTTCTTGCCTTTCATTC | linkage kit ABI PRISM-MD5

PCR amplification of microsatellite markers

DNA from the four patients, the parents, and the seven available unaffected siblings along with a blank (to rule out contamination) were amplified using the polymerase chain reaction (PCR) as per the manufacturer’s protocol (ABI linkage kit).

Forward and reverse primer mix (1.0 ul of 5 uM each primer), true allele PCR pre-mix

(9.0 ul), QH2O (3.8 ul), and DNA (1.2 ul of 50 ng/ul) were added to each reaction tube.

The blank contained the same components except the 1.2 ul of DNA was substituted with

1.2 ul of QH2O. The PCR reaction cycles consisted of an initial denaturation at 95°C for

12 min followed by ten cycles of 95°C for 15 s, 55°C for 15 s, 72°C for 30 s, twenty

cycles of 89°C for 15 s, 55°C for 15 s, and 72°C for 30 s, and a final extension period of

10 min at 72°C and incubation at 4°C until use. The PCR amplified products were

electrophoresed on an automated sequencer (ABI3130XL) in order to determine the

length of the PCR products. Before electrophoresis, PCR products with different

fragment lengths and different fluorescent dye labels were pooled together. PCR product

(1-2 ul of each) was mixed with 0.2 ul of a ladder (400HD ROX) used as a reference for sizing and formamide (total volume in each reaction tube 15 ul) to denature the DNA.
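When many samples are amplified with the same marker, the per-reaction volumes given above are usually scaled into a master mix to which template is added separately. The sketch below shows that arithmetic; the per-reaction volumes are those stated in the text, while the 10% overage for pipetting loss and the function name are assumptions and are not part of the manufacturer's protocol.

```python
from typing import Dict

# Per-reaction volumes (ul) from the text; template DNA is added per tube.
PER_REACTION_UL: Dict[str, float] = {
    "primer_mix": 1.0,
    "pcr_premix": 9.0,
    "water": 3.8,
}
TEMPLATE_UL = 1.2  # DNA at 50 ng/ul, or water for the blank


def master_mix(n_reactions: int, overage: float = 0.10) -> Dict[str, float]:
    """Volumes (ul) of each shared component for n reactions plus overage."""
    if n_reactions < 1:
        raise ValueError("at least one reaction is required")
    scale = n_reactions * (1.0 + overage)
    return {name: round(volume * scale, 2) for name, volume in PER_REACTION_UL.items()}


# Total volume of a single reaction, including template.
reaction_total_ul = sum(PER_REACTION_UL.values()) + TEMPLATE_UL
```

The components sum to a 15 ul reaction, of which 13.8 ul comes from the master mix.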

For microsatellite markers that failed to be amplified by PCR using the standard

ABI linkage kit conditions, attempts were made to optimize the PCR conditions using

Invitrogen Platinum Taq reagents and Qiagen HotStarTaq Plus reagents. The standard

reaction components and conditions for each polymerase were used while varying the

annealing temperature, amount of magnesium (between 1 and 2 mM), and enhancing

additives provided with the kits (for more information see PCR amplification page 43).

However, the microsatellite markers that failed previously continued to fail under these

conditions. These markers and any that were not polymorphic within the families were not included in further analysis.

Analysis and haplotype construction

Fragment analysis to determine product sizes was performed using the software program GeneMapper 4.0 (Applied Biosystems). With the microsatellite results, haplotypes were constructed manually and linkage phase was assigned based on the minimum number of recombinants. The region was confirmed and refined by looking at informative markers that showed homozygosity in the patients without shared homozygosity in the unaffected siblings.

Real-time PCR to analyze suspected copy number variation regions

The Affymetrix 50K SNP microarray indicated a possible small copy number variation in the patients within the region. This was shown by the single SNP rs10500322 located within C16orf73 with an intensity indicating four copies. Two primer pairs, RT1 and RT2, were designed on either side of this variant (Figure 7).

Endogenous control primers that were within regions shown to have normal copy number were chosen to use for comparison. All of these primer pairs were predicted to be free of

self or cross dimerization and resulted in fragments less than 150 bp.

Figure 7. Location of the SNP that showed increased copy number on the Affymetrix 50K microarray and the primer pairs designed on both sides.

Optimization of amplification reactions was performed varying the final concentration of each primer in the reaction from 25-100 nM and varying the quantity of

DNA from 0-40 ng. Real-time PCR was performed adding SYBR Green PCR Master

Mix (Applied Biosystems) (10 ul), primer, and DNA up to a final volume of 20 ul.

Cycling conditions on the Applied Biosystems 9700HT were 50°C for 2 min, 95°C for 10 min, and 40 cycles of 95°C for 15 s and 60°C for 1 min. For subsequent reactions, 100 nM of primer was used because it resulted in an earlier cycle threshold (Ct; the cycle at which fluorescence is first detectable above noise) with no primer dimer (confirmed by melt analysis after PCR). The relative Ct method was used for analysis; it employs amplification with an endogenous control primer set to normalize between samples, thus allowing direct comparison between samples. Test reactions were performed with the RT1, RT2, and control primer sets using serial dilutions of DNA at 40 ng, 20 ng, 10 ng, and 0 ng. When the control and test primer sets are run on the same varying DNA concentrations and have similar amplification efficiencies, the result should look like no fold change, indicating that reliable normalization can be performed. When the control primer sets are run on a constant DNA concentration, doubling the DNA concentration for the test sets should result in approximately a 2-fold change. Based on these results, RT1 and RT2 were able to detect a 2-fold change and TIGD7 1.2Fseq and

1.2Rseq primers were chosen as the endogenous control. Real-time PCR was performed on genomic DNA from patients, a parent, an unaffected non-carrier sibling, and a non-

Hutterite normal control.
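The relative Ct calculation can be sketched as below, assuming the common comparative (delta-delta Ct) form, in which both primer sets amplify with close to 100% efficiency so that each cycle represents a doubling. The function names and Ct values are hypothetical, and the copy number estimate assumes a calibrator sample carrying two copies.

```python
def fold_change(
    ct_target_sample: float,
    ct_control_sample: float,
    ct_target_calibrator: float,
    ct_control_calibrator: float,
) -> float:
    """Relative quantity of the target in a sample versus a calibrator.

    Each Ct is first normalized to the endogenous control (delta Ct); the
    difference between sample and calibrator (delta-delta Ct) is then
    converted to a ratio, assuming one doubling per cycle.
    """
    delta_sample = ct_target_sample - ct_control_sample
    delta_calibrator = ct_target_calibrator - ct_control_calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)


def estimated_copies(ratio: float, calibrator_copies: int = 2) -> float:
    """Copy number implied by a ratio relative to a calibrator of known copy number."""
    return ratio * calibrator_copies


# Hypothetical Ct values: the target crosses threshold one cycle earlier
# in the patient than in the control individual.
patient_ratio = fold_change(24.0, 25.0, 25.0, 25.0)
patient_copies = estimated_copies(patient_ratio)
```

A ratio of 1 corresponds to no change in copy number, and a ratio of 2 relative to a two-copy calibrator would be consistent with the four copies suggested by the array.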

PCR was performed using primers RT2F and RT1R to determine if a small duplication occurred between them that would not be seen in the real-time PCR with either primer set. A patient, parent, unaffected sibling, and control were amplified using

HotStarTaq Plus (Qiagen) with an annealing temperature of 58°C and using Q-solution additive (final concentration 1X) (for complete conditions see PCR amplification page

43).

Prioritization of Genes

Data mining

All known and hypothetical genes within the region were identified using the

NCBI databases and this list was updated as new builds became available up to build

36.3. Databases containing human and model organism information (OMIM, UniGene

EST, PubMed, UCSC, Zfin, FlyBase, JacksonLab, WormBase) were data-mined for functional and expression information to aid in prioritization. The disorder has a combination of clinical features including non-verbal learning disabilities, congenital

heart defects, and genitourinary abnormalities (including the kidney and reproductive

system), thus the gene that causes the disorder is likely to have a role in the development

of the brain, heart, and kidney.

Initially, OMIM was used to identify genes that were associated with other

dissimilar disorders and these were classified as less likely to cause the disorder. The

genomic database UCSC identified orthologs in mice, zebrafish, fly, worm, and yeast and

provided links to the specific organism databases. Knockdown or knockout phenotypes

provided insight into the function of genes and those shown to have a role in brain, heart,

kidney, or genitourinary development were considered more likely to cause the disorder.

Where functional information was unknown, function could sometimes be

inferred by the presence of specific domain motifs. The type of gene product was also considered during prioritization. Initially, it was predicted that a transcription factor may be causative. Transcription factors act in a time- and tissue-specific manner to coordinately activate expression of multiple genes, and thus could affect multiple organs. As prioritization progressed, centrosomally localized genes, cell cycle genes, and DNA damage repair genes were also considered likely causes, as a number of genes causing disorders involving microcephaly are centrosomally localized or involved in DNA damage repair (Griffith et al., 2008). In addition, software prioritization was performed with GeneWanderer; genes causing primary microcephaly were used as the training set for comparison.

Many genes did not have known functional information available, thus expression information was also important during prioritization. The causative gene was expected to be expressed at least in the brain, heart, and kidney at some point during development, with heart and kidney expression likely occurring during fetal development. Human expression information was obtained from OMIM, PubMed articles, UCSC microarray data, and UniGene EST (expressed sequence tags). Whenever more than one source was available, northern blot or RT-PCR data was considered more reliable than microarray and EST data. Model organism expression information was also taken into account when available.

cDNA expression studies

For some genes that appeared plausible, but did not have reliable human expression information available, expression studies were completed. PCR amplification of cDNA (Clontech) from fetal brain, heart, and kidney and adult brain, heart, kidney, lymphocytes, and placenta was performed. Placental tissue expresses a larger number of genes than most other tissues, with 10,908 mRNA and 4189 other ESTs being seen (Chen and Aplin, 2003), thus it was used as a control to indicate that the primers amplify when no expression is seen in the other tissues tested. Primer sets were designed near the 3’ end of 45 genes using the Oligo primer design software (Molecular Biology Insights Inc.). At least one of the primers in each set was designed over an exon/exon boundary in order to detect solely cDNA without interference from traces of genomic DNA in the samples. The primers were designed to be between 24 and 30 bp to match the length of the control G3PDH primers, to increase the chance that they would amplify under the same conditions. Other standard primer considerations were taken into

account (see Primer design page 40). PCR was performed using HotStarTaq (Qiagen) under the following conditions: 10X PCR buffer (2 ul), dNTP (2 ul of 2 mM), primer (2

ul of both forward and reverse at 5 uM), cDNA (2 ul), MgCl2 (15 mM in PCR buffer plus

an additional 0.2 ul of 25 mM to a final concentration of 2 mM), Taq polymerase (0.2 ul

of 5 units/ul), and water in a final volume of 20 ul. Thermocycling conditions for PCR

were 95°C for 1 min, 35 cycles of 95°C for 30 s and 68°C for 3 min. Five microlitres of

product was mixed with 2 ul of orange G running dye and electrophoresed on a 1%

agarose gel. For each primer set, tissues were noted as either having expression or no

expression based on the presence or absence of a band.
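The final concentrations implied by the reaction set-up above follow from simple dilution arithmetic (stock concentration multiplied by volume added, divided by total volume). The sketch below makes two assumptions that the text does not state explicitly: that 2 ul of each primer was added, and that 15 mM is the MgCl2 concentration of the 10X buffer itself.

```python
def final_concentration(stock_conc: float, stock_volume_ul: float, total_volume_ul: float) -> float:
    """Concentration after dilution, in the same units as stock_conc."""
    if total_volume_ul <= 0:
        raise ValueError("total volume must be positive")
    return stock_conc * stock_volume_ul / total_volume_ul


TOTAL_UL = 20.0

dntp_mM = final_concentration(2.0, 2.0, TOTAL_UL)    # 2 ul of 2 mM dNTP
primer_uM = final_concentration(5.0, 2.0, TOTAL_UL)  # 2 ul of 5 uM, per primer

# MgCl2 from the 10X buffer plus the additional 0.2 ul of 25 mM stock.
mgcl2_mM = (
    final_concentration(15.0, 2.0, TOTAL_UL)
    + final_concentration(25.0, 0.2, TOTAL_UL)
)
```

Under these assumptions the MgCl2 contribution works out to 1.75 mM, which is close to the approximately 2 mM final concentration stated in the text.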

Microarray

Microarray analysis was performed to aid in prioritization comparing expression

levels in two patients (sisters from family 1, VII-2 and VII-6) to an unaffected, non-

haplotype carrying sibling (VII-1). This involved extracting high quality RNA from

cultured lymphocytes, performing microarrays at the Southern Alberta Microarray

Facility in duplicate, and analyzing the microarray results.

RNA extractions

A variety of available sample types and RNA extraction methods were attempted using control samples. The quality of each was assessed using the Agilent Bioanalyzer 2100, a sensitive gel system that allows determination of the ribosomal 28S:18S ratio. Two distinct bands with a 28S:18S ratio of 2:1 indicate high sample integrity; degradation decreases this ratio and results in a smeared appearance. Quantity and purity of nucleic acid in the sample were determined using a NanoDrop instrument (Thermo Scientific), which measures A260 (nucleic acid absorbance) and A280 (protein absorbance). Saliva extraction was attempted using the Oragene RNA saliva preservation kit along with Qiagen RNeasy for extraction, but this resulted in very low 28S:18S ratios, likely due to the presence of bacterial rRNA, and the yield of human RNA could not be determined. Extraction from patient blood samples was attempted, but it was difficult to obtain sufficient quantities of RNA regardless of the extraction method, and transport from the site of blood draw to the laboratory allowed time for degradation to occur. For these reasons, samples for creation of lymphocyte cell lines were sent to The Centre for Applied Genomics in Toronto. Samples were obtained for both patients in family 1 and their unaffected sibling, who was predicted not to be a carrier based upon haplotype analysis.
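
The spectrophotometric checks described above can be sketched as follows. This is only an illustrative calculation, not the instrument's software: it assumes the conventional factor of 40 ng/ul per A260 unit for RNA at a 1 cm path length, and the function names and the cut-off of 1.8 are hypothetical choices that are not taken from the thesis.

```python
def rna_concentration_ng_per_ul(a260, dilution_factor=1.0):
    """Estimate RNA concentration from A260 (1 cm path length).

    Assumes the conventional factor of 40 ng/ul per A260 unit for RNA.
    """
    return a260 * 40.0 * dilution_factor


def a260_a280_ratio(a260, a280):
    """Purity ratio; values near 2.0 are generally taken to indicate clean RNA."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def rrna_ratio(area_28s, area_18s):
    """28S:18S ratio from the two ribosomal peak areas; about 2 suggests intact RNA."""
    if area_18s <= 0:
        raise ValueError("18S peak area must be positive")
    return area_28s / area_18s


def looks_intact(area_28s, area_18s, min_ratio=1.8):
    """Hypothetical acceptance rule: the cut-off is an assumption, not from the thesis."""
    return rrna_ratio(area_28s, area_18s) >= min_ratio
```

With these assumptions, a sample such as the lymphocyte RNA reported below with a 2.1:1 ratio would pass the hypothetical integrity rule, whereas a saliva sample dominated by bacterial rRNA would not.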

Total RNA was extracted from the lymphocyte lines using Qiagen RNeasy mini standard protocols along with DNase digestion to remove genomic DNA. Cells were sent to us in RNAlater preservation solution (Qiagen). The cells were pelleted by centrifugation and 600 ul of buffer RLT was added. This solution was vortexed, placed into a QIAshredder spin column, and centrifuged for 2 min at full speed to homogenize. 550 ul of 70% ethanol was added and the sample was mixed by pipetting. The sample was transferred in two batches to an RNeasy spin column and spun for 15 s at 10,000 rpm. Flow through was discarded, the sample was washed with 350 ul of RW1 buffer, incubated at room temperature with DNase I (10 ul DNase and 70 ul RDD buffer) for 15 min, and washed again with 350 ul of RW1 buffer. 500 ul of RPE buffer was spun through and then an additional 500 ul of RPE buffer was added and spun at full speed for 2 min to dry. The sample was eluted from the column with 80 ul of RNase-free water. Quality and yield of the sample were determined; a 2.1:1 ratio of 28S:18S indicated high quality RNA for each sample.

Microarray hybridization procedure

Two aliquots of each sample containing 5 ug of RNA were used to perform duplicate microarray hybridizations. Analysis was performed using Affymetrix U133Plus2 arrays following the standard procedure for one-cycle target labelling, and all reagents used in the procedure were from Affymetrix. cDNA is first synthesized, then biotin labelled, fragmented, and hybridized to the array. Exogenous poly-A RNA controls were added to monitor the entire process independently of the starting RNA samples. Three serial dilutions of the poly-A RNA control stock were performed, resulting in a total dilution of 1:1000. The final dilution copy numbers of the controls were expected to be lys 1:100,000, phe 1:50,000, thr 1:25,000, and dap 1:6667; thus, if the process was successful, these should correspond to the relative signal strengths on the microarray.
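
The expectation that spike-in signal should track the stated copy-number ratios can be expressed as a simple ordering check. The sketch below is illustrative only: the ratios are those quoted above, while the function names and any example signal values are hypothetical and do not come from the GCOS output.

```python
from fractions import Fraction

# Expected relative abundance of each poly-A spike-in, as quoted in the text.
EXPECTED_RATIO = {
    "lys": Fraction(1, 100_000),
    "phe": Fraction(1, 50_000),
    "thr": Fraction(1, 25_000),
    "dap": Fraction(1, 6667),
}


def expected_rank_order():
    """Spike-ins sorted from least to most abundant."""
    return sorted(EXPECTED_RATIO, key=EXPECTED_RATIO.get)


def signals_follow_expected_order(signals):
    """True if measured signals increase strictly in the expected order.

    `signals` maps spike-in name to a measured intensity (hypothetical input).
    """
    ordered = [signals[name] for name in expected_rank_order()]
    return all(low < high for low, high in zip(ordered, ordered[1:]))
```

A check of this kind only confirms the rank order lys < phe < thr < dap; it does not test whether the signal is proportional to the dilution.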

First strand cDNA synthesis was performed. 5 ug RNA, 2 ul diluted poly-A RNA controls, 2 ul T7-Oligo(dT) Primer (50 uM), and RNase-free water (up to 12 ul final volume) were mixed together, incubated at 70°C for 10 min and cooled to 4°C for 2 min. To this was added a mix of 4 ul of 5X 1st strand reaction mix, 2 ul of dithiothreitol (0.1 M), and 1 ul of dNTP (10 mM), and the sample was incubated at 42°C for 2 min. SuperScript II, a reverse transcriptase enzyme, was added (1 ul) and incubation continued at 42°C for 1 hr. The samples were cooled to 4°C and second-strand synthesis was immediately performed with the following mix: RNase-free water (91 ul), 5X 2nd strand reaction mix (30 ul), dNTP (3 ul of 10 mM), E. coli DNA ligase (1 ul), E. coli DNA Polymerase I (4 ul), and RNase H (1 ul). This mix was incubated at 16°C for 2 hr, 2 ul of T4 DNA Polymerase was added, and incubation continued at 16°C for another 5 min. 10 ul of EDTA (0.5 M) was added. Cleanup of double-stranded cDNA was performed immediately at room temperature. cDNA Binding Buffer (600 ul) was added to the cDNA mixture. The mixture was added in two batches to a cDNA Cleanup Spin Column, centrifuged for 1 min at 10,000 rpm, and the flow through discarded. 750 ul of cDNA Wash Buffer was added and the column centrifuged at full speed for 5 min to dry. Elution was performed with 14 ul of cDNA Elution Buffer, incubation for 1 min, and centrifugation at full speed for 1 min.

Biotin labelling of the purified sample was performed by adding the entire sample to RNase-free water (8 ul), 10X in vitro transcription (IVT) Labeling Buffer (4 ul), IVT Labeling NTP Mix (12 ul), and IVT Labeling Enzyme Mix (4 ul) and incubating at 37°C for 16 hr. Then, cleanup of the biotin-labeled cRNA was performed. The sample was diluted with 60 ul RNase-free water. 350 ul of IVT cRNA Binding Buffer was added, the mix was vortexed for 3 s, 250 ul of ethanol (96-100%) was added, the solution was pipetted to mix, and the entire sample was added to a cRNA Cleanup Spin Column and centrifuged for 15 s at 10,000 rpm. The flow through was discarded, the column was washed with 500 ul of IVT cRNA Wash Buffer and centrifuged for 15 s at 10,000 rpm with the flow through discarded, 500 ul of ethanol (80%) was added, and the column was centrifuged at full speed for 5 min to dry. Elution was performed by adding 11 ul of RNase-free water and centrifuging, followed by another 10 ul of RNase-free water and centrifuging. A 1 ul aliquot of each sample was used for quantification of the cRNA and the rest was stored at -20°C until fragmentation and hybridization. A260 was measured using the NanoDrop (Thermo Scientific) and used to calculate concentration; adjusted cRNA yield equaled indicated yield minus total RNA starting material. Fragmentation is designed to break down the cRNA into 35-200 bp fragments using metal-induced hydrolysis. cRNA (20 ug), 5X Fragmentation Buffer (8 ul), and RNase-free water (up to 40 ul final volume) were incubated at 94°C for 35 min and then placed on ice.

Pre-hybridization mix was added to each of the array cartridges through the rubber septa and incubated at 45°C for 10 min. The hybridization cocktail, containing 15 ul labelled and fragmented cRNA, 5 ul of 3 nM control oligonucleotide B2, 15 ul 20X eukaryotic hybridization controls (bioB, bioC, bioD, cre), 30 ul of 2X hybridization mix, and 70 ul of nuclease-free water, was heated for 5 min at 99°C, then 5 min at 45°C, spun at full speed for 5 min, and hybridized to the array for 16 hr while rotating. The Affymetrix Fluidics Station performed array washing and staining. A GeneArray Scanner was used to obtain the signal intensities.

Microarray analysis

Quality control and analysis of the results were performed. Affymetrix GeneChip Operating Software (GCOS) was used to perform quality control, assessing the level of noise, the relative signal intensities of the exogenous poly-A RNA controls as outlined above, and the signal strength of the hybridization controls. All quality control measures were acceptable. The data were imported into the software GeneSifter (Geospiza) for analysis. The GCOS (MAS5) algorithm was used to normalize the signal intensities of the arrays to each other so that comparisons could be performed. Quality calls of present, marginal, or absent were made for each probe set based on the intensity of the perfect match probes compared to the mismatch probes.

Initially, probe sets representing genes within the mapped region of interest that had present or marginal calls were examined for changes. The four arrays representing the patients (duplicates of the two patients) were grouped and compared against the duplicate arrays for the unaffected sibling using a t-test. Probe sets that had a t-test p-value of <0.05 and a fold change of greater than 1.5 were scored as changed. A low stringency was used because it was considered important that no changed genes in the patients were missed, and false positives were not of major concern. As the disorder is recessive, genes that were downregulated in the patients were considered to be of most interest.
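
The scoring rule above (four patient arrays against two sibling arrays, p < 0.05 and fold change > 1.5) can be sketched as below. The exact test settings used by GeneSifter are not stated here, so this sketch assumes an equal-variance two-sample t-test on untransformed intensities; with 4 + 2 arrays there are 4 degrees of freedom, for which the two-sided 0.05 critical value of t is 2.776. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided 0.05 critical value of Student's t for 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def pooled_t_statistic(group_a, group_b):
    """Equal-variance two-sample t statistic (an assumed test variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("no variation within groups; t statistic undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


def fold_change(group_a, group_b):
    """Ratio of group means, reported as a value >= 1 regardless of direction."""
    ratio = mean(group_a) / mean(group_b)
    return ratio if ratio >= 1 else 1 / ratio


def is_changed(patients, sibling, min_fold=1.5):
    """Apply both filters; only valid for the 4-versus-2 design (df = 4)."""
    if len(patients) + len(sibling) - 2 != 4:
        raise ValueError("critical value is hard-coded for 4 degrees of freedom")
    t = pooled_t_statistic(patients, sibling)
    return abs(t) > T_CRIT_DF4 and fold_change(patients, sibling) > min_fold
```

Comparing |t| with the critical value is equivalent to testing p < 0.05 for this design; a general implementation would compute the p-value from the t distribution instead.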

Genome-wide analysis was also performed. The same type of analysis was performed on all probe sets that were present or marginal on the array, but this time the Benjamini and Hochberg multiple testing correction was applied along with the t-test. The adjusted p-value threshold was set at <0.05, which means that about 5% of the probe sets called as significantly altered are expected to be false positives. A fold change cut-off of 1.5 fold or greater was again used. The GeneSifter software allowed for gene ontology and pathway enrichment analysis to be performed. Pathways and gene ontologies that had more genes changed than expected by chance were noted. This was indicated by a z-score greater than 1. However, this indicated a large number of pathways enriched with changed genes, thus my analysis was restricted further to those with a z-score greater than 2.
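
For reference, the Benjamini and Hochberg adjustment can be written in a few lines. This is a minimal sketch of the standard step-up procedure and is not the GeneSifter implementation; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Probe sets whose adjusted value falls below 0.05 would then be passed on to the fold change filter.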

The mutation in the causative gene might affect the expression of genes that interact with the causative gene. These relationships may be direct or indirect; in an attempt to identify indirect relationships not evident by pathway analysis, the software prioritization programs GeneWanderer and Endeavour were used. Genes that were significantly changed in the patients genome-wide with a high fold change (>5 fold difference) were used as the training set to identify genes within the region of interest with high degrees of interaction or similarity to these genes. Prioritization using GeneWanderer provides a single list of the highest ranking genes. Prioritization using Endeavour provides lists of the highest ranking genes in each category as well as a global list combining these rankings. The categories examined include gene ontology, KEGG pathways, cis-regulatory elements, expression (two separate databases), interaction (seven separate databases), and text mining. The top genes in each category were examined rather than the global prioritization in order to increase the chances of identifying genes with less information available.

Confirmation by semi-quantitative PCR

Confirmation of the microarray results was done by semi-quantitative PCR, testing a number of the genes that were changed within the region.

Reverse transcriptase PCR was performed on the two patient samples and the unaffected sibling sample using SuperScript III first-strand synthesis system (Invitrogen).

RNA (500 ng), Oligo dT (2 ul of 50 uM), dNTP (2 ul of 10 mM), and RNase-free water (up to 20 ul total volume) were incubated at 65°C for 5 min. The mixture was cooled to 4°C for 2 min and cDNA synthesis mix was added (4 ul of 10X RT buffer, 8 ul of 25 mM MgCl2, 4 ul 0.1 M DTT, 2 ul 40 U/ul RNaseOUT, and 2 ul 200 U/ul SuperScript III reverse transcriptase). This was incubated at 50°C for 50 min to allow cDNA synthesis, then at 85°C for 5 min to terminate the reaction. RNase H (2 ul of 2 U/ul) was added and the solution incubated at 37°C for 20 min to remove the RNA while leaving the synthesized cDNA.

Primer pairs were designed near the 3’ end of TMEM201, CCDC64B, NME3, LOC124220, SEPX1, CLUAP1, and MAPK8IP3 using Oligo primer design software. At least one of the primers was designed over an exon/exon boundary in order to detect solely mRNA without interference from traces of genomic DNA in the samples. The primers were designed to be 24-30 bp long to match the length of the control G3PDH primers so that they would amplify under the same conditions. Other standard primer considerations were taken into account (see Primer design page 40). PCR was performed using HotStarTaq (Qiagen) with the following reaction: 10X PCR buffer (5 ul), dNTP (5 ul of 2 mM), primer (5 ul of both forward and reverse at 5 uM), cDNA (2 ul from reverse transcription), MgCl2 (15 mM in PCR buffer plus an additional 1 ul of 25 mM to a final concentration of 2 mM), Taq polymerase (0.5 ul of 5 units/ul), and water in a final volume of 50 ul. Thermocycling conditions for PCR were 95°C for 1 min, 22-40 cycles of 95°C for 30 s and 68°C for 3 min. In order to distinguish differences in mRNA amounts, detection needed to take place near the end of log phase, when the product was present in great enough quantities to be visualized but had not yet begun to plateau at saturation levels. For this reason, 5 ul of each product was removed after cycles 22, 26, 30, 34, and 40. The 5 ul of product was mixed with 2 ul of orange G running dye and electrophoresed on a 1% agarose gel. Intensities in the patients and unaffected sibling were compared for each primer set.
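
The stated final MgCl2 concentration can be checked with a short dilution calculation. The sketch assumes that "15 mM in PCR buffer" refers to the MgCl2 content of the 10X buffer, so that the buffer and the 25 mM supplement both contribute to the final concentration; the function name is hypothetical.

```python
def final_mgcl2_mM(buffer_ul, extra_mg_ul, total_ul, buffer_mg_mM=15.0, stock_mg_mM=25.0):
    """Final MgCl2 concentration (mM) from the 10X buffer plus a supplementary stock."""
    from_buffer = buffer_ul * buffer_mg_mM / total_ul
    from_stock = extra_mg_ul * stock_mg_mM / total_ul
    return from_buffer + from_stock


# 50 ul reaction described above: 5 ul of 10X buffer plus 1 ul of 25 mM MgCl2.
semi_quant_mg = final_mgcl2_mM(buffer_ul=5, extra_mg_ul=1, total_ul=50)
```

Under this reading, the buffer contributes 1.5 mM and the supplement 0.5 mM in the 50 ul reaction.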

Sequencing

Primer design

Forward and reverse primers for PCR (1210 primers) and sequencing (519 additional primers) were designed using Oligo Primer Analysis Software. The PCR primers were designed such that the difference in melting temperature was no greater than 5°C between the forward and reverse primers and no greater than 20°C between the expected PCR product and the primers. The length difference between the forward and reverse primers was no larger than 3 bp and was adjusted to aid in matching amplification efficiencies and melting temperatures. The primers were between 18 and 24 bp long. Primers that formed hairpins, formed cross- or self-dimers, or contained a mononucleotide repeat were avoided. The primers chosen were synthesized commercially by IDT (Integrated DNA Technologies). Primer sequences are available upon request.
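
Several of these design rules can be expressed as a simple pair check. The sketch below is not the algorithm used by the Oligo software: melting temperature is approximated with the Wallace rule (2°C per A/T, 4°C per G/C) as a stand-in, the minimum run length treated as a mononucleotide repeat (5) is an assumption, and hairpin, dimer, and product Tm checks are omitted. All names and example sequences are hypothetical.

```python
import re


def wallace_tm(seq):
    """Rough melting temperature: 2 C per A/T plus 4 C per G/C (Wallace rule)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def has_mononucleotide_run(seq, min_run=5):
    """True if any base is repeated at least `min_run` times in a row (assumed cut-off)."""
    return re.search(r"([ACGT])\1{%d,}" % (min_run - 1), seq.upper()) is not None


def pair_passes(fwd, rev, max_tm_diff=5, max_len_diff=3, min_len=18, max_len=24):
    """Check length, length difference, Tm difference, and repeats for a primer pair."""
    if not all(min_len <= len(p) <= max_len for p in (fwd, rev)):
        return False
    if abs(len(fwd) - len(rev)) > max_len_diff:
        return False
    if abs(wallace_tm(fwd) - wallace_tm(rev)) > max_tm_diff:
        return False
    return not (has_mononucleotide_run(fwd) or has_mononucleotide_run(rev))
```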

The primers were designed to cover the coding regions, intron-exon boundaries, and 5’UTRs of each gene, searching for mutations that may have a significant effect on the protein. It was ensured that the primers covered at least thirty base pairs of exon-flanking intron, thus enabling sequence information to be obtained at the intron-exon boundaries. PCR products were no less than 300 bp and no greater than 1500 bp. The sequencing protocol used obtained sequence reads of 500-600 bp, thus additional sequencing primers were required for longer PCR fragments. Sufficient overlap between the sequences obtained for each primer was ensured. Sequencing through polyA or polyT tracts greater than 7 bp long is difficult, thus primers were designed to avoid sequencing through these when possible.
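
Locating the problematic homopolymer tracts in a target sequence is straightforward with a regular expression. This is an illustrative sketch only; the function name is hypothetical, and "greater than 7 bp" is interpreted as a run of 8 or more identical A or T bases.

```python
import re


def long_poly_at_runs(sequence, min_len=8):
    """Return (start, end) positions of A or T runs of at least `min_len` bases.

    Positions are 0-based with an exclusive end, as returned by `re`.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end()) for m in pattern.finditer(sequence.upper())]
```

Sequencing primers could then be placed so that reads start beyond, or end before, each reported interval.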

Genes that were identified as downregulated by the microarray were also sequenced for additional regions that may affect expression, including the promoter, 3’UTR, and some introns. Conserved possible regulatory regions upstream of the downregulated genes were identified using the UCSC genome browser and these regions were sequenced. Figure 8 provides an example of identifying conserved upstream regions and Table 4 indicates the extent to which each of the downregulated genes was sequenced.

Figure 8. Identifying regions of conservation upstream of a gene. An approximately 4 kb region upstream of CCDC64B was sequenced to cover conserved regions (shown by the 7X regulatory potential track, in which the height of the blue blocks indicates conservation) and the potential transcription factor binding sites indicated.

Table 4. The extent of sequencing for each of the downregulated genes

Gene        Coverage
CCDC64B     UTRs, 4400 bp upstream, all introns
NME3        UTRs, 150 bp upstream, all introns
MAPK8IP3    UTRs, 400 bp upstream
CLUAP1      UTRs, 100 bp upstream
SEPX1       UTRs, 300 bp upstream
LOC124220   UTRs, 7500 bp upstream, first three introns
HAGH        UTRs, 80 bp upstream

PCR amplification

PCR amplification was performed for each of the primer pairs. The majority of PCR reactions were performed using Qiagen HotStarTaq Polymerase. Each PCR reaction consisted of 10X PCR buffer (Qiagen) (4 ul), dNTP (4 ul of 2 mM), primer (4 ul of both forward and reverse at 5 uM), DNA (4 ul of 50 ng/ul), MgCl2 (Qiagen) (15 mM in PCR buffer plus an additional 0.8 ul of 25 mM to a final concentration of 2 mM), Taq polymerase (Qiagen) (0.4 ul of 5 units/ul), and water in a final volume of 40 ul. For some fragments, especially those that were GC-rich, Q-solution additive (Qiagen) was used (8 ul of 5X) to lower the melting temperature. Some primer pairs were optimized more successfully using Platinum Taq polymerase (0.4 ul of 5 units/ul) (Invitrogen) in a reaction mix of 10X PCRx Amp buffer (Invitrogen) (4 ul), dNTP (4 ul of 2 mM), primer (4 ul of both forward and reverse at 5 uM), DNA (4 ul of 50 ng/ul), MgSO4 (Invitrogen) (50 mM solution up to a final concentration of 1-2 mM), and water in a final volume of 40 ul. As with HotStarTaq, some fragments were optimized with an Invitrogen additive to lower melting temperature (Enhancer, 4 ul of 10X). For some primer pairs, touchdown PCR cycling conditions were used, with the annealing temperature starting at 68°C and decreasing by 1°C every cycle for 13 cycles until reaching 55°C, after which 23 additional cycles were performed at 55°C. Other genes were optimized for a specific annealing temperature. The conditions were an initial denaturation at 98°C for 5 min followed by 36 cycles of 98°C for 30 s, annealing for 30 s (touchdown or constant temperature), and 72°C for 90 s; a final extension at 72°C for 7 min was then performed and the reactions were stored at 4°C. Gel electrophoresis of the PCR products was performed; 2 ul of orange G running dye was first added to 5 ul of the sample. PCR products, along with 100 bp and 500 bp ladders, were electrophoresed on a 1% agarose gel. The appropriate conditions were identified by the presence of a single, strong band for that sample. Conditions for each primer pair are available upon request.
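
The touchdown programme described above can be written out as a per-cycle annealing schedule. The sketch assumes one reading of the description: 13 cycles stepping down from 68°C to 56°C, followed by 23 cycles at 55°C, which gives the 36 cycles stated. The function name is hypothetical.

```python
def touchdown_annealing_schedule(start_c=68, end_c=55, step_c=1, cycles_at_end=23):
    """Annealing temperature (C) for each cycle of a touchdown programme."""
    temps = []
    current = start_c
    # Step down one increment per cycle until the final temperature is reached.
    while current > end_c:
        temps.append(current)
        current -= step_c
    # Remaining cycles are all run at the final annealing temperature.
    temps.extend([end_c] * cycles_at_end)
    return temps


schedule = touchdown_annealing_schedule()
```

Starting above the expected annealing temperature favours specific priming in the early cycles, and the later cycles at 55°C then amplify the product that has already been enriched.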

Sequencing

As an initial screen, DNA from one affected patient, a parent or non-carrier sibling, and a normal control, along with a blank (to rule out contamination) were PCR amplified under the optimized conditions. The samples were electrophoresed on a 1% agarose gel to ensure that a single PCR product was present and that the blank was not contaminated.

The PCR amplified samples were purified and quantified with the aid of a liquid handler (Tecan) following standard procedures in the Alberta Children’s Hospital Molecular Diagnostic Lab. Purification was performed using the Montage kit (Millipore Inc.), which retains the amplified DNA products on the membrane but does not retain smaller components, including any remaining primers, nucleotides, and other contaminants, which are drawn through by vacuum. The DNA bound by the filter was then eluted using Tris (a low salt buffer). Quantification of the purified product was performed using a fluorescent nucleic acid stain, PicoGreen reagent (Invitrogen) (1:200 dilution with low TE), with the appropriate optical filter and using Lambda DNA prepared at varying concentrations as a standard.

Bi-directional sequencing was performed using the PCR primers and additional sequencing primers. The purified PCR samples were sequenced using the ABI BigDye terminator v1.1 cycle sequencing kit (Applied Biosystems). The sequencing reactions were carried out using Applied Biosystems reagents: 5X sequencing buffer (3.5 ul), BigDye 1.1 (1 ul), amplified DNA samples (10-20 ng), and water up to a volume of 20 ul per reaction tube. Alternatively, some reactions were carried out using a higher concentration of BigDye with 5X sequencing buffer (3 ul), BigDye 1.1 (2 ul), amplified DNA samples (20-50 ng), and water up to a volume of 20 ul per reaction. The conditions for the sequencing reaction were 96°C for 3 min followed by 25 cycles at 96°C for 10 s and 60°C for 4 min. The samples were kept at 4°C until use.

After undergoing the sequencing reaction, the samples were filtered through Sephadex (G50 fine) to remove unincorporated dye-terminators. This purification procedure was performed following the Alberta Children’s Hospital Molecular Diagnostic Laboratory Manual. Sephadex (500 ul of G50 fine, 70 g/L) was added to each required well of two Whatman Unifilter 800 plates. They were centrifuged (Eppendorf Centrifuge 5810R) at 700 rpm for 2 min to remove liquid from the Sephadex. The filter plates containing the Sephadex were then placed on top of a 96-well microplate and the sequencing reaction was placed in the middle of the Sephadex column formed in each well. The two plates were again centrifuged at 700 rpm for 2 min. The samples collected in the microplate were used immediately or, if necessary, lyophilized (using CentriVap) and resuspended in Hi-Di formamide (Applied Biosystems) for later use. The samples were denatured at 95°C for 2 min, followed by 2 min in ice water, in order to remove any secondary structure present. They were electrophoresed on an Applied Biosystems automated sequencer (ABI3130, 16 capillary) within the Alberta Children’s Hospital Molecular Diagnostic Laboratory.

Analysis of sequence

The quality of sequence was examined using Sequence Analysis software (Applied Biosystems). Sequences with unacceptably high background were repeated or, in some cases, new primers were designed. Acceptable sequences were analyzed for mutations using the software program Mutation Surveyor. Sequences obtained for the patient and parent (or sibling) were analyzed through comparison to the normal control sequence and to the GenBank reference sequence file.

Variants that were seen as heterozygous in the patients were included in the haplotypes and served to narrow down the region. Variants that were homozygous in the patients, heterozygous in the parent or not seen in the non-carrier sibling, and not seen in the normal control were examined to determine if they were possibly disease-causing. Variants present within dbSNP did not undergo further consideration, as it was deemed unlikely that a variant causing this very rare disorder would have been noted in other non-Hutterite individuals and be seen within this database. Novel variants were analyzed using software prediction programs to assess their disease-causing potential. Splice-site software that uses consensus sequence algorithms (Flybase and Alamut) was used to determine if the variant may reduce or eliminate the efficiency of an actual splice-site or create a new high efficiency splice-site. For non-silent coding variants, programs were used that look at conservation between species, protein structure, and differences between the original and variant amino acids (PolyPhen and Alamut). In addition, all of the coding variants and a number of the non-coding variants that looked possibly significant were sequenced in up to 150 general population controls to determine if they were common variants. Variants that appeared significant were also sequenced in all other available family members to ensure the variant followed the expected recessive transmission pattern.
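
The first-pass variant triage described above amounts to a small set of genotype rules. The sketch below restates those rules for illustration; the class, field names, and genotype labels are hypothetical and do not correspond to Mutation Surveyor output.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    patient_genotype: str    # "hom_alt", "het" or "hom_ref"
    relative_genotype: str   # genotype in the parent or non-carrier sibling
    relative_is_parent: bool
    control_genotype: str    # genotype in the normal control
    in_dbsnp: bool


def is_candidate(v):
    """Apply the recessive-model filter described in the text."""
    if v.patient_genotype != "hom_alt":
        return False  # heterozygous variants were used for haplotyping instead
    if v.relative_is_parent:
        relative_ok = v.relative_genotype == "het"      # obligate carrier
    else:
        relative_ok = v.relative_genotype == "hom_ref"  # non-carrier sibling
    return relative_ok and v.control_genotype == "hom_ref" and not v.in_dbsnp
```

Variants passing this filter would then go on to splice-site and protein-effect prediction and to sequencing in population controls.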

MAPPING RESULTS

Haplotype and minimal region

A single homozygous region was identified by the 10K and 50K SNP microarrays. The minimal region identified by this method was 5.5 Mb on chromosome 16. To confirm and refine the region, microsatellite markers within and surrounding the region were genotyped in all available family members. Sample electropherograms for a microsatellite marker are shown in Figure 9.


Figure 9. Sample electropherogram for microsatellite marker genotyping. A) Individual homozygous for size 227 at marker D16S3388. B) Individual heterozygous for sizes 219 and 227 at marker D16S3388.

Haplotypes are shown in Figure 10A-C. In family 2, haplotypes indicated the patients were homozygous throughout the entire region examined (Figure 10C). The unaffected siblings did not share any portion of the region homozygously with the patients. Family 1 haplotypes indicated the patients were heterozygous for markers Unkl SNP, J, A, and D16S521, refining the distal boundary to 1,404,019 bp. The unaffected siblings in family 1 also did not share any portion of the region homozygously with the patients. Markers rs2215408, D16S3042, and D16S3088 were different between the patients in family 1 and family 2, refining the proximal boundary to 6,458,669 bp. This resulted in a minimal region of 5.1 Mb (~7-12 cM). It appeared unlikely that additional reductions in the region could be obtained by genotyping with more markers. Obtaining additional patients may have aided in reducing the region’s size but none were identified.
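
The size of the minimal region follows directly from the two boundary positions given above. The short calculation below simply makes that arithmetic explicit; the variable and function names are illustrative.

```python
DISTAL_BOUNDARY_BP = 1_404_019    # Unkl SNP
PROXIMAL_BOUNDARY_BP = 6_458_669  # rs2215408


def region_size_mb(distal_bp, proximal_bp):
    """Physical size of the interval between two boundary markers, in Mb."""
    return (proximal_bp - distal_bp) / 1_000_000


minimal_region_mb = region_size_mb(DISTAL_BOUNDARY_BP, PROXIMAL_BOUNDARY_BP)
```

The interval is 5,054,650 bp, which rounds to the 5.1 Mb quoted in the text.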

Two different haplotype options for family 1 are diagrammed (Figure 10A and B). Option 1 indicates a crossover in both patients in family 1 around the same location (within a 0.5 Mb region, ~2 cM) and an ancestral event changing only the phase of the marker Unkl SNP from T to G (mutation, gene conversion, or two separate crossover events on either side). Option 2 indicates a double crossover in one unaffected sibling within a 2 Mb region (~4-5 cM), a single crossover in another unaffected sibling within a 0.5 Mb region (~2 cM), and an ancestral crossover event changing the phase of marker Unkl SNP and all distal markers. Both options require unlikely events and neither one can be ruled out with certainty. However, these paternal recombination events appear to have occurred around a region with a hotspot for male meiotic recombination (3 kb region at 1.1 Mb from the telomere with an approximately 300-fold increase in male recombination) (Badge et al., 2000). The different haplotype phasings do not affect the size of the region, as the distal boundary is defined by a heterozygous marker (Unkl SNP) in the patients.

Marker D16S3027 located within the homozygous region was seen to be heterozygous in the patients in family 1. It is only one heterozygous marker surrounded by informative markers that were homozygous, and thus is not likely to be a true crossover event redefining the boundaries of the region. This variation likely occurred in an ancestral event and is probably due to slippage during replication of the microsatellite region although a gene conversion event is also possible.

The identified 5.1 Mb region contained 173 known or predicted genes (NCBI build 36.3). No other recessive disorders with similar symptoms are currently mapped to this region. This confirmed that the disorder was unique and gave us a region in which to search for the disease-causing mutation.

Family 1

Marker     Location   Alleles
D16S521    34,246     174 162 174 170
A          85,349     169 173 169 175
B          160,831    242 242 242 242
E          890,649    286 282 286 291
H          1,236,868  195 195 195 199
J          1,381,242  189 189 189 191
Unkl SNP   1,404,019  GT TT
D16S3024   1,594,204  228 230 228 216
D16S3395   1,941,689  133 127 133 127
D16S3124   2,387,586  93 93 93 93
D16S3070   3,033,868  156 162 156 156
D16S475    3,410,510  176 176 176 183
D16S2622   3,649,724  78 71 78 74
D16S3027   3,990,873  229 229 227 219
D16S3388   4,407,497  167 167 167 163
D16S3134   5,164,472  161 165 161 161
D16S423    5,983,322  137 153 137 153
rs2215408  6,458,669  TG TT
D16S3042   6,665,972  232 240 232 232
D16S3088   7,155,784  220 220 220 220
D16S418    7,579,843  186 186 182 190

Marker     Location   Alleles
D16S521    34,246     162 170 162 174 174 170 162 174
A          85,349     173 175 173 169 169 175 173 169
B          160,831    242 244 242 242 242 244 242 242
E          890,649    282 291 282 286 286 291 282 286
H          1,236,868  195 199 195 195 195 199 195 195
J          1,381,242  189 191 189 189 189 191 189 189
Unkl SNP   1,404,019  TT GT GT GT
D16S3024   1,594,204  230 216 228 228 228 216 228 228
D16S3395   1,941,689  127 127 133 133 133 127 133 133
D16S3124   2,387,586  93 93 93 93 93 93 93 93
D16S3070   3,033,868  162 156 156 156 162 156 156 156
D16S475    3,410,510  176 183 176 176 176 183 176 176
D16S2622   3,649,724  71 74 78 78 71 74 78 78
D16S3027   3,990,873  229 219 229 227 229 219 229 227
D16S3388   4,407,497  167 163 167 167 167 163 167 167
D16S3134   5,164,472  165 161 161 161 165 161 161 161
D16S423    5,983,322  153 153 137 137 153 153 137 137
rs2215408  6,458,669  GT TT GT TT
D16S3042   6,665,972  240 232 232 232 240 232 232 232
D16S3088   7,155,784  220 220 220 220 220 220 220 220
D16S418    7,579,843  186 190 186 182 186 190 186 182

Figure 10 A. Haplotype analysis in Family 1 based on microsatellite and SNP markers within and surrounding the homozygous region identified through the genome-wide 50K SNP array. Haplotype option 1, shown here, indicates a crossover in both patients and an ancestral event at Unkl SNP (crossovers or mutation). Recombination in patients VII-2 and VII-6, Family 1, places the distal boundary for the disease gene at Unkl SNP. The proximal boundary is defined at rs2215408 by an ancestral crossover (different alleles in family 1 compared to family 2).

Family 1

Marker     Location   Alleles
D16S521    34,246     162 174 174 170
A          85,349     173 169 169 175
B          160,831    242 242 242 242
E          890,649    282 286 286 291
H          1,236,868  195 195 195 199
J          1,381,242  189 189 189 191
Unkl SNP   1,404,019  GT TT
D16S3024   1,594,204  228 230 228 216
D16S3395   1,941,689  133 127 133 127
D16S3124   2,387,586  93 93 93 93
D16S3070   3,033,868  156 162 156 156
D16S475    3,410,510  176 176 176 183
D16S2622   3,649,724  78 71 78 74
D16S3027   3,990,873  229 229 227 219
D16S3388   4,407,497  167 167 167 163
D16S3134   5,164,472  161 165 161 161
D16S423    5,983,322  137 153 137 153
rs2215408  6,458,669  TG TT
D16S3042   6,665,972  232 240 232 232
D16S3088   7,155,784  220 220 220 220
D16S418    7,579,843  186 186 182 190

Marker     Location   Alleles
D16S521    34,246     162 170 162 174 174 170 162 174
A          85,349     173 175 173 169 169 175 173 169
B          160,831    242 244 242 242 242 244 242 242
E          890,649    282 291 282 286 286 291 282 286
H          1,236,868  195 199 195 195 195 199 195 195
J          1,381,242  189 191 189 189 189 191 189 189
Unkl SNP   1,404,019  TT GT GT GT
D16S3024   1,594,204  230 216 228 228 228 216 228 228
D16S3395   1,941,689  127 127 133 133 133 127 133 133
D16S3124   2,387,586  93 93 93 93 93 93 93 93
D16S3070   3,033,868  162 156 156 156 162 156 156 156
D16S475    3,410,510  176 183 176 176 176 183 176 176
D16S2622   3,649,724  71 74 78 78 71 74 78 78
D16S3027   3,990,873  229 219 229 227 229 219 229 227
D16S3388   4,407,497  167 163 167 167 167 163 167 167
D16S3134   5,164,472  165 161 161 161 165 161 161 161
D16S423    5,983,322  153 153 137 137 153 153 137 137
rs2215408  6,458,669  GT TT GT TT
D16S3042   6,665,972  240 232 232 232 240 232 232 232
D16S3088   7,155,784  220 220 220 220 220 220 220 220
D16S418    7,579,843  186 190 186 182 186 190 186 182

Figure 10 B. Haplotype analysis in Family 1 based on microsatellite and SNP markers within and surrounding the homozygous region identified through the genome-wide 50K SNP array. Haplotype option 2, shown here, indicates a single crossover in one unaffected sibling, a double crossover in another unaffected sibling, and an ancestral crossover event at Unkl SNP. An ancestral recombination at Unkl SNP defines the distal boundary. The proximal boundary is defined at rs2215408 by an ancestral crossover (different alleles in family 1 compared to family 2).

Family 2

Marker     Location   Alleles
D16S521    34,246     174 172 174 174
A          85,349     169 173 169 173
B          160,831    242 242 242 247
E          890,649    286 288 286 286
H          1,236,868  195 191 195 187
J          1,381,242  189 191 189 189
Unkl SNP   1,404,019  TT TT
D16S3024   1,594,204  228 232 228 232
D16S3395   1,941,689  133 127 133 124
D16S3124   2,387,586  93 95 93 91
D16S3070   3,033,868  156 156 156 162
D16S475    3,410,510  176 174 176 167
D16S2622   3,649,724  78 78 78 78
D16S3027   3,990,873  227 230 227 234
D16S3388   4,407,497  167 165 167 158
D16S3134   5,164,472  161 164 161 164
D16S423    5,983,322  137 153 137 143
rs2215408  6,458,669  GT GG
D16S3042   6,665,972  240 232 240 240
D16S3088   7,155,784  222 220 222 212
D16S418    7,579,843  180 178 180 180

Marker     Location   Alleles
D16S521    34,246     172 174 174 174 174 174 172 174 174 174 172 174
A          85,349     173 169 169 169 169 173 173 169 169 169 173 169
B          160,831    242 242 242 242 242 247 242 242 242 242 242 242
E          890,649    288 286 286 286 286 286 288 286 286 286 288 286
H          1,236,868  191 195 195 195 195 187 191 195 195 195 191 195
J          1,381,242  191 189 189 189 189 189 191 189 189 189 191 189
Unkl SNP   1,404,019  TT TT TT TT TT TT
D16S3024   1,594,204  232 228 228 228 228 232 232 228 228 228 232 228
D16S3395   1,941,689  127 133 133 133 133 124 127 133 133 133 127 133
D16S3124   2,387,586  95 93 93 93 93 91 95 93 93 93 95 93
D16S3070   3,033,868  156 156 156 156 156 162 156 156 156 156 156 156
D16S475    3,410,510  174 176 176 176 176 167 174 176 176 176 174 167
D16S2622   3,649,724  78 78 78 78 78 78 78 78 78 78 78 78
D16S3027   3,990,873  230 227 227 227 227 254 230 227 227 227 230 234
D16S3388   4,407,497  165 167 167 167 167 158 165 167 167 167 165 158
D16S3134   5,164,472  164 161 161 161 161 164 164 161 161 161 164 164
D16S423    5,983,322  153 137 137 137 137 143 153 137 137 137 153 143
rs2215408  6,458,669  TG GG GG TG GG GT
D16S3042   6,665,972  232 240 240 240 240 240 232 240 240 240 232 240
D16S3088   7,155,784  220 212 222 222 222 212 220 222 222 222 220 212
D16S418    7,579,843  178 180 180 180 180 180 178 180 180 180 178 180

Figure 10 C. Haplotype analysis in Family 2 based on microsatellite and SNP markers within and surrounding the homozygous region identified through the genome-wide 50K SNP array. The patients in family 2 are homozygous throughout the entire region genotyped. The proximal boundary is defined at rs2215408 by an ancestral crossover (different alleles in family 1 compared to family 2).

Copy number variation in the region

The Affymetrix 50K SNP microarray indicated a possible copy number variation in the patients within the region. This was suggested by a single SNP, rs10500322, located within C16orf73, whose intensity indicated four copies; the variant had a maximum size of 1.2 Mb. A number of copy number variations encompassing this SNP are documented in the Database of Genomic Variants. However, confirming the presence of this apparent variant and examining its breakpoints was important to determine whether it could be causative of the disorder. Real-time PCR was therefore performed with primer sets on both sides of the SNP.

Quality control measures of the real-time PCR reactions were performed and are shown in Figure 11. All of the primers used resulted in efficient amplification (data not shown) and no primer dimer products. Gene expression plots of serially diluted control DNA, normalized against the TIGD7 endogenous control held at a constant DNA concentration, indicated that a doubling of DNA (as would be the case with a copy number increase from two to four) is seen as an approximate log10 change of 0.3 (2-fold). Gene expression plots of serially diluted control DNA normalized against the TIGD7 endogenous control with the same changes in DNA concentration showed log10 changes of less than 0.1. This indicated that differences in DNA concentration between samples could be normalized successfully.
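The correspondence between a doubling and a log10 change of about 0.3, and the effect of normalizing against an endogenous control, can be checked with a short calculation. The following is a minimal sketch only: the function names and the input quantities are hypothetical, and it assumes the plotted value is the log10 of the target signal divided by the control signal, which the text does not state explicitly.

```python
import math


def log10_fold_change(test_quantity: float, reference_quantity: float) -> float:
    """Return log10 of the ratio test/reference."""
    return math.log10(test_quantity / reference_quantity)


def normalized_log10_ratio(target_test: float, control_test: float,
                           target_ref: float, control_ref: float) -> float:
    """Assumed normalization: divide the target signal by the endogenous
    control signal within each sample, then compare test to reference."""
    return math.log10((target_test / control_test) / (target_ref / control_ref))


# A copy number increase from two to four is a 2-fold change.
doubling = log10_fold_change(4, 2)

# Target doubles while the control stays constant: the doubling remains visible.
target_only = normalized_log10_ratio(20, 10, 10, 10)

# Target and control both double (more DNA loaded): the difference cancels.
both_doubled = normalized_log10_ratio(20, 20, 10, 10)

print(round(doubling, 3), round(target_only, 3), both_doubled)
```

Under these assumptions, a true copy number gain shows up as a shift of roughly 0.3, whereas a difference in the amount of DNA loaded gives a value near zero.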

Figure 11. Quality control of real-time PCR. A) An example of a melting curve of primer pair C16orf73RT1 indicating a single product and no primer dimer in the samples or blanks. B) Gene expression plot of serially diluted control DNA for both RT1 and RT2 normalized against the TIGD7 endogenous control with a constant 10 ng of DNA. An approximately log10 0.3 (2-fold) change is seen when doubling DNA concentration. C) Gene expression plot of serially diluted control DNA for both RT1 and RT2 normalized against the TIGD7 endogenous control with the same changes in DNA concentration. Successful normalization is seen, with log10 changes of less than 0.1 when changing DNA concentrations. The error bars show a 95% confidence interval based on triplicate results.

The real-time PCR results obtained from three patients, a parent, an unaffected non-carrier sibling, and a non-Hutterite normal control are shown in Figure 12. Normalization between samples was performed using the TIGD7 endogenous control primers. No 2-fold copy number changes were observed in the patients compared to the controls. For C16orf73RT2, the 95% confidence intervals fell below a log10 change of 0.2 for all samples. C16orf73RT1 had 95% confidence intervals that in some cases came close to a log10 change of 0.3 (2-fold change), decreasing the confidence in the results from this primer set. However, the patients had lower levels than the controls, which is inconsistent with the copy number increase observed in the SNP array. These preliminary results do not confirm the copy number change seen in the SNP array.
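The reasoning above amounts to asking whether the 95% confidence interval of a sample's log10 ratio includes the value of about 0.3 expected for a doubling. The sketch below illustrates that check. The triplicate values are invented, and the interval is assumed to be a Student t interval on three replicates (critical value 4.303 for two degrees of freedom); how the intervals in Figure 12 were actually calculated is not stated.

```python
import math
import statistics

T_CRIT_DF2 = 4.303  # two-sided 95% Student t critical value, 2 degrees of freedom


def ci95_triplicate(values):
    """Return (low, high) of a t-based 95% CI for exactly three replicates."""
    if len(values) != 3:
        raise ValueError("this sketch assumes triplicate measurements")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(3)
    half_width = T_CRIT_DF2 * sem
    return mean - half_width, mean + half_width


def consistent_with_doubling(values, expected=math.log10(2)):
    """True if the CI includes the log10 change expected for a 2-fold gain."""
    low, high = ci95_triplicate(values)
    return low <= expected <= high


# Hypothetical log10 ratios (patient relative to control), not thesis data.
tight_replicates = [-0.05, -0.02, -0.08]
noisy_replicates = [0.05, 0.30, -0.10]

print(consistent_with_doubling(tight_replicates))
print(consistent_with_doubling(noisy_replicates))
```

A tight interval that excludes 0.3 argues against a doubling, while a wide interval that reaches 0.3 is simply uninformative, which is the situation described for the RT1 primer set.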

PCR amplification across the apparently duplicated region using RT1F and RT2R primers resulted in bands of equivalent sizes in a patient, parent, and unaffected control (data not shown). This indicated the absence of a small duplication that would have been missed by the real-time PCR primers.

Figure 12. Gene expression plots comparing intensity levels in the patients to unaffected controls. Normalization between samples was performed using the TIGD7 endogenous control primers. Neither of the two test primer sets, RT1 and RT2, shows a two-fold difference between the patients and controls. However, the RT1 primer set has large 95% confidence intervals and so does not produce significant results.

PRIORITIZATION RESULTS

Prioritization using functional and expression information

Information available for genes in the region was examined. Genes that were found to have a known function or be expressed in the systems affected in the patients were considered to be higher priority. In addition, focus was given to transcription factors (which could cause a multi-system disorder) and genes involved in cell cycle or DNA repair (which have been seen to cause microcephaly). The top ten genes indicated by GeneWanderer software prioritization based on comparison to primary microcephaly genes are shown in Figure 13. The top gene, CREBBP, is an essential gene with a large number of interactions, a known cause of bias in software prioritization. Sixty-eight genes were chosen for sequencing based on functional and expression information (Table 5). Results are described below under Sequencing Results.

Rank  Gene Symbol  Score    Start    End
1     CREBBP       3.437    3716567  3870711
2     UBE2I        0.05677  1299180  1315390
3     TSC2         0.02946  2038599  2078712
4     DNAJA3       0.01881  4415882  4446776
5     HMOX2        0.01708  4466446  4500348
6     TCEB2        0.01486  2761415  2767297
7     SLC9A3R2     0.01433  2016929  2028483
8     NTHL1        0.01399  2029816  2037867
9     TRAF7        0.01205  2145799  2168130
10    PDPK1        0.01065  2527970  2593189

Figure 13. GeneWanderer software prioritization of genes in the region based on comparison to primary microcephaly genes. Eighty-eight of the genes within the region could not be prioritized with GeneWanderer as they did not have the required information available.

Table 5. Genes chosen through prioritization with functional and expression information. Expression Gene Symbol Gene Name FB FH FK B H K L P Information source GLIS2 GLIS family zinc finger 2 After choosing, it was implicated in Nephronophthisis CORO7 coronin 7 EST ++++++++ Inhibits transforming growth factor-beta. Drosophila homolog VASN SLITL2 slit-like 2 (Drosophila) Northern 0 + + + has a role in the nervous system NMRAL1 HSCARG protein EST ++++++0+ ZNF500 zinc finger protein 500 EST ++++0+++ ROGDI leucine zipper domain protein EST ++++++++ N-PAC cytokine-like nuclear factor n-pac Activation CREB1 histone acetylase ZSCAN10 zinc finger protein 206 ZNF205 zinc finger protein 205 Northern +++++C2H2, or Kruppel, type (repression) ZNF213 zinc finger protein 213 C cDNA ++++++++ ZNF263 zinc finger protein 263 Northern +++++May be a transcriptional repressor tigger transposable element derived TIGD7 RT-PCR ++++++++ 7 Has a CEN-P domain. Homolog to mouse jerky family. ZNF75A zinc finger protein 75a C cDNA ++++++++ ZNF434 zinc finger protein 434 C cDNA ++++++++C2H2 KRAB ZNF598 zinc finger protein 598 C cDNA ++++++++ Repression of growth factor gene expression C2H2 and SCAN ZNF174 zinc finger protein 174 Northern +++++0++ domain FLJ14154 hypothetical protein FLJ14154 LOC114984 hypothetical protein FLYWCH2 C cDNA ++++++++ BC014089 FLYWCH1 FLYWCH-type zinc finger 1 C cDNA ++++++++ Possibly contributes to childhood epilepsy or autism spectrum calcium channel, voltage-dependent, CACNA1H Northern + + + disorders. CACNA1C causes Timothy Syndrome. Mouse: alpha 1H subunit Role in cardiovascular system and dorsal root ganlia chromosome 16 open reading frame C16orf28 C cDNA +0++++++ 28 Dropophila: Lethal with recessive knockout. Heterozygotes UNKL unkempt-like (Drosophila) C cDNA ++++++++develop to pharate adults that exhibit an 'unkempt' phenotype; small rough eyes, held out wings and crossed scutellar bristles. 58

Expression Gene Symbol Gene Name FB FH FK B H K L P Information source CREBBP CREB binding protein Causes Rubinstein-Taybi Syndrome C. elegans: Clk-2 is involved in DNA damage and S phase replication checkpoints. Display a pleiotropic phenotype that TEL02 KIAA0683 gene product RT-PCR ++++++++ includes slowing of numerous physiologic processes, an increased life span, and some lethality. chromosome 16 open reading frame C16orf30 C cDNA ++++++++ 30 Cell adhesion and cellular permeability at adherens junctions essential meiotic endonuclease 1 EME2 homolog 2 (S. pombe) XPF (278760)-type flap/fork endonuclease in DNA repair SEC14L5 SEC14-like 5 (S. cerevisiae) C cDNA + 0 + + 0 + 0 Yeast null mutant inviable similar to ciliary rootlet coiled-coil, LOC645811 EST 0 0 0 rootletin Similar to rootletin which is involved in cell cycle splA/ryanodine receptor domain and SPSB3 C cDNA ++++++++ SOCS box containing 3 RPS2 ribosomal protein S2 C. elegans RNAi embyonic lethal, early larval lethal, or slow TBL3 transducin (beta)-like 3 C cDNA ++++++++ growth and patchy NTHL1 nth endonuclease III-like 1 (E.coli) DNA excision repair Mouse: Null mutant normal RAB26, member RAS oncogene RAB26 C cDNA +00+++++ family Mouse: Interactor with MEKK3 (MAP3K3). Mekk3 -/- embryos died at approximately embryonic day 11, displaying TRAF7 TNF receptor-associated factor 7 disruption of blood vessel development and the integrity of the yolk sac. C. elegans slow growth, larval arrest, or no phenotype. Yeast: GBL G protein beta subunit-like C cDNA ++++++++Loss of Lst8p results in cell wall instability and depolarization of the actin skeleton. Homozygous null mice display early embryonic lethality with E4F1 E4F transcription factor 1 mitotic progression failure and increased apoptosis. CCNF cyclin F Essential role in cell cycle. Knockouts exhibit lethality chromosome 16 open reading frame C16orf59 C cDNA +0++++++Not highly conserved 59 59

Expression Gene Symbol Gene Name FB FH FK B H K L P Information source Believed to be part of the proton channel of the vacuolar H(+)- ATPase, H+ transporting, lysosomal ATP6V0C ATPase. Mouse: Knockout exhibited embryonic lethality with 16kDa, V0 subunit c disorganized cells. Mouse: Homozygous mutant mice exhibit embryogenesis 3-phosphoinositide dependent defects, impaired forebrain development, and die by mid PDPK1 protein kinase-1 gestation. Cardiac muscle-specific conditional mutants exhibit thin ventricular walls and die of heart failure. transcription elongation factor B C.elegans: elb-1 encodes an Elongin B ortholog required for TCEB2 (SIII), polypeptide 2 (18kDa, chromosome condensation, segregation during mitosis and elongin B) meiotic division II, and cell proliferation. This kinase phosphorylates and inactivates cell division cycle 2 protein (CDC2), and thus negatively regulates cell cycle G2/M protein kinase, membrane associated PKMYT1 C cDNA +++++0++transition. Drosophila: Involved in eye, meiotic cell cycle, and tyrosine/threonine1 primary spermatocyte. C. elegans knockouts are embryonic lethal or maternal sterile Pericardin (Prc), a novel Drosophila extracellular matrix LOC729729 similar to pericardin CG5700-PB protein is a good candidate to participate in heart tube formation CLDN9 claudin 9 Tight junctions and cellular adhesion CLDN6 claudin 6 C cDNA + + + 0 0 0 0 + Tight junctions and cellular adhesion DnaJ (Hsp40) homolog, subfamily Mouse knockouts lethal, involved in cardiovascular, DNAJA3 Northern + A, member 3 embryogenisis, growth, and behavious chromosome 16 open reading frame Suspected nuclear localization Drosophila: viable and fertile C16orf5 Northern + + + 5 C.elegans: no abnormalities observed RING-containing protein with ubiquitin ligase activity. 
MGRN1 mahogunin, ring finger 1 Mouse: Mutants have dark fur, left-right symmetry defects, and adult-onset spongy degeneration nudix (nucleoside diphosphate NUDT16L1 linked moiety X)-type motif 16-like Nudix seem to be Phosphohydrolase 1 ANKS3 KIAA1977 protein C cDNA ++++++++ bZIP transcription factor like CREB Drosophila: expressed in UBN1 ubinuclein 1 RT-PCR ++++++++ tissues that contain actively dividing and differentiating cells 60

Expression Gene Symbol Gene Name FB FH FK B H K L P Information source family with sequence similarity 86, FAM86A C cDNA ++++++++Methyltransferase activity, yeast mutation inviable member A small nucleolar RNA, H/ACA box SNORA10 EST 00000000 10 small nucleolar RNA, H/ACA box SNORA64 C cDNA ++++0+++ 64 small nucleolar RNA host gene SNHG9 (non-protein coding) 9 small nucleolar RNA, H/ACA box SNORA78 78 SNORD60 small nucleolar RNA, C/D box 60 MIRN940 microRNA940 hypothetical gene supported by LOC440335 C cDNA ++++++++ BC022385; BC035868; BC048326 similar to hypothetical protein LOC342346 C cDNA ++++0++0 DKFZp434P0316 LOC729878 hypothetical protein LOC729878 C cDNA + 0 + 0 ++++ LOC100128510 hypthetical protein LOC100128510 LOC283951 hypothetical protein LOC283951 C cDNA ++++++++ LOC197350 hypothetical protein LOC197350 FLJ39639 hypothetical protein FLJ39639 C cDNA +++0++++Not conserved across species c16orf79 hypothetical protein MGC21830 C cDNA ++++++++ MGC21830 C.elegans: RNAi no abnormalities. Yeast: Dephosphorylation LOC283871 hypothetical protein LOC283871 C cDNA + 0 + + ++++ of histone II-A, but mutants are viable LOC100130436 hypothetical protein LOC10013436 Expression Sources Northern: Northern Blot , RT-PCR, and EST indicate the source the expression information was obtained from in a database. C cDNA refers to expression information obtained using clonetech cDNA. FB: Fetal Brain FH: Fetal Heart FK: Fetal Kidney B: Brain H: Heart K: Kidney L: Lymphoctyes P: Placenta 61

Prioritization based on downregulated genes within the candidate region

Microarrays were performed in duplicate using RNA extracted from cultured lymphocytes. Microarray results from the two affected individuals in family 1 were grouped and compared against their non-carrier unaffected sibling. Probe sets covering genes within the region were given a present, marginal, or absent call (GCOS algorithm). Of the 173 genes within the region, 115 were covered by probe sets on the microarray, and 87 of these 115 were called as present or marginal. Of the genes with present or marginal calls, five were significantly upregulated and seven significantly downregulated, based on a t-test p-value of less than 0.05 and a fold change greater than 1.5 (Figure 14). Some of these changes were confirmed by semi-quantitative PCR (TMEM201, CCDC64B, and MAPK8IP3), while for others the change seen was minimal and more sensitive methods would be required (NME3, LOC124220, CLUAP1, and SEPX1) (Figure 15). Since the disorder is recessive, it was considered more likely to be caused by a downregulated gene resulting from a loss of function mutation. Based on this, the genes chosen for sequencing of coding regions, intron/exon boundaries, UTRs, and conserved possible regulatory regions were NME3, MAPK8IP3, CLUAP1, SEPX1, CCDC64B, LOC124220, and HAGH.
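The selection rule described above (a present or marginal call, a p-value below 0.05, and a fold change greater than 1.5 in either direction) can be written as a small filter. This is a sketch under stated assumptions: the record layout, the function name `classify`, and all gene labels and values are hypothetical, and the fold change is assumed to be expressed as the ratio of patients to the unaffected sibling.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    gene: str           # hypothetical label, not a gene from the thesis
    call: str           # "present", "marginal", or "absent"
    fold_change: float  # patients / unaffected sibling; below 1 means lower in patients
    p_value: float      # t-test p-value supplied by the analysis software


def classify(result: ProbeResult, p_cutoff: float = 0.05,
             fc_cutoff: float = 1.5) -> str:
    """Label a probe result as 'up', 'down', 'unchanged', or 'absent'."""
    if result.call == "absent":
        return "absent"
    if result.p_value >= p_cutoff:
        return "unchanged"
    if result.fold_change > fc_cutoff:
        return "up"
    if result.fold_change < 1 / fc_cutoff:
        return "down"
    return "unchanged"


# Invented example records.
results = [
    ProbeResult("GENE_A", "present", 0.40, 0.01),   # lower in patients
    ProbeResult("GENE_B", "marginal", 2.10, 0.03),  # higher in patients
    ProbeResult("GENE_C", "present", 0.55, 0.20),   # p-value too high
    ProbeResult("GENE_D", "absent", 0.30, 0.01),    # absent call
    ProbeResult("GENE_E", "present", 0.80, 0.01),   # change too small
]

downregulated = [r.gene for r in results if classify(r) == "down"]
upregulated = [r.gene for r in results if classify(r) == "up"]
print(downregulated, upregulated)
```

Because the disorder is recessive, only the genes in the downregulated list would be carried forward for sequencing under this rule.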

Figure 14. Genes within the region with significantly changed expression in the patients compared to the unaffected sibling (t-test p-value <0.05 and fold change >1.5). Red indicates an upregulation in the patients, green indicates a downregulation in the patients, and grey indicates an absent call.

Figure 15. Semi-quantitative PCR of genes downregulated in the region (CCDC64B, NME3, LOC124220, MAPK8IP3, CLUAP1, and SEPX1), a gene upregulated in the region (TMEM201), and G3PDH as a control.

Prioritization using genome-wide microarray data

The genome-wide microarray data were analyzed with the hypothesis that there would be a relationship between the significantly changed genes in the patients and the gene responsible for causing the disorder. GeneSifter software (Geospiza) was used to determine pathways and gene ontologies with more genes significantly changed than expected by chance (z-value >1) and pathways with significantly fewer genes changed than expected by chance (z-value <-1).
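A z-value of this kind compares the number of changed genes observed in a pathway with the number expected by chance. The sketch below uses one common formulation, a hypergeometric z-score; that GeneSifter computes its z-value in exactly this way is an assumption, and the function name and all counts are invented for illustration.

```python
import math


def pathway_z_score(total_genes: int, total_changed: int,
                    pathway_genes: int, pathway_changed: int) -> float:
    """Hypergeometric z-score for over- or under-representation.

    total_genes     : genes measured on the array (N)
    total_changed   : genes meeting the change criterion (R)
    pathway_genes   : measured genes belonging to the pathway (n)
    pathway_changed : pathway genes meeting the criterion (r)
    """
    rate = total_changed / total_genes
    expected = pathway_genes * rate
    variance = (pathway_genes * rate * (1 - rate)
                * (1 - (pathway_genes - 1) / (total_genes - 1)))
    return (pathway_changed - expected) / math.sqrt(variance)


# Invented counts: 10,000 genes measured, 500 changed, pathway of 100 genes.
as_expected = pathway_z_score(10_000, 500, 100, 5)   # matches the expected 5
enriched = pathway_z_score(10_000, 500, 100, 12)     # more than expected
depleted = pathway_z_score(10_000, 500, 100, 1)      # fewer than expected

print(as_expected, round(enriched, 2), round(depleted, 2))
```

With these counts the enriched pathway scores above 1 and the depleted pathway below -1, matching the two thresholds used in the analysis.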

Significantly downregulated pathways included cytokine-cytokine receptor, toll-like receptor, JAK-STAT, and MAPK signalling (Figure 16A). Significantly downregulated gene ontologies included cadmium ion binding (the metallothionein genes), MAP kinase, and interleukin 6 (Figure 16B). Based on these results, genes such as MAPKs, TNFs, ILs, metallo- and stress response genes, and transmembrane receptors of unknown function were chosen for sequencing, including MMP25, TNFRSF12A, and IL32.

Figure 16. Genome-wide analysis of downregulated genes in the patients. A) Pathways with more downregulated genes than expected by chance. B) Gene Ontologies with more genes downregulated than expected by chance.

Significantly upregulated pathways included apoptosis, cell cycle, calcium signalling, and N-Glycan biosynthesis (Figure 17A). Significantly upregulated gene ontologies included NAD+ nucleosidase activity and glycosyltransferase (Figure 17B).

These results supported the continued selection of cell cycle and apoptosis related genes to sequence. In addition, genes involved in glycosylation or related processes and genes encoding proteins known to localize to the Golgi or endoplasmic reticulum were chosen for sequencing, including HS3ST6 and ALG1.

Figure 17. Genome-wide analysis of upregulated genes in the patients. A) Pathways with more upregulated genes than expected by chance. B) Gene Ontologies with more genes upregulated than expected by chance.

Although the mutation in the causative gene would be expected to affect the expression of related genes, such as genes expressed in the same tissues and those that interact with the causative gene, these relationships may be indirect and thus not immediately evident by the pathway analysis performed. In an attempt to identify indirect relationships that may have been missed, the software prioritization programs GeneWanderer and Endeavour were used; genes with the greatest fold change (>5) were used as the training set for comparison. The results from this analysis are shown in Figure 18. The highest priority gene identified by GeneWanderer was PRSS33, which is expressed only in macrophages and thus was not chosen for sequencing. The second ranked gene, TRAP1, was chosen for sequencing. Genes ranking at the top of one of the Endeavour categories were chosen for sequencing, including TNFRSF12A, THOC6, HCFC1R1, SRRM2, and MMP25.

The twenty-five genes chosen on the basis of the genome-wide microarray expression data and pathway, gene-ontology, and software analysis are indicated in Table 6.

Figure 18. Software prioritization of genes in the region based on a training set of genes that showed a 5-fold or greater change in the patients on the microarray. A) GeneWanderer prioritization. The top 16 genes are shown. Eighty-eight of the genes within the region were unable to be prioritized with GeneWanderer as they did not have the required information available. B) Endeavour Prioritization. The top 16 genes in each prioritization category are shown.

Table 6. Genes chosen for sequencing based on genome-wide microarray pathway analysis. Expression Gene Symbol Gene Name FB FH FK B H K L P Information source heparan sulfate (glucosamine) 3-O- HS3ST6 sulfotransferase 6 Suggested endothelial cell growth and migration Mouse: tumor necrosis factor receptor TNFRSF12A EST + + + Null showed reduced liver regeneration, but otherwise superfamily, member 12A normal Cleaves type IV collagen, gelatin, fibronectin, and fibrin. Mouse: homozygous mutation fails to develop neuropathic MMP25 matrix metallopeptidase 25 pain after peripheral nerve injury. They also experience reduced stress and enhanced mechanical coordination. Drosophila mutants viable C.elegans no abnormalities HSP75 TRAP1 TNF receptor-associated protein 1 Northern + + + + + + + + observed Two inflammatory cytokines, TNFA and IL1B induced DNASE1L2 deoxyribonuclease I-like 2 RT-PCR ++ + + + + + + DNAS1L2 expression in a keratinocyte cell line via the NFKB pathway. IL32 interleukin 32 Catalyzes the second step in the formation of the mannose N-acetylglucosamine-1-phosphodiester NAGPA Northern +++++++ + 6-phosphate targeting signal on lysosomal enzyme alpha-N-acetylglucosaminidase oligosaccharides. asparagine-linked glycosylation 1 ALG1 homolog (yeast, beta-1,4- Causes a congenital disorder of glycosylation mannosyltransferase) AMDHD2 amidohydrolase domain containing 2 N-acetylglucosamine metabolic process 3-phosphoinositide dependent protein NCBI classified as pseudogene, UCSC classified as active PDPK2 kinase 1 pseudogene gene nucleotide binding protein 2 (MinD NUBP2 Northern + + + + + + + + Yeast: cytosolic Fe-S cluster biogenesis homolog, E. coli) SRRM2 serine/arginine repetitive matrix 2 Northern + + + + + + + + Regulates alternative splicing of a variety of pre-mRNAs RNA binding protein S1, serine-rich RNPS1 C cDNA +++++++ + Drosophila: Nervous system role C. 
elegans: Locomotion domain abnormal THOC6 THO complex 6 homolog (Drosophila) C cDNA + + + + + + + + WD40 domain, splicing coupling 69

Expression Gene Symbol Gene Name FB FH FK B H K L P Information source Mouse: Mice with homozygous disruption showed ADCY9 adenylate cyclase 9 Northern + + + + + + + + increased IgG1 C.elegans: RNAi no abnormalities observed Mouse: Homozygous null mice exhibit impaired calcium SRL sarcalumenin store functions in skeletal and cardiac muscle cells resulting in slow contraction and relaxation phases. HN1L chromosome 16 open reading frame 34 C cDNA + + + + + + + + CASKIN1 CASK interacting protein 1 C. elegans phenotype appears normal progestin and adipoQ receptor family PAQR4 member IV host cell factor C1 regulator 1 (XPO1 HCFC1R1 C cDNA + + + + + + + + Viral cycle dependent) RNF151 ring finger protein 151 family with sequence similarity 100, FAM100A C cDNA 0 0 + + + + + + member A LOC100128788 hypothetical protein LOC100128788 LOC100132779 hypothetical protein LOC100132779 LOC100128770 hypothetical protein LOC100128770 Expression Sources Northern: Northern Blot, RT-PCR, and EST indicate the source the expression information was obtained from in a database. C cDNA refers to expression information obtained using clonetech cDNA. FB: Fetal Brain FH: Fetal Heart FK: Fetal Kidney B: Brain H: Heart K: Kidney L: Lymphoctyes P: Placenta 70

Genes not chosen for sequencing

A summary of all the genes chosen for sequencing is shown in Figure 19.

Seventy-three genes were not chosen for sequencing; these genes are shown in Table 7.

Genes causing dissimilar disorders, pseudogenes, hypothetical genes not supported by independent evidence, genes with functions not related to the symptoms of the disorder, genes without expression in the systems affected by this disorder, and genes that would have been expected to lead to different microarray results were left unsequenced.

Figure 19. Summary of all genes chosen for sequencing. One hundred genes were chosen for sequencing and seventy-three genes remained unsequenced.

Table 7. Genes that were not chosen for sequencing. Expression Gene Symbol Gene Name Source FB FHFK B H K L P Reason not chosen Causes bone disorder osteopetrosis autosomal recessive CLCN7 chloride channel 7 type 4 A2BP1 ataxin 2-binding protein 1 Disruption causes epilepsy and mental retardation insulin-like growth factor binding protein, Only expressed after birth and inactivated IGFALS in a IGFALS acid labile subunit boy caused delayed puberty and slightly reduced growth. TSC2 tuberous sclerosis 2 Causes dominant tuberous sclerosis MEFV Mediterranean fever Mediterranean fever polycystic kidney disease 1 (autosomal PKD1 Causes polycystic kidney disease dominant) ATP-binding cassette, sub-family A (ABC1), Causes severe neonatal respiratory distress and surfactant ABCA3 member 3 deficiency DNASE1 deoxyribonuclease I Knockout mice have systemic lupus erythematosus NOXO1 NADPH oxidase organizer 1 Mouse mutant exhibits inner ear defects Null mice have limb defects. Involved in Wnt signalling KREMEN2 kringle containing transmembrane protein 2 and microarray did not see a change in Wnt signalling Heme degredation, knockout mice normal except in HMOX2 heme oxygenase (decycling) 2 response to oxygen Oxidative degradation of unsaturated fatty acids. Mice dodecenoyl-Coenzyme A delta isomerase DCI mutants normal until stressed, but increased fatty acids (3,2 trans-enoyl-Coenzyme A isomerase) secreted in urine Would be expected to affect epithelial. Also, mice PPL periplakin RT-PCR + + + knockout normal intraflagellar transport 140 homolog Large gene and contains 3 different frameshifts listed in IFT140 C cDNA+++++++ + (Chlamydomonas) dbSNP Microarray and semi-quantitative PCR indicated this gene is not expressed in unaffected sibling, but is expressed in TMEM204 transmembrane protein 204 patients, thus recessive mutation not likely to cause symptoms Drosophila semi-lethal affects eyes, bristles, wings. 
Not CRAMP1L Crm, cramped-like (Drosophila) C cDNA + 0 + + + + + + expressed in fetal heart. 72

Expression Gene Symbol Gene Name Source FB FHFK B H K L P Reason not chosen Important for translation of mitochondrial proteins, expect MRPS34 mitochondrial ribosomal protein S34 EST + + + + + + + + a more severe mitochondrial phenotype. C. elegans RNAi lethal fumarylacetoacetate hydrolase domain Fumarylacetoacetate disorder not indicated by phenotype FAHD1 containing 1 or on microarray NADH dehydrogenase (ubiquinone) 1 beta NDUFB10 Essential enzyme in electron transport chain subcomplex, 10, 22kDa NPW neuropeptide W Expected to cause a more neuronal specific phenotype solute carrier family 9 (sodium/hydrogen Excretory system: Sodium/hydogen exchanger in SLC9A3R2 exchanger), member 3 regulator 2 proximal tubule, intestine, and colon. NTN2L netrin 2-like (chicken) Axon guidance specific potassium channel tetramerisation domain KCTD5 Drosophila and C.elegans knockout no abnormalities containing 5 Although homologous to serine proteases, it has lost all MGC52282 hypothetical protein MGC52282 EST 0 0 0 0 0 0 0 0 essential catalytic residues and has no enzymatic activity olfactory receptor, family 1, subfamily F, OR1F1 Olfactory system member 1 olfactory receptor, family 1, subfamily F, OR1F2P Olfactory system member 2 olfactory receptor, family 2, subfamily C, OR2C1 Olfactory system member 1 NLRC3 NLR family, CARD domain containing 3 T-cell immune response transcription factor AP-4 (activating enhancer TFAP4 C cDNA + 0 + + + + + + Drosophila phenotype eye leg and wing binding protein 4) mitochondria-associated protein involved in MAGMAS granulocyte-macrophage colony-stimulating Mitochondrial not indicated by microarray factor signal transduction SYNGR3 synaptogyrin 3 Northern + 0 0 0 + Only brain and placenta expression growth factor, augmenter of liver GFER EST Expression liver and testis regeneration (ERV1 homolog, S. cerevisiae) Not expressed in fetal heart or brain and not in mouse ZNF597 zinc finger protein 597 C cDNA 0 0 + + + + + + heart or kidney 73

Expression Gene Symbol Gene Name Source FB FHFK B H K L P Reason not chosen TBC1D24 TBC1 domain family, member 24 C cDNA + 0 + + + + + + Not expressed in fetal heart CEMP1 cementum protein 1 Expression restricted to periodontal ligament C16orf73 chromosome 16 open reading frame 73 EST 0 0 0 0 0 0 0 0 Testis expression PRSS27 protease, serine 27 Northern0000000 0 Pancreatic tryptic serine peptidase PRSS33 protease, serine, 33 EST +0 0 0 0 0 0 0 Predominantly expressed in macrophages TESSP1 testis serine protease 1 Testis specific expression LOC729652 hypothetical protein LOC729652 C cDNA 0 0 + + 0 + 0 0 Not expressed in fetal or adult heart The enzyme is expressed in the airways in a PRSS22 protease, serine, 22 developmentally regulated manner. PRSS21 protease, serine, 21 (testisin) Northern000000+ 0 Testis specific expression RPL3L ribosomal protein L3-like EST 0 + 0 0 0 Northern blot shows major expression is in muscles LOC646174 hypothetical protein LOC646174 C cDNA 0 0 0 0 0 0 0 0 Expression makes it unlikely ZNF200 zinc finger protein 200 Northern0000000 0 Expression highest in testis BTBD12 BTB (POZ) domain containing 12 C cDNA + 0 0 0 0 0 0 0 Expression makes it unlikely c16orf71 LOC146562 hypothetical protein AF447587 C cDNA 0 0 0 0 0 0 0 0 Expression makes it unlikely LOC440337 hypothetical gene supported by AK094332 C cDNA + 0 0 + 0 0 0 0 Expression makes it unlikely MGC45438 hypothetical protein MGC45438 C cDNA + 0 + + + + 0 0 Not expressed in fetal heart similar to Neuronal pentraxin II precursor LOC390667 EST 0000000 0 (NP-II) (NP2) similar to sphingomyelinase, intestinal LOC650177 Similar to sphingomyelinase alkaline similar to ATP-binding cassette, sub-family ABCA17 Pseudogene A (ABC1), member 17 LOC100127948 hypothetical LOC100127948 Pseudogene LOC729935 hypothetical protein LOC729935 Pseudogene LOC100131498 similar to hCG1778991 pseudogene LOC100132433 similar to hCG1778991 pseudogene LOC100131668 hypothetical protein LOC100131668 Pseudogene 
LOC100130126 hypothetical LOC100130126 Pseudogene LOC390671 similar to 60S ribosomal protein L18 Pseudogene LOC100131804 hypothetical protein LOC100130566 Pseudogene LOC100132075 hypothetical LOC100132075 Pseudogene 74

Expression Gene Symbol Gene Name Source FB FHFK B H K L P Reason not chosen LOC646140 similar to ribosomal protein L7-like 1 Pseudogene transcription elongation factor B (SIII), LOC246718 Pseudogene polypeptide 2 (18kD, elongin B) pseudogene LOC100129495 hypothetical LOC100129495 Pseudogene LOC100127963 hypothetical LOC100127963 Pseudogene LOC100131502 hypothetical LOC100131502 Pseudogene nucleophosmin 1 (nucleolar phosphoprotein NPM1P3 Pseudogene B23, numatrin) pseudogene 3 May not be a real gene: Not highly conserved, SINE in LOC100129318 similar to GRTG3118 middle May not be a real gene: Does not match ESTs and many LOC100129334 hypothetical protein LOC100129334 SNPs May not be a real gene: ESTs don't match entire region LOC441744 similar to 40S ribosomal protein S16 lots of SNPs and SINES LOC729639 hypothetical protein LOC729639 May not be a real gene: Overlaps well know genes LOC100128510 hypthetical protein LOC100128510 LOC652276 hypothetical protein LOC652276 Sept12 FLJ25410 hypothetical protein FLJ25410 Expression Sources Northern: Northern Blot, RT-PCR, and EST indicate the source the expression information was obtained from in a database. C cDNA refers to expression information obtained using clonetech cDNA. FB: Fetal Brain FH: Fetal Heart FK: Fetal Kidney B: Brain H: Heart K: Kidney L: Lymphoctyes P: Placenta

SEQUENCING RESULTS

Variants identified

One hundred genes were chosen for sequencing in a patient, parent/sibling, and normal control based on prioritization. The coding regions, intron/exon boundaries, and 5’UTRs of these genes were sequenced. The conserved possible regulatory regions upstream of genes within the region seen to be downregulated in the patients upon microarray analysis were also sequenced. Sequence was obtained using 806 primer pairs. Based on an average of 500 bp of sequence per primer pair, approximately 0.4 Mb of the 5.1 Mb region was sequenced bi-directionally for each individual.
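The estimate of sequence coverage follows directly from the figures given above, as the short calculation below shows. The variable names are illustrative; the three input values are those stated in the text.

```python
PRIMER_PAIRS = 806              # primer pairs used
AVG_BP_PER_PRIMER_PAIR = 500    # average sequence obtained per primer pair
REGION_BP = 5_100_000           # 5.1 Mb candidate region

sequenced_bp = PRIMER_PAIRS * AVG_BP_PER_PRIMER_PAIR
sequenced_mb = sequenced_bp / 1_000_000
fraction_of_region = sequenced_bp / REGION_BP

print(f"{sequenced_mb:.3f} Mb sequenced, {fraction_of_region:.1%} of the region")
```

This gives 0.403 Mb, or roughly 8% of the candidate region, per individual.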

Variants that followed a recessive transmission pattern were identified for further analysis; this meant variants that were homozygous in the patient, heterozygous in the parent or not present in the unaffected non-carrier sibling, and not present in the normal control. Initially, the presence of these variants in dbSNP was determined; the variant causing this rare disorder would not be expected to have been seen previously. Variants identified that followed a recessive transmission pattern, but were present in dbSNP, are listed in Table 8. Variants that followed a recessive transmission pattern and were not seen in dbSNP are listed in Table 9. In silico predictions and sequencing in controls (non-Hutterite individuals) were used to determine the likelihood that the variant was causative of the disorder. Variants that affected amino acid residues that were conserved, significantly affected the coding sequence or a splice site, and were not seen to be common in general population chromosomes were considered more likely to be pathogenic. One variant in THOC6 matched these criteria and more details about this variant are given in the next section. The other variants identified appeared to be benign.
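The filtering logic described in this section can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the genotype codes (N for the normal allele, V for the variant allele), the `Variant` record, and the example variants `var_A`, `var_B`, and `var_C` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    patient: str              # genotype code: "VV", "NV" or "NN"
    relative: str             # genotype of the sequenced parent or sibling
    relative_is_parent: bool  # False means unaffected non-carrier sibling
    control: str
    in_dbsnp: bool


def follows_recessive_pattern(v: Variant) -> bool:
    """Homozygous in the patient, heterozygous in a parent (or absent in a
    non-carrier sibling), and absent from the normal control."""
    expected_relative = "NV" if v.relative_is_parent else "NN"
    return (v.patient == "VV"
            and v.relative == expected_relative
            and v.control == "NN")


def triage(variants):
    """Split recessive-pattern variants into dbSNP-listed and unlisted names."""
    known, novel = [], []
    for v in variants:
        if follows_recessive_pattern(v):
            (known if v.in_dbsnp else novel).append(v.name)
    return known, novel


# Hypothetical example variants, for illustration only.
examples = [
    Variant("var_A", "VV", "NV", True, "NN", in_dbsnp=True),
    Variant("var_B", "VV", "NN", False, "NN", in_dbsnp=False),
    Variant("var_C", "VV", "VV", True, "NN", in_dbsnp=False),
]
known, novel = triage(examples)
print(known, novel)
```

Variants in the first list correspond to the kind reported in Table 8, and those in the second list to the kind carried forward to Table 9 for in silico prediction and control sequencing.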

Table 8. Variants that appeared to have a recessive transmission pattern in the individuals sequenced and were present in dbSNP. These variants were considered unlikely to be causative.

| Gene | dbSNP variant | Type of variant | Gene | dbSNP variant | Type of variant |
|---|---|---|---|---|---|
| CACNA1H | rs2738896 | intronic | CORO7 | rs3747576 | intronic |
| | rs2753322 | intronic | | rs374757 | intronic |
| | rs2738891 | silent coding | | rs2277851 | intronic |
| | rs1054644 | missense coding | | rs2277852 | intronic |
| c16orf30 | rs8448206 | before ex1 | | rs740375 | intronic |
| | rs1057612 | 3'UTR | ZNF174 | rs37813 | intronic |
| CLDN9 | rs221608 | 5'UTR | ZNF213 | rs2735537 | 3'UTR |
| | rs2227269 | 5'UTR | FLYWCH1 | rs207436 | 5'UTR |
| c16orf59 | rs3810764 | silent coding | | rs11644380 | intronic |
| SPSB3 | rs1178432 | silent coding | | rs9928269 | 3'UTR |
| KIAA0683 | rs11344587 | before ex1 | UBN1 | rs9928036 | before ex1 |
| | rs2294604 | 5'UTR | | rs9934930 | intronic |
| | rs2248128 | missense coding | MAPK8IP3 | rs6600164 | before ex1 |
| | rs1124882 | intronic | | rs34520001 | intronic |
| | rs4558903 | intronic | | rs11248888 | intronic |
| | rs45457991 | intronic | | rs2294614 | silent coding |
| | rs4558903 | intronic | | rs57876690 | intronic |
| | rs1124882 | intronic | | rs2917508 | intronic |
| | rs3180228 | silent coding | MMP25 | rs10431961 | silent coding |
| | rs7191986 | 3'UTR | HS3ST6 | rs8046208 | intronic |
| | rs9454 | 3'UTR | | rs1742400 | intronic |
| LOC645811 | rs58293574 | missense coding | | rs1657143 | intronic |
| | rs1164130 | intronic | | rs2906901 | before ex1 |
| | rs8054054 | silent coding | HAGH | rs28416063 | intronic |
| | rs12925636 | missense coding | | rs11641257 | intronic |
| | rs4984841 | intronic | | rs5867758 | intronic |
| | rs4984840 | intronic | | rs55786131 | intronic |
| | rs4984839 | intronic | | rs11860469 | intronic |
| | rs909974 | intronic | | rs11863390 | intronic |
| | rs3751894 | intronic | | rs11862755 | intronic |
| | rs9927323 | intronic | NUBP2 | rs28534374 | intronic |
| | rs9927318 | intronic | TRAP1 | rs2072380 | intronic |
| | rs719060 | intronic | RNPS1 | rs161431 | intronic |
| | rs7192331 | silent coding | ADCY1 | rs12923006 | intronic |
| | rs61743495 | silent coding | NHN1L | rs34148924 | intronic |

Table 9. Variants that appeared to have a recessive transmission pattern in the individuals sequenced and were not listed in dbSNP. Software predictions and sequencing in control chromosomes (non-Hutterite individuals) were used to determine the likelihood that the variant was causative of the disorder. Variants that affected conserved residues, significantly affected the coding sequence or a splice site, and were not seen to be common in general population chromosomes were considered more likely to be pathogenic.

| Gene | Variant | Software Prediction | Presence in control chromosomes |
|---|---|---|---|
| CORO7 | IVS12+49C>T | benign | |
| | 22bp del int19 | benign | |
| ZNF213 | ex1 dupGGGGC | repeated region | |
| CACNA1H | p.A1966V c.5898C>T | benign | 1/70 |
| FLYWCH1 | G>A 133bp upstream ex1 | conserved, no expression change on microarray | |
| | p.A622G c.1865C>G | benign | 17/78 |
| | IVS7-72A>G | benign | |
| MGRN1 | IVS8+228T>C | benign | |
| LOC283871 | p.E151K c.451G>A | benign | 12/76 |
| | IVS7+15G>C | extra acceptor splice site | 4/76 |
| KIAA0683 | p.S317N c.950G>A | benign | 4/154 |
| | IVS17+355insC | benign | |
| LOC645811 | IVS4+110C>T | benign | Yoruba individual* |
| | IVS8+30C>T | benign | |
| | IVS15-60C>T | benign | |
| | IVS18+28C>T | benign | |
| | p.R332W c.994C>T | benign | >50% |
| | p.A720A c.2160G>A | benign | |
| MAPK8IP3 | del repeated region int 13 | repeated region | |
| NME3 | dupA 17bp before ex1 | benign | 3/140 |
| CCDC64B | promoter insertion of SINE | benign | common/80 |
| SRL | IVS1+6G>T | increase splicing efficiency | |
| NUBP2 | IVS52+54G>A | new stronger branch point | |
| HAGH | IVS5-98G>T | benign | Yoruba individual* |
| | IVS5-66G>C | benign | Yoruba individual* |
| THOC6 | p.G46R c.136G>A | pathogenic | 0/300 |

*Refers to an individual with a publicly available genome sequence.

In some genes close to the region’s distal boundary, variants that were heterozygous in the patients were identified (CACNA1H: 1,190,560 bp rs8044363, 1,192,260 bp rs9934839, 1,192,442 bp rs4984636; UNKL: 1,404,019 bp p.F39V). These variants refined the region’s distal boundary to 1,404,019 bp from the telomere, reducing the region in size to 5.1 Mb.

Genes chosen by prioritization, but not sequenced in their entirety due to technical

challenges or the identification of the possible pathogenic variant before completion

included LOC283871 ex 3.2 and ex 4, CREBBP ex 1, TCEB2 ex 1-2, E4F1 ex 4-5,

SNORD60, RAB26 ex 1, TRAF ex 8-14, HAGH ex 1, ADCY9 ex 1, CASKIN, ZNF598 ex

12, FAM86A, and ALG1. The genes ZNF598 ex 12, FAM86A, and ALG1 contained exact sequence matches elsewhere in the genome interfering with sequencing. PDPK1 and

PDPK2, both within the region, were almost identical in numerous places, thus the sequence often detected both at the same time with differences appearing like heterozygous variants in the patients; despite this, no pathogenic variants appeared to be present in either gene. Other regions that were difficult to sequence were GC-rich or contained repetitive sequences.

THOC6 Variant

A variant was seen in THOC6, c.136G>A p.G46R, which was not seen within dbSNP and followed the expected recessive transmission pattern in all family members.

It was predicted to be pathogenic by Alamut (Interactive Biosoftware) based on high conservation and physicochemical difference of the amino acid change (Figure 20). There was also a slight indication that it may have an effect on splicing as Alamut showed a small increase in the strength of a hypothetical pseudo splice donor site and a small decrease in strength of the actual splice donor site (Figure 21A). However, the patient mRNA showed no alternate mRNA species compared to the unaffected sibling when PCR analysis and sequencing were performed across the region using cDNA. Figure 21B indicates no change in product size and sequencing confirmed no change. This variant was not identified within 150 general population controls (300 chromosomes) or 500 Schmiedeleut Hutterite controls (Hutterite controls genotyped by Dr. Carol Ober, University of Chicago).

THOC6 is a WD40 domain-containing protein; this domain tends to form β-propeller structures facilitating binding. THOC6 is a member of the THO and TREX complex (transcription export complex) and is involved in the export of mRNA transcripts from the nucleus to the cytoplasm. Transcripts with export affected due to knockdown of this complex include inducible heat shock proteins. More details regarding this protein will be provided in the discussion.


Figure 20. Variant identified within THOC6. A) Sequence of a control, parent, and patient at the site of the variant indicating a G>A base change. B) Conservation of the variant among species and pathogenicity clues as indicated by Alamut. C) The THOC6 variant is located within the first of seven predicted WD40-like repeats. D) Structures of glycine and arginine indicating the amino acid change in the patients.


Figure 21. Determining the effect of the variant on splicing. A) Alamut splicing prediction showed a small increase in the strength of a hypothetical pseudo splice donor site and a small decrease in strength of the actual splice donor site. B) The patient cDNA indicated no change in splicing as the product size was the same as in the unaffected sibling.

DISCUSSION

Technical challenges

Mapping and prioritization

The use of homozygosity mapping allows rare mutations that are identical-by-descent to be mapped with only a few available individuals (Lander and Botstein, 1987), but this often results in multiple possible regions being detected and/or very large candidate regions. With more closely related patients, more shared homozygous regions and larger homozygous regions are expected. Only one 5.1 Mb region was identified in this study. However, the region was gene rich, containing 173 genes.

Sequence capture of the entire region followed by next generation massively parallel sequencing would provide a method to obtain all of the patients’ variation within the region. Variants could then be examined to determine which are most likely causative. Variants with a confirmed recessive transmission pattern, not present in variant databases, within or surrounding genes of interest to the phenotype, found in conserved residues, predicted to have significant effects, and not seen in controls would be the first to be considered as possibly pathogenic. This method was not an option at the time this project started and it was not reasonable to sequence 5.1 Mb in its entirety by Sanger sequencing methods. Genes needed to be prioritized as to which were most likely to cause the disorder. Sequencing of ninety-seven genes before the potential pathogenic variant was found highlights the difficulties in prioritization. Many mapping projects are faced with similar difficulties.

Several different approaches to prioritize genes were undertaken. This disorder did not have a defined phenotype fitting into a known spectrum of disorders; this made it difficult to know what type of gene would be involved. Similar phenotypes will often be caused by similar or interacting genes (Wu et al., 2008); thus, a comparison was performed to other microcephaly genes, but this did not lead to identification of a possible causative mutation in the most obvious candidates. Functional information was not available for many genes and in some cases it was hard to correlate model organism information (often from large screens) to the patients’ phenotype. Expression information aided in judging some genes as less likely, but a large number of genes remained as plausible candidates. Sequencing sixty-eight genes based on these types of prioritization methods failed to reveal the disease-causing mutation.

Genes were sequenced only for the coding regions, intron/exon boundaries, and

UTRs; this was due to introns and intergenic regions being large with many nucleotide changes which are difficult to interpret. However, there was the possibility that the causative mutation was located within one of the genes chosen for sequencing, but in an unsequenced region (Kuivenhoven et al., 1996, Baralle D. and Baralle M., 2005,

Wiederholt et al., 2007). Examining gene expression changes in the patients provided another means of prioritization as well as a method to detect a regulatory or splicing mutation that may have been missed by sequencing. The microarray data was used to point to genes within the region with changed expression and these genes were sequenced not only for coding regions, but also for UTRs and any conserved possible regulatory regions.

Prioritization of genes by an expression microarray does have a number of limitations. Not all genes were represented on the microarray and not all mutations would result in decreased RNA levels. THOC6 mRNA showed no changes in expression, but the possible causative variant identified in this gene was a missense mutation, which is not likely to affect mRNA levels. Less direct, but also important, are the genome-wide changes in expression induced in the patients presumably by the mutation; it potentially provides a picture of the cellular changes causing the phenotype.

Although not immediately obvious, examination of these changes led to support for the gene THOC6 as discussed later.

Gene expression varies from tissue to tissue, thus selection of tissue is important when performing an expression microarray. The most readily available cell type from the patients is lymphocytes. However, the patients did not have an obvious blood or immune

system phenotype, thus there was the possibility that the causative gene may not be expressed in these cells. Also, when studying genome-wide pathways in lymphocytes there was the expectation that immune pathways would be enriched, possibly biasing the results. There was some evidence of this and when prioritizing based on expression changes, genes solely expressed in lymphocytes were excluded.

Another limitation was the comparison against one individual. It was desirable to use female non-carrier controls within the Hutterite population with as close a relationship as possible in order to minimize the genetic and environmental differences between the patients and controls. However, the use of a single unaffected sibling, rather than multiple controls, does not allow discrimination between gene expression differences in the control individual or changes induced in both of the patients due to the mutation. An additional comparison was attempted grouping a number of database controls together to compare against the patients, but the accuracy of this was questionable due to the variation between microarrays (data not shown).

Copy number variation detection

The 50K SNP microarray identified a possible CNV with a single SNP rs10500322 indicating four copies. The Database of Genomic Variants indicated a number of variants in this region. The preliminary real-time PCR results on both sides of this SNP did not detect any copy number changes. PCR amplification across the region showed no small duplication. This discrepancy is most likely a false positive result from the array. The detection of copy number variations is dependent on comparison between intensities of test and control arrays. Signal to noise ratios are variable and there are differences in signal intensities between arrays caused by changes in PCR kinetics and

hybridization (Nannya et al., 2005). Copy number variations indicated by numerous

contiguous SNPs with changed intensity rather than a single SNP will be more accurate.

However, we cannot rule out the possibility that it is real and was not detected by the

real-time PCR analysis. The C16orf73 RT1 primer set had some variability leading to

large confidence intervals; an additional primer could be used on the same side of the

SNP as RT1 to increase confidence in the real-time PCR results.
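The point that a copy number call supported by several contiguous SNPs is more reliable than one resting on a single SNP can be illustrated with a small filter. This is a sketch only: the per-SNP copy number estimates are made up, and the threshold of three consecutive SNPs (`min_run`) is an assumption chosen for illustration, not a value used in this study.

```python
from itertools import groupby


def supported_cnv_runs(copy_numbers, normal=2, min_run=3):
    """Return (start, end, copy_number) for each run of at least `min_run`
    consecutive SNPs sharing the same non-normal copy number estimate."""
    runs = []
    index = 0
    for value, group in groupby(copy_numbers):
        length = len(list(group))
        if value != normal and length >= min_run:
            runs.append((index, index + length - 1, value))
        index += length
    return runs


# Hypothetical per-SNP copy number estimates along a chromosome segment.
single_snp_signal = [2, 2, 2, 4, 2, 2, 2]   # one isolated SNP at four copies
multi_snp_signal = [2, 2, 3, 3, 3, 3, 2]    # four contiguous SNPs at three copies

print(supported_cnv_runs(single_snp_signal))  # []
print(supported_cnv_runs(multi_snp_signal))   # [(2, 5, 3)]
```

An isolated signal such as the one at rs10500322 would not pass this kind of filter, which is consistent with treating it as a probable false positive pending further real-time PCR.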

Genotyping and sequencing

The large majority of PCR reactions for microsatellite markers and sequencing

were successful with very little optimization required. However, there were

characteristics of some segments that made amplification and sequencing more difficult.

Microsatellite markers were often located in regions with other simple repeats, and a

trade-off existed between selected markers that appeared to have the potential to be

highly polymorphic and those that were in regions apparently easier to amplify. The

majority of microsatellite markers designed were polymorphic in the families, but a

number of them did not amplify successfully despite repeated attempts to optimize.

Regions that contained polyA or polyT repeats could usually be amplified, but

sequencing could not occur past the repeat. This meant some regions were only covered

by sequence in one direction. For some fragments with multiple repeats within or close

to the region desired to be covered, PCR was performed from outside the repeats

followed by the use of different sequencing primers to avoid sequencing through the

repeats, but still covering the entire area of interest in at least one direction. Regions with

very high GC content were difficult to amplify and sequence despite the use of an

enhancing additive which lowers melting temperature. For some GC rich fragments, the

following helped in optimization: decreasing the fragment size to a few hundred base pairs, increasing the melting temperature to 98°C and increasing the melting time (both

PCR and sequencing reactions), increasing the amount of additive to twice the concentration and using this in both PCR and sequencing reactions as well as resuspending product in formamide to keep it denatured during electrophoresis.

FAM86A and ALG1 contained exact matches elsewhere in the genome, interfering with sequencing. Sequencing of ALG1 has been accomplished previously by reverse transcribing mRNA into cDNA using a primer specific to the 3’end of ALG1 and then sequencing (Grubenmann et al., 2004).

PCR always presents the risk of missing alleles due to variants underneath the

primer sites. Touchdown PCR increases this risk. When a primer binds more efficiently to one allele, this allele is preferentially amplified at the higher annealing temperature and will be present in much greater quantity by the time lower annealing temperatures

(sufficient for the primer to bind to the other allele) are reached. This was seen with apparent non-transmission from parent to child occurring for variants in HS3ST6.
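The reasoning behind preferential amplification in touchdown PCR can be made concrete with a toy calculation. This is a deliberately simplified sketch: it assumes a fixed per-cycle efficiency, a 1°C drop in annealing temperature per cycle, and a hard temperature cut-off above which the mismatched allele does not amplify at all. All temperatures, cycle numbers, and the efficiency value are hypothetical, and reagent competition and plateau effects are ignored.

```python
def touchdown_allele_ratio(start_temp, end_temp, final_cycles,
                           mismatch_max_temp, efficiency=0.9):
    """Ratio of matched to mismatched allele product after touchdown PCR.

    Toy model: the annealing temperature drops 1 degree C per cycle from
    start_temp down to end_temp, then stays at end_temp for final_cycles
    cycles. The matched allele amplifies in every cycle; the mismatched
    allele amplifies only when the annealing temperature is at or below
    mismatch_max_temp.
    """
    temps = list(range(start_temp, end_temp, -1)) + [end_temp] * final_cycles
    matched = mismatched = 1.0
    for temp in temps:
        matched *= 1 + efficiency
        if temp <= mismatch_max_temp:
            mismatched *= 1 + efficiency
    return matched / mismatched


# Hypothetical programme: 65 to 55 degrees C touchdown, then 20 cycles at 55.
ratio = touchdown_allele_ratio(65, 55, 20, mismatch_max_temp=58)
print(f"matched allele is about {ratio:.0f}-fold over-represented")
```

In this model the head start gained during the high-temperature cycles is never lost, so the mismatched allele can fall below the level visible on a sequencing trace even though it amplifies normally in later cycles.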

Missing mutations was less of a concern when looking at a recessive disorder caused by an identical-by-descent mutation because both alleles of the patients will be the same; if amplification is seen in the patients, then the allele is being detected. However, initially only variants that appeared heterozygous in the parents were examined further. In the parents, it is possible that the wild-type allele could be missed, resulting in a heterozygote appearing homozygous. All the variants that appeared homozygous in the parents in an amplicon containing no heterozygous variants were re-examined by the same method as described for other variants in analysis of sequence (page 46); none appeared pathogenic.

The other solution to this was to compare against an unaffected non-carrier sibling rather than the parent (explanation shown in Table 10).

Table 10. The possible consequences of allele dropout when comparing a variant in the patients against a parent or non-carrier sibling.

| Compared individual | Genotype | Description | Assumption | Allele dropout in heterozygote |
|---|---|---|---|---|
| Parent | VV | Homozygous | Variant not significant | |
| | NN | No variant to transmit | Something is wrong, non-transmission | |
| | NV | Heterozygous | Variant could be significant | N dropout: Appears VV. Appears not significant, but could actually be significant |
| | | | | V dropout: Appears NN. Something is wrong, non-transmission |
| Non-carrier sibling | VV | Both parents homozygous for variant | Variant not significant | |
| | NN | Both parents heterozygous | Variant could be significant | |
| | NV | One parent homozygous for variant | Variant not significant | N dropout: Appears VV. Appears not significant and is actually not significant |
| | | | | V dropout: Appears NN. Appears possibly significant and is actually not significant |

N: Normal; V: Variant in patient
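The logic of Table 10 can also be expressed as a small lookup, which makes it easy to check which comparison could cause a causative variant to be discarded. This is a sketch of the table's reasoning only; the dictionary and function names are illustrative, and it assumes the patients are homozygous (VV) at the site.

```python
# Interpretation of the relative's observed genotype at a site where the
# patients are VV (N: normal allele, V: variant allele), following Table 10.
INTERPRETATION = {
    ("parent", "VV"): "not significant",
    ("parent", "NV"): "could be significant",
    ("parent", "NN"): "non-transmission, something is wrong",
    ("sibling", "VV"): "not significant",
    ("sibling", "NV"): "not significant",
    ("sibling", "NN"): "could be significant",
}

# True genotype expected in each relative if the variant is the causative
# recessive mutation (obligate carrier parent, non-carrier sibling).
TRUE_IF_CAUSATIVE = {"parent": "NV", "sibling": "NN"}


def observable(true_genotype):
    """Genotypes that may be observed when one allele of a heterozygote can
    fail to amplify; homozygotes are unaffected by allele dropout."""
    if true_genotype == "NV":
        return {"NV", "VV", "NN"}
    return {true_genotype}


def can_miss_causative(relative):
    """True if allele dropout could make a causative variant look benign."""
    true_genotype = TRUE_IF_CAUSATIVE[relative]
    return any(INTERPRETATION[(relative, seen)] == "not significant"
               for seen in observable(true_genotype))


print(can_miss_causative("parent"))   # True
print(can_miss_causative("sibling"))  # False
```

With a parent, dropout of the normal allele makes the heterozygote look VV and the variant is dismissed; with a non-carrier sibling, dropout can only produce a false lead that follow-up would exclude, never a missed causative variant.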

THOC6

A possible pathogenic variant was detected in THOC6. It was a homozygous missense variant in the patients, causing a glycine to arginine amino acid change at position 46 (NM_024339.2), and was predicted to be pathogenic by Alamut and PolyPhen based in part on high conservation amongst species. This variant was not seen within dbSNP and was not detected in 150 general population controls or 500 Schmiedeleut Hutterite controls. This indicated it was a very rare variant, consistent with what would be expected for the disease-causing allele of this rare disorder. Also, it appears the variant arose in the Dariusleut as it was not seen within the Schmiedeleut.

THOC6 (fSAP35) is part of the nuclear THO complex that couples transcription and mRNA processing to nuclear export (Rewinkel et al., 2004 and Masuda et al., 2005).

The THO complex components bind together in equal proportions and are considered to be a component of the larger TREX complex (transcription export complex) (Strasser et al., 2002 and Reed and Cheng, 2005). The THO and TREX complexes were initially discovered in yeast (Chavez et al., 2000) and were later found in higher organisms from

Drosophila to humans with high levels of conservation (Rewinkel et al., 2004 and

Masuda et al., 2005). Figure 22 indicates the components of the THO and TREX complexes in yeast and higher organisms. If pathogenicity is confirmed for THOC6, this will be the first human mutation identified affecting the THO/TREX complex.

Figure 22. Components of the THO and TREX complex in yeast, Drosophila, and human. The components with orthologs between yeast and the higher organisms are highlighted in blue and underlined. The Drosophila and human THO complex shown within the circle is composed of THOC1, THOC2, THOC5, THOC6, and THOC7 (Reed and Cheng, 2005). The Drosophila and human TREX complex shown within the box is composed of the THO complex proteins and the additional proteins UAP56, ALY (THOC4), and TEX1 (THOC3) (Reed and Cheng, 2005).

An overview of mRNA export is shown in Figure 23. UAP56, which is a member of the TREX complex, binds to already processed transcripts and recruits ALY (Carmody and Wente, 2009). ALY binds near the cap region of the transcript, directly interacting with CBP80, a cap protein (Cheng et al., 2006). ALY recruits NXT1 and NXF1. NXT1 and NXF1 have a well-defined role in mRNA export, binding transcripts and docking to the nuclear pore complex via interacting with the phenylalanine-glycine repeats in nucleoporin proteins (Carmody and Wente, 2009). After passage through the nuclear pore complex, DBP5 and GLE1 bind and remove NXF1 and NXT1 in an ATP dependent manner, preventing the transcript from returning through the nuclear pore complex to the nucleus (Carmody and Wente, 2009).

Figure 23. Diagram illustrating components involved in mRNA export from the nucleus to the cytoplasm. UAP56 binds to already processed transcripts and recruits ALY. ALY recruits NXT1 and NXF1, and NXT1 and NXF1 target the transcript to the nuclear pore complex via interacting with the phenylalanine-glycine (FG) repeats in nucleoporin proteins. After passage through the nuclear pore complex, DBP5 and GLE1 remove NXF1 and NXT1, preventing the transcript from returning through the nuclear pore complex (Carmody and Wente, 2009).

The THO complex is thought to be involved in this mRNA export process, but its

role is not yet fully understood. In yeast, this complex was found to have a role in

coupling transcription to mRNA export (Strasser et al., 2002). Yeast mutant knockouts for THO complex factors show defects in transcription and accumulation of poly(A) RNA in the nucleus (Chavez et al., 2000, Strasser et al., 2002). Likely due to the retention of the transcript at the site of transcription, mitotic hyper-recombination and degradation of the 3’ ends of mRNA by the exosome occur (Libri et al., 2002 and Garcia-Rubio et al., 2008). Initial studies in human cell lines found THOC1 to bind

DNA, interact with RNA Polymerase II, and function in transcriptional elongation as was previously seen in yeast (Li et al., 2005). However, recent literature has favoured the hypothesis that in metazoan organisms the THO complex couples splicing with mRNA export because the THO complex components have only been found bound to already processed transcripts (Masuda et al., 2005). Export of transcripts to the cytoplasm by

TREX was determined to be both cap and splicing dependent (Cheng et al., 2006). It is tempting to speculate that the THO complex which binds to both DNA during transcription and mRNA after processing may be transferred during transcription termination and polyadenylation from the DNA to the 5’end of the newly processed transcript, thus coupling transcription, processing, and quality control to mRNA export.

This would be a mechanism that allows for high levels of coordination in both yeast and metazoan organisms independently of whether the genes have introns or are intronless. A sub2 mutant in yeast showed the formation of heavy chromatin containing DNA, RNA, and nuclear pore complex components with the DNA being held in close proximity to the nuclear pore (Rougemaille et al., 2008). This could be accounted for by the inefficient transfer of export components from DNA to RNA.

There has also been an indication that THO complex proteins may be microtubule associated. THOC2 in plant cells was seen to bind to microtubules, to localize with spindles during mitosis, to localize with cytoplasmic microtubules during interphase, and to co-localize with the nucleolus in non-dividing cells (Hamada et al., 2009). THOC1

was also seen to have differing expression and localization based on cell-cycle and apoptosis in HeLa and other human cell lines (Gasparri et al., 2004). This could indicate an alternate function for this complex, but this complex does not appear to be essential for microtubule morphology (Hamada et al., 2009). Alternatively, this localization could be a mechanism of regulation of mRNA export during stages of the cell cycle and also possibly mRNA transport within the cytoplasm.

The details described above, obtained from numerous biochemical studies, have led to the hypothesis that the THO/TREX complex is involved in the export of a variety of mRNA transcripts from the nucleus to the cytoplasm. However, many of the details of this process are speculative. Examination of the molecular phenotype of knockdowns through expression studies provides a different picture and offers insight into possible mechanisms of pathogenicity of the THOC6 mutation. Knockdowns of TREX complex components and other mRNA export factors resulted in impairments to the export of a large number of mRNA transcripts, as expected. Uap56 and Nxf1 knockdowns in

Drosophila appeared to affect export of the majority of mRNA transcripts (~75%)

(Herold et al., 2003) and knockdown of ALY in HeLa cells caused a ~70% accumulation of bulk poly (A) RNA (Katahira et al., 2009). However, knockdowns of THO complex components have produced different results. In a Drosophila embryonic-derived cell line, an RNAi knockdown of Thoc2 revealed decreases in cell proliferation and defects in export of inducible heat shock protein mRNA, but normal export of the majority of mRNA transcripts (>88%) (Rewinkel et al., 2004). Knockdown of THOC5 in HeLa cells failed to show an accumulation of bulk mRNA, but did show defects in the export of inducible HSP70 transcripts after heat shock (Katahira et al., 2009). In addition, yeast

mutants for THO complex components showed lethality at increased temperatures and an inability to export certain heat shock protein mRNA after heat shock (Chavez et al.,

2000). An apparent contradiction exists between the results from biochemical methods indicating the central role of the THO complex in the export of a variety of transcripts and the results from knockdown expression studies indicating defects in the export of only a small number of transcripts. One suggestion for this was that for the majority of transcripts, a non-physiological mechanism can substitute for the THO complex role in its absence (Masuda et al., 2005). Another possibility is that stress inducible transcripts with rapid transcription induction have an increased requirement for THO complex components and that the expression studies performed with knockdowns only detected these highly noticeable changes in export; the remaining levels of THO complex proteins may have been sufficient for apparent normal export of other transcripts. This hypothesis could be examined by looking at export in complete knockouts of THO complex components.

Regardless of the mechanism acting, knockdown of each THO complex component, including THOC6, appeared to specifically affect the export of inducible heat shock protein transcripts, but not other transcripts. Small interfering RNA knockdown of THOC6 in HeLa cells showed a defect in export of HSP70 transcripts, but no accumulation of overall polyA RNA (Katahira et al., 2009). The microarray results from our experiment indicated that strongly inducible HSP70 mRNAs were highly downregulated in the patients, including HSPA1A (19.74 and 18.6 fold down), HSPA1B (11.75 fold down), and

HSPA6 (8.21 and 7.67 fold down). This was unexpected as total mRNA was used in the microarray and knockdown of THOC6 would be expected to affect export and not total

mRNA levels for HSPs. However, knockdown of Uap56 and Nxf1 in Drosophila also led to significant decreases in total mRNA in addition to affecting export (Herold et al., 2003). Knockdown of THOC1 also affected expression levels of some transcripts in addition to export (Li et al., 2005). In another microarray result of interest, the stress-inducible metallothionein 1 family showed large fold downregulation in the patients (3 fold to 49 fold down). Although additional studies would be required to determine if this is a direct or secondary effect of the export defect, it raises the possibility that the mutation in the patients may not only be affecting the export of stress inducible heat shock protein transcripts, but also of other transcripts inducible by external stresses.

Interestingly, a decrease in the amount of HSP70 (due to a defect in export of the mRNA) could also explain the other expression changes in the patients seen on the microarray, providing support for the pathogenicity of the THOC6 variant. HSP70 is involved in preventing apoptosis, is thought to be involved in a number of developmental processes, and has been shown to have a protective role against stress in numerous tissues including the brain (Kabani and Martineau, 2008). Genes involved in apoptosis were highly upregulated in the patients and survival genes were downregulated, which could be caused by a lack of HSP70 (Figure 24). Although intracellular HSP protects against apoptosis, extracellular HSP has been found to upregulate cytokines acting through certain toll-like receptors, resulting in the phosphorylation of IRAK which activates NF-kB and MAPK signalling (Asea et al., 2002 and Kim et al., 2009). As could be caused by reduced HSP70, pathways that appeared to be significantly downregulated in the patients included cytokine signalling, toll-like receptor signalling, and MAPK signalling.

Glutamine is an essential factor in the stress response by causing the upregulation of

HSPs through the hexosamine biosynthetic pathway (a glycosylation pathway) (Hamiel et

al., 2009). Genes upregulated in the patients included glycosyltransferase activity,

glycan biosynthesis, and some related pathways. The upregulation of these in the

patients may be a regulatory compensation effect.

Figure 24. Apoptosis pathway indicating genes that were upregulated or downregulated in the patients on the microarray.

Consistent with its role in mRNA export, the TREX complex appears to have widespread tissue expression. All THOC genes with in situ performed in zebrafish, including thoc6, showed whole organism expression from 1-cell to Pec-fin (zfin.org). Mouse embryo in situ data was available for Thoc1-5 and Thoc7 showing widespread expression (genepaint.org). Human THOC6 was tested for expression using semi-quantitative PCR with cDNA from fetal brain, heart, and kidney and adult brain, heart, kidney, lymphocytes, and placenta and showed expression in all of these tissues. The expression pattern does not rule it out as a cause of this novel multisystem disorder. It is notable that an apparently ubiquitously expressed gene may be causing this mild, yet specific phenotype. However, many of the known genes causing disorders affecting the development of the nervous system are widely expressed genes involved in cell cycle,

DNA damage repair, and protein synthesis. For the majority of these disorders, the cause behind only some tissues being affected severely is not yet clear. In the case of the THO complex, there is no evidence for redundancy in any tissues, but it seems plausible that

some tissues may experience greater stresses during development. The central nervous

system is predicted to have increased sensitivity to stresses due to carefully coordinated

levels of rapid cell division and apoptosis while other tissues may experience some

variance in stress levels during development in different individuals.

Knockdowns of THO complex components confirm that members of this complex

are capable of causing a mild, multisystem disorder. A knockdown (insertional mutant) of

thoc2 in zebrafish caused a small head and eyes, abnormal jaw/facial structure, mild

pericardial edema of the heart, liver enlargement, and underdevelopment (Amsterdam et

al., 2004). Hypomorphic Thoc1 mutant mice demonstrated a proportional decreased size

of approximately 80% the weight of wild type mice, both males and females were infertile, and male testis development was impaired (Wang et al., 2007 and Wang et al.,

2009). These phenotypes do indicate the possibility of a THO complex dysfunction causing microcephaly, reduced size, heart defects, and genitourinary abnormalities. In contrast to this, complete knockout of THO complex factors and knockdown of other essential export factors (not part of the THO complex) show a much more severe phenotype. Mouse knockout of Thoc1 showed embryonic lethality around the time of implantation (Wang et al., 2006). Nxf1 knockdown in zebrafish causes severe central nervous system and body necrosis and death before day 3 of development (Amsterdam et al., 2004). Human mutations in GLE1, a factor essential for the export of mRNA, cause a fetal motor neuron disorder resulting in prenatal death at thirty-two weeks (Nousianen et al., 2008).

The observations that knockdown of Nxf1 caused a defect in the export of the majority of mRNA transcripts and a severe phenotype, while knockdown of THO complex components caused an export defect of only stress inducible transcripts and a milder phenotype, are consistent with what is expected. The severe phenotype in the complete Thoc1 knockout compared to the hypomorph with greatly reduced levels of Thoc1 is consistent with the hypothesis that residual levels of the protein in the knockdown, but not the knockout, may be sufficient for export of the majority of transcripts except for stress inducible transcripts. However, export of transcripts was not examined in cells from either of the mouse mutants, so the cause of the difference in phenotype has not been tested directly. Although the mechanism is unclear, it is likely that the mild phenotype observed in the patients is the result of reduced activity caused by the missense variant rather than a complete loss of function. There is also the possibility that different THO complex components have differing roles or importance.

If pathogenicity is confirmed, this will be the first human mutation identified affecting the THO/TREX complex. For the families participating in this study and additional Dariusleut Hutterite families, identification of the causative gene will allow for diagnostic and carrier testing. In addition, it may provide important information on increased sensitivities or risks the patients may experience. The patients showed greatly reduced levels of metallothionein and heat shock transcripts on the microarray; if confirmed, these changes may indicate an increased sensitivity to environmental stresses. For example, metallothionein knockout mice have increased sensitivity to acetaminophen- and alcohol-induced liver damage, which occurs at lower doses (Liu et al., 1999 and Zhou et al., 2002).

Although only four patients within the Hutterite population have currently been identified, it is likely that additional patients will be identified within different populations, and testing will be possible once the causative gene is identified. However, mutations within THOC6 are unlikely to be the sole cause of this disorder, as THOC6 is a member of the THO complex, which, based on current information, appears to act as a single entity within the cell. Of particular interest is THOC2, which has been shown to be an important member of the complex and is located on the X chromosome, raising the possibility of an X-linked disorder. Given the mild phenotype and the large number of X-linked neurological impairments without known causes, there is a high likelihood that mutations in THOC2 exist but remain undetected.

The mutation in THOC6, if confirmed, may provide a link between export of stress inducible transcripts and a neurodevelopmental disorder. As more information is obtained regarding the THO complex, the exact mechanism behind pathogenicity will become clearer. It appears likely that the phenotype is at least partially due to the lack of stress inducible proteins during times of cellular stress. Heat shock proteins respond to a variety of stresses including temperature stress, toxins, infections, UV and X-ray irradiation, and physical cell trauma (Khlebodarova, 2002). Defects in the stress response are expected to lead to increased levels of apoptosis in the unprotected cells, presumably leading to the abnormalities seen in the patients either directly or through altered pathways. This would provide interesting insight into the role of stress inducible transcripts in development, which, due to redundancy within the protein families, cannot be seen by mutations in the individual stress inducible genes. Mutations in this complex raise the intriguing possibility of studying genes that have a direct role in an individual’s response to the environment. Environmental stresses would be expected to have a large effect on the severity of phenotype for this monogenic disorder, and identifying additional patients would allow phenotypic variance to be examined between individuals within the Hutterite population and between different populations. The connection made between the THO complex and neurodevelopment may also have implications for complex disorders. For example, THOC2, within a suicide susceptibility locus, was found to have reduced expression levels in the brains of suicide patients (Fiori et al., 2009). In addition, autism, a common neurodevelopmental disorder with delayed development of speech, impaired social interaction, and sometimes mental retardation, has been studied for possible genetic factors predisposing individuals to environmental stresses (including metallothioneins and heat shock proteins) (Walker et al., 2006).

FUTURE DIRECTIONS

The pathogenicity of the novel missense variant identified has yet to be established. This task is both more important and more challenging because the variant is a missense change in a gene and pathway not previously implicated in any disorder. With the publishing of a clinical and mapping paper, it is hoped that additional Hutterite or non-Hutterite patients with a similar phenotype will be brought to our attention. Identification of the same very rare mutation in homozygous form in other similar Hutterite patients, or identification of different THOC6 mutations in a non-Hutterite patient, would greatly increase our confidence in its pathogenicity.

The variant did not appear to have an effect on mRNA levels or splicing, but the effect it could have on the protein is not clear. It is possible that the variant could interfere with the THOC6 protein’s secondary structure, resulting in degradation or mislocalization. These possibilities will be examined using a clone containing normal THOC6 and one containing the patients’ missense variant. These are being created and transfected into cell lines by collaborators in Ottawa. Future studies with these could include Western blotting to determine changes in THOC6 protein levels and in situ hybridization to determine localization. These studies may not detect a change, as the variant may instead interfere with THOC6 protein interactions; co-precipitation assays could be used to determine changes in interactions.

Knockdown of THOC6 has been shown to result in accumulation of HSP70 mRNAs in the nucleus (Katahira et al., 2009). Using the cultured lymphocytes from two patients and an unaffected sibling, nuclear and cytoplasmic levels of HSP70 mRNAs could be compared both with and without heat shock. This would determine whether the ratios of nuclear to cytoplasmic HSP70 mRNAs are significantly greater in the patients and could provide an indication of a defect in THO complex-mediated export. The levels of HSP70 proteins could also be examined using an antibody.
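The proposed comparison reduces to a nuclear-to-cytoplasmic (N/C) ratio per sample, averaged by group and condition. The following is a minimal sketch of that arithmetic only. It assumes HSP70 mRNA levels have already been quantified on a common linear scale in each fraction; the function names, the sample layout, and all numeric values are hypothetical placeholders rather than measurements, and no statistical test is included.

```python
from statistics import mean


def nuc_cyto_ratio(nuclear, cytoplasmic):
    """Return the nuclear/cytoplasmic ratio for one paired measurement."""
    if cytoplasmic <= 0:
        raise ValueError("cytoplasmic level must be positive")
    return nuclear / cytoplasmic


def summarize(samples):
    """Return the mean N/C ratio for each (group, condition) pair."""
    grouped = {}
    for s in samples:
        key = (s["group"], s["condition"])
        ratio = nuc_cyto_ratio(s["nuclear"], s["cytoplasmic"])
        grouped.setdefault(key, []).append(ratio)
    return {key: mean(ratios) for key, ratios in grouped.items()}


# Hypothetical placeholder values (arbitrary units), NOT experimental data.
samples = [
    {"group": "patient", "condition": "heat_shock", "nuclear": 8.0, "cytoplasmic": 2.0},
    {"group": "patient", "condition": "heat_shock", "nuclear": 6.0, "cytoplasmic": 2.0},
    {"group": "sibling", "condition": "heat_shock", "nuclear": 2.0, "cytoplasmic": 4.0},
]

summary = summarize(samples)
# Fold difference in mean N/C ratio, patients relative to the unaffected sibling.
fold_difference = summary[("patient", "heat_shock")] / summary[("sibling", "heat_shock")]
```

A fold difference well above 1 after heat shock would be in line with nuclear retention of HSP70 mRNA, but with two patients and one sibling any real comparison would need replicate measurements and an appropriate significance test.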

Studies in model organisms may provide a way to indicate pathogenicity without requiring detailed knowledge of the mechanism of THOC6 function or of the effect of the missense variant on the protein. THOC6 is conserved from Drosophila to human. Zebrafish is a model system that would allow for examining organs expected to be affected, such as the heart, with relatively rapid knockdown and rescue test methods.

In zebrafish, the entire THO complex family has been annotated and gene duplications have not been noted. A zebrafish thoc2 knockout showed an obvious phenotype, and thoc6 has been shown by in situ hybridization to have the expected widespread expression pattern in zebrafish. The human and zebrafish proteins share about 60% identity with coverage of 96%; the residue affected by the variant is conserved and is located within a fairly conserved region (Figure 25). Knockdown of thoc6 could be performed using a morpholino targeted to the translation start site and the resulting phenotype noted. If a phenotype is observed, human THOC6 mRNA obtained from the clones, including the normal transcript and the transcript containing the variant, could be used for rescue. The morpholino would be designed such that it binds to zebrafish thoc6 mRNA, but not to human. If an injection of the morpholino along with the normal human mRNA rescues the phenotype to a greater extent than does the morpholino along with the variant human mRNA, this would provide evidence that this THOC6 variant is capable of being pathogenic (for an example see Chiang et al., 2006).

Figure 25. Alignment of human THOC6 with zebrafish THOC6 in the region surrounding the missense variant. The full-length proteins are about 60% identical. The residue affected by the variant (highlighted) is conserved between them and lies in a fairly conserved region.
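Percent identity and coverage figures such as those quoted for the human and zebrafish proteins can be computed from a gapped pairwise alignment. The sketch below assumes one common convention (identical columns divided by alignment length, and aligned query residues divided by full query length); the alignment tool and the exact definitions used for the figures above are not stated here, so this is illustrative only. The function name and the toy sequences are hypothetical and are not THOC6 sequence.

```python
def identity_and_coverage(aligned_query, aligned_subject, query_length):
    """Return (percent identity, percent query coverage) for a gapped alignment.

    Assumed convention: identity = identical columns / alignment length;
    coverage = query residues inside the alignment / full query length.
    Gaps are represented by '-'.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query or query_length <= 0:
        raise ValueError("alignment and query length must be non-empty")

    identical = 0
    query_residues = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q != "-":
            query_residues += 1
            if q == s:
                identical += 1

    identity = 100 * identical / len(aligned_query)
    coverage = 100 * query_residues / query_length
    return identity, coverage


# Toy alignment with made-up residues, for illustration only.
identity, coverage = identity_and_coverage("MKT-LLVA", "MKSALL-A", query_length=10)
```

Other conventions (for example, excluding gapped columns from the denominator) give somewhat different percentages, so the convention should be reported alongside the value.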

CONCLUSION

A homozygosity mapping approach was undertaken to map a novel neurodevelopmental, autosomal recessive condition in the Hutterite population. A single locus was identified on 16p by 10K and 50K SNP microarrays performed on the patients. Microsatellite markers were used to haplotype all family members, confirming the identified locus and refining it to a region of 5.1 Mb. One hundred genes were chosen for sequencing based on prioritization by data-mining and subsequently by microarray expression analysis. A possibly pathogenic missense variant in THOC6 was identified in the patients. If pathogenicity is confirmed, this will be the first human mutation affecting the THO/TREX complex and may provide a link between export of heat shock protein mRNA, apoptosis, and a neurodevelopmental disorder.
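The core idea of homozygosity mapping, finding stretches of consecutive markers at which all affected individuals are homozygous for the same allele, can be sketched as follows. This is a simplified illustration and not the analysis pipeline used in this study: the genotype coding ("AA", "AB", "BB"), the function name, the marker threshold, and the toy data are all assumptions, and real analyses must also handle missing calls and genotyping error.

```python
def shared_homozygous_runs(positions, genotypes_by_patient, min_markers=3):
    """Return (start_pos, end_pos, n_markers) for each run of consecutive markers
    at which every patient is homozygous for the same allele."""
    patients = list(genotypes_by_patient.values())
    n = len(positions)
    if not patients or any(len(g) != n for g in patients):
        raise ValueError("each patient needs one genotype per marker")

    runs = []
    start = None  # index of the first marker in the current shared run
    for i in range(n):
        calls = {g[i] for g in patients}
        shared = len(calls) == 1 and next(iter(calls)) in ("AA", "BB")
        if shared and start is None:
            start = i
        elif not shared and start is not None:
            if i - start >= min_markers:
                runs.append((positions[start], positions[i - 1], i - start))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and n - start >= min_markers:
        runs.append((positions[start], positions[-1], n - start))
    return runs


# Toy data: marker positions in bp and made-up genotypes for two patients.
positions = [100, 200, 300, 400, 500, 600, 700]
genotypes = {
    "patient_1": ["AA", "AB", "BB", "BB", "AA", "AB", "AA"],
    "patient_2": ["AA", "AA", "BB", "BB", "AA", "AA", "BB"],
}
runs = shared_homozygous_runs(positions, genotypes)
```

On the toy data the only run meeting the three-marker threshold spans positions 300 to 500. A candidate region found this way would still need confirmation, for example by haplotyping with additional markers.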

ONLINE RESOURCES

Database of Genomic Variants projects.tcag.ca/variation

Endeavour Gene Prioritization program homes.esat.kuleuven.be/~bioiuser/endeavour/index.php

Flybase flybase.org

Flybase splice analysis software fruitfly.org/seq_tools/splice.html

GeneWanderer compbio.charite.de/genewanderer/GeneWanderer

Genome browser genome.ucsc.edu

Genomic database including Genbank, OMIM, and UniGene ncbi.nlm.nih.gov

Jackson Labs mouse database jax.org

Mouse embryo in situs genepaint.org

Polyphen coding variant analysis software genetics.bwh.harvard.edu/pph

Wormbase wormbase.org

Zebrafish database zfin.org

REFERENCES

Aerts S., Lambrechts D., Maity S., Van Loo P., Coessens B., De Smet F., Tranchevent L.C., De Moor B., Marynen P., Hassan B., Carmeliet P., and Moreau Y. (2006) Gene prioritization through genomic data fusion. Nature Biotechnology 24: 537-544

Alagramam K., Yuan H., Kuehn M., Murcia C., Wayne S., Srisailpathy C.R., Lowry R., Knaus R., Laer L.V., Bernier F., Schwartz S., Lee C., Morton C., Mullins R., Ramesh A., Camp G.V., Hageman G., Woychik R., Smith R., and Hagemen G. (2001) Mutations in the novel protocadherin PCDH15 cause Usher Syndrome Type 1F. Human Molecular Genetics 10:1709-1718

Amsterdam A., Nissen R.M., Sun Z., Swindell E., Farrington S., and Hopkins N. (2004) Identification of 315 genes essential for early zebrafish development. Proceedings of the National Academy of Sciences 101:12792-12797

Antonarakis S.E. and Beckmann J.S. (2006) Perspective: Mendelian disorders deserve more attention. Nature Reviews Genetics 7:277-282

Arcos-Burgos M. and Muenke M. (2002) Genetics of population isolates. Clinical Genetics 61:233-247

Armistead J., Khatkar S., Meyer B., Mark B.L., Patel N., Coghlan G., Lamont R.E., Liu S., Wiechert J., Cattini P.A., Koetter P., Wrogemann K., Greenberg C.R., Entian K.D., Zelinski T., and Triggs-Raine B. (2009) Mutation of a gene essential for ribosome biogenesis, EMG1, causes Bowen-Conradi syndrome. The American Journal of Human Genetics 84:728-739

Asea A., Rehli M., Kabingu E., Boch J.A., Bare O., Auron P.E., Stevenson M.A., and Calderwood S.K. (2002) Novel signal transduction pathway utilized by extracellular HSP70. Journal of Biological Chemistry 277:15028-15034

Attanasio M., Uhlenhaut N.H., Sousa V.H., O’Toole J.F., Otto E., Anlag K., Klugmann C., Treier A.C., Helou J., Sayer J.A., Seelow D., Nurnberg G., Becker C., Chudley A.E., Nurnberg P., Hildebrandt F., and Treier M. (2007) Loss of GLIS2 causes nephronophthisis in humans and mice by increased apoptosis and fibrosis. Nature Genetics 39:1018-1024

Badge R.M., Yardley J., Jeffreys A.J., and Armour J.A.L. (2000) Crossover breakpoint mapping identifies a subtelomeric hotspot for male meiotic recombination. Human Molecular Genetics 9:1239-1244

Baralle D. and Baralle M. (2005) Splicing in action: Assessing disease causing sequence changes. Journal of Medical Genetics 42:737-748

Benson G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27:573-580

Berkovic S.F., Dibbens L.M., Oshlack A., Silver J.D., Katerelos M., Vears D.F., Lullmann-Ranch R., Blanz F., Zhang K.W., Stankovich J., Kalnins R.M., Dowling J.P., Andermann E., Andermann F., Faldini E., D’Hooge R., Vadlamudi L., Macdonell R.A., Hodgson B.L., Bayly M.A., Savige J., Mulley J.C., Smyth G.K., Power D.A., Saftig P., and Bahlo M. (2008) Array-based gene discovery with three unrelated subjects shows SCARB2/LIMP-2 deficiency causes myoclonus epilepsy and glomerulosclerosis. The American Journal of Human Genetics 82: 673-684

Bounkari O.E., Guria A., Klebba-Faerber S., Claussen M., Pieler T., Griffiths J.R., Whetton A.D., Koch A., and Tamura T. (2009) Nuclear localization of the pre-mRNA associating protein Thoc7 depends upon its direct interaction with Fms tyrosine kinase interacting protein (FMIP). FEBS Letters 583:13-18

Boycott K.M., Beaulieu C., Puffenberger E.G., McLeod D.R., Parboosingh J.S., and Innes A.M. A novel autosomal recessive malformation syndrome associated with developmental delay and distinctive facies maps to 16ptel in the Hutterite population. (Submitted for review to American Journal of Medical Genetics)

Boycott K.M., Flavelle S., Bureau A., Glass H.C., Fujiwara T.M., Wirrell E., Davey K., Chudley A.E., Scott J.N., McLeod D.R., and Parboosingh J.S. (2005) Homozygous deletion of the very low density lipoprotein receptor gene causes autosomal recessive cerebellar hypoplasia with cerebral gyral simplification. American Journal of Human Genetics 70:553-672

Boycott K.M., Parboosingh J.S., Chodirker B.N., Lowry R.B., McLeod D.R., Morris J., Greenberg C.R., Chudley A.E., Bernier F.P., Midgley J., Moller L.B., and Innes A.M. (2008) Clinical genetics and the Hutterite population: A review of mendelian disorders. American Journal of Medical Genetics Part A 146A:1088-1098

Brinkman R.R., Dube M.P., Rouleau G.A., Orr A.C., and Samuels M.E. (2006) Human monogenic disorders-a source of novel drug targets. Nature Reviews Genetics 7:249-260

Broman K.W. and Weber J.L. (1999) Long homozygous chromosomal segments in reference families from the Centre d’Etude du Polymorphisme Humain. American Journal of Human Genetics 65: 1493-1500

Buchanan A.V., Sholtis S., Richtsmeier J., and Weiss K.M. (2009) What are genes “for” or where are traits “from”? What is the question? BioEssays 31:198-208

Carmody S.R. and Wente S.R. (2009) mRNA nuclear export at a glance. Journal of Cell Science 122:1933-1937

Chandler K.E., Del Rio A., Rakshi K., Springell K., Williams D.K., Stoodley N., Woods C.G., and Pilz D.T. (2006) Leucodysplasia, microcephaly, cerebral malformation (LMC): a novel recessive disorder linked to 2p16. Brain 129: 272-277

Chavez S., Beilharz T., Rondon A.G., Erdjument-Bromage H., Tempst P., Svejstrup J.Q., Lithgow T., and Aguilera A. (2000) A protein complex containing Tho2, Hpr1, Mft1 and a novel protein Thp2, connects transcription elongation with mitotic recombination in Saccharomyces cerevisiae. The EMBO Journal 19:5824-5834

Cheng H., Dufu K., Lee C.S., Hsu J.L., Dias A., and Reed R. (2006) Human mRNA export machinery recruited to the 5’end of mRNA. Cell 127:1389-1400

Chen C.P. and Aplin J.D. (2003) Placental extracellular matrix: Gene expression, deposition by placental fibroblasts and the effect of oxygen. Placenta 24: 316-325

Chiang A., Beck J., Yen H., Tayeh M., Scheetz T., Swiderski R., Nishimura D., Braun T., Kim K., Huang J., Elbedour K., Carmi R., Slusarski D., Casavant T., Stone E., and Sheffield V. (2006) Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences 103:6287-6292

Davey K.M., Parboosingh J.S., McLeod D.R., Chan A., Casey R., Ferreira P., Snyder F.F., Bridge P.J., and Bernier F.P. (2006) Mutation of DNAJC19, a human homologue of yeast inner mitochondrial membrane co-chaperones, causes DCMA syndrome, a novel autosomal recessive Barth syndrome-like condition. Journal of Medical Genetics 43:385-393

Fiori L.M., Zouk H., Himmelman C., and Turecki G. (2009) X chromosome and suicide. Molecular Psychiatry (In press)

Frosk P., Weiler T., Nylen E., Sudha T., Greenberg C., Morgan K., Fujiwara M., and Wrogemann K. (2002) Limb-girdle muscular dystrophy type 2H associated with mutation in TRIM32, a putative E3-ubiquitin-ligase gene. American Journal of Human Genetics 70:663-672

Garcia-Rubio M., Chavez S., Huertas P., Tous C., Jimeno S., Luna R., and Aguilera A. (2008) Different physiological relevance of yeast THO/TREX subunits in gene expression and genomic integrity. Molecular Genetics and Genomics 279:123-132

Gasparri F., Sola F., Locatelli G., and Muzio M. (2004) The death domain protein p84N5, but not the short isoform p84N5s, is cell cycle-regulated and shuttles between the nucleus and the cytoplasm. FEBS Letters 574:13-19

Griffith E., Walker S., Matin C.A., Vagnarelli P., Stiff T., Vernay B., Al Sanna N., Saggar A., Hamel B., Earnshaw W.C., Jeggo P.A., Jackson A.P., and O’Driscoll M. (2008) Mutations in pericentrin cause Seckel syndrome with defective ATR-dependent DNA damage signaling. Nature Genetics 40: 232-236

Grubenmann C.E., Frank C.G., Hulsmeier A.J., Schollen E., Matthijs G., Mayatepek E., Berger E.G., Aebi M., and Hennet T. (2004) Deficiency of the first mannosylation step in the N-glycosylation pathway causes congenital disorder of glycosylation type 1k. Human Molecular Genetics 5:532-542

Hamada T., Igarashi H., Taguchi R., Fujiwara M., Fukao Y., Shimmen T., Yokota E., and Sonobe S. (2009) The putative RNA-processing protein, THO2, is a microtubule-associated protein in tobacco. Plant and Cell Physiology 50:801-811

Hamiel C.R., Pinto S., Hau A., and Wischmeyer P.E. (2009) Glutamine enhances heat shock protein 70 expression via increased hexosamine biosynthetic pathway activity. American Journal of Physiology-Cell Physiology (In press)

Herold A., Teixeira L., and Izaurralde E. (2003) Genome-wide analysis of nuclear mRNA export pathways in Drosophila. The EMBO Journal 22:2472-2483

Hostetler J.A. (1985) History and relevance of the Hutterite population for genetic studies. American Journal of Medical Genetics 22:453-462

Hirawat S., Welch E.M., Elfring G.L., Northcutt V.J., Paushkin S., Hwang S., Leonard E.M., Almstead N.G., Ju W., Peltz S.W., and Miller L.L. (2007) Safety, Tolerability, and Pharmacokinetics of PTC124, a nonaminoglycoside nonsense mutation suppressor, following single- and multiple-dose administration to healthy male and female adult volunteers. The Journal of Clinical Pharmacology 47:430-444

Kabani M. and Martineau C.N. (2008) Review: Multiple Hsp70 isoforms in the eukaryotic cytosol: Mere redundancy or functional specificity? Current Genomics 9:338-348

Kamiya A., Tan P.L., Kubo K.I., Engelhard C., Ishizuka K., Kubo A., Tsukita S., Pulver A.E., Nakajima K., Cascella N.G., Katsanis N., and Sawa A. (2008) PCM1 is recruited to the centrosome by the cooperative action of DISC1 and BBS4 and is a candidate for psychiatric illness. Archives of General Psychiatry 65:996-1006

Katahira J., Inoue H., Hurt E., and Yoneda Y. (2009) Adaptor Aly and co-adaptor Thoc5 function in the Tap-p15-mediated nuclear export of HSP70 mRNA. The EMBO Journal 20:556-567

Kennedy G.C., Matsuzaki H., Dong S., Liu W., Huang J., Liu G., Su X., Cao M., Chen W., Zhang J., Liu W., Yang G., Di X., Ryder T., He Z., Surti U., Phillips M.S., Boyce-Jacino M.T., Fodor S.P.A., and Jones K.W. (2003) Large-scale genotyping of complex DNA. Nature Biotechnology 21: 1233-1237

Kim S.C., Stice J.P., Chen L., Jung J.S., Gupta S., Wang Y., Baumgarten G., Trial J., and Knowlton A.A. (2009) Extracellular heat shock protein 60, cardiac myocytes, and apoptosis. Circulation Research (In press)

Khlebodarova T.M. (2002) Review: How cells protect themselves against stress? Russian Journal of Genetics 38:345-358

Kohler S., Bauer S., Horn D., and Robinson P. (2008) Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics 82: 1-10

Kruglyak L. (1997) The use of a genetic map of biallelic markers in linkage studies. Nature Genetics 17:21-24

Kuivenhoven J., Weibusch H., Pritchard P., Funke H., Benne R., Assmann G., and Kastelein J. (1996) An intronic mutation in a lariat branchpoint sequence is a direct cause of an inherited human disorder (Fish-Eye Disease). Journal of Clinical Investigation 98:358-364

Lander E.S. and Botstein D. (1987) Homozygosity mapping: A way to map recessive traits with the DNA of inbred children. Science 236:1567-1570

Libri D., Dower K., Boulay J., Thomsen R., Rosbash M., and Jensen T.H. (2002) Interactions between mRNA export commitment, 3’-end quality control, and nuclear degradation. Molecular and Cellular Biology 22:8254-8266

Li Y., Wang X., Zhang X., and Goodrich D.W. (2005) Human hHpr1/p84/Thoc1 regulates transcription elongation and physically links RNA Polymerase II and RNA Processing Factors. Molecular and Cellular Biology 25:4023-4033

Liu J., Liu Y., Hartley D., Klaassen C.D., Shehin-Johnson S.E., Lucas A., and Cohen S.D. (1999) Metallothionein-I/II knockout mice are sensitive to acetaminophen-induced hepatotoxicity. The Journal of Pharmacology and Experimental Therapeutics 289:580-586

Masuda S., Das R., Cheng H., Hurt E., Dorman N., and Reed R. (2005) Recruitment of the human TREX complex to mRNA during splicing. Genes and Development 19:1512-1517

Nousiainen H.O., Kestila M., Pakkasjarvi N., Honkala H., Kuure S., Tallila J., Vuopala K., Ignatius J., Herva R., and Peltonen L. (2008) Mutations in mRNA export mediator GLE1 result in a fetal motoneuron disease. Nature Genetics 40:155-157

Nimgonkar V.L., Fujiwara T.M., Dutta M., Wood J., Gentry K., Maendel S., Morgan K., Eaton J. (2000) Low prevalence of psychoses among the Hutterites, an isolated religious community. American Journal of Psychiatry 157:1065-1070

Peltonen L. (2000) Positional cloning of disease genes: Advantages of genetic isolates. Human Heredity 50: 66-75

Puffenberger E.G., Hu-Lince D., Parod J.M., Craig D.W., Dobrin S.E., Conway A.R., Donarum E.A., Strauss K.A., Dunckley T., Cardenas J.F., Melmed K.R., Wright C.A., Liang W., Stafford P., Flynn C.R., Morton D.H., and Stephan D.A. (2004) Mapping of sudden infant death with dysgenesis of the testes syndrome (SIDDT) by a SNP genome scan and identification of TSPYL loss of function. Proceedings of the National Academy of Sciences 101:11689-11694

Reed R. and Cheng H. (2005) Review: TREX, SR proteins, and export of mRNA. Current Opinion in Cell Biology 17:269-273

Rehwinkel J., Herold A., Geri K., Kocher T., Rode M., Ciccarelli F.L., Wilm M., and Izaurralde E. (2004) Genome-wide analysis of mRNAs regulated by the THO complex in Drosophila melanogaster. Nature Structural and Molecular Biology 11:558-566

Rougemaille M., Dieppois G., Kisseleva-Romanova E., Gudipati R.K., Lemoine S., Blugeaon C., Boulay J., Jensen T.H., Stutz F., Devaux F., and Libri D. (2008) THO/Sub2p functions to coordinate 3’end processing with gene-nuclear pore association. Cell 135:308-321

Sheffield V.C., Nishimura D.Y., and Stone E.M. (1995) Novel approaches to linkage mapping. Current Opinion in Genetics and Development 5:335-341

Stevenson R.E. and Hall J.G. (2005) Human Malformations and Related Anomalies 2nd edition. Oxford University Press.

Strauss K.A., Puffenberger E.G., Bunin N., Rider N.L., Morton M.C., Eastman J.T., and Morton D.H. (2008) Clinical application of DNA microarrays: Molecular diagnosis and HLA matching of an Amish child with severe combined immune deficiency. Clinical Immunology 128:31-38

Strasser K., Masuda S., Mason P., Pfannstiel J., Oppizzi M., Rodriguez-Navarro S., Rondon A.G., Aguilera A., Struhl K., Reed R., and Hurt E. (2002) TREX is a conserved complex coupling transcription with messenger RNA export. Nature 417:304-308

Vahteristo P., Kokko A., Saksela O., Aittomaki K., and Aaltonen L.A. (2007) Blood derived gene expression profiling in unraveling recessive disease susceptibility. Journal of Medical Genetics 44:718-720

Walker S.J., Segal J., and Aschner M. (2006) Cultured lymphocytes from autistic children and non-autistic siblings up-regulate heat shock protein RNA in response to thimerosal challenge. Environment and Neurodevelopmental Disorders 27:685-692

Wang X., Chang Y., Li Y., Zhang X., and Goodrich D.W. (2006) Thoc1/Hpr1/p84 is essential for early embryonic development in the mouse. Molecular and Cellular Biology 26:4362-4367

Wang X., Chinnam M., Wang J., Wang Y., Zhang X., Marcon E., Moens P., and Goodrich D.W. (2009) Thoc1 deficiency compromises gene expression necessary for normal testis development in the mouse. Molecular and Cellular Biology 29:2794-2803

Wang X., Li Y., Zhang X., and Goodrich D.W. (2007) An allelic series for studying the mouse Thoc1 gene. Genetics 44:32-37

Wiederholt T., Poblete-Gutierez P., Gardlo K., Goerz G., Bolsen K., Merk H.F., and Frank J. (2007) Identification of mutations in the uroporphyrinogen III cosynthase gene in German patients with congenital erythropoietic porphyria. Physiological Research 55:S85-92

Wheeler T.M., Sobczak K., Lueck J.D., Osborne R.J., Lin X., Dirksen R.T., and Thornton C.A. (2009) Reversal of RNA dominance by displacement of protein sequestered on triplet repeat RNA. Science 325:336-339

Wu X., Jiang R., Zhang M.Q., and Li S. (2008) Network-based global inference of human disease genes. Molecular Systems Biology 4:189

Zhou Z., Sun X., Lambert J.C., Saari J.T., and Kang Y.J. (2002) Metallothionein-independent zinc protection from alcoholic liver injury. American Journal of Pathology 160:2267-2274