GENOMIC AND REGULATORY NETWORK DIVERSITY

REVEALED BY REST/NRSF, MALTASE GLUCOAMYLASE

AND THE PROTOCADHERIN CLUSTER

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF GENETICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Sarah Theresa Kerfoot Garcia

March 2010


© 2010 by Sarah Theresa Kerfoot Garcia. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/mn338zt6208

Includes supplemental files: 1. (Sarah Garcia Supplement 1.pdf)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Richard Myers, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Gregory Barsh

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Anne Brunet

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Timothy Stearns

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

This thesis details experiments on three genes or gene families: the RE1-silencing transcription factor, also known as neuron-restrictive silencer factor (REST/NRSF), a repressor of neuronal genes; maltase-glucoamylase (MGAM), the intestinal enzyme responsible for the breakdown of starch; and the neuronally expressed protocadherin gene family (PCDH), whose function is unknown. The research I have completed is tied together by one fundamental question: how is diversity achieved? On the evolutionary scale, I have investigated the functional consequences of mutation and how these functional changes may have impacted the evolution of diversity. On the cellular scale, I have sought to understand how diversity is generated by attempting to define the binding networks of proteins important in the determination of cell fate, in the case of REST/NRSF, and in the determination of specific neuronal identities, the predicted function of the protocadherins.

REST/NRSF is canonically known as a master repressor whose function is to establish and maintain the repression of neuronal genes in nonneuronal cells. While a handful of REST/NRSF target genes of neuronal function have been extensively studied, there is little consensus on the entire repertoire of REST/NRSF targets. I participated in the development of a set of software tools that, for transcription factors with known binding sites, allows the identification and characterization of binding sites and their potential target genes. We identified 660 putative target genes for REST/NRSF and demonstrated that this set is enriched for genes of known neuronal function and that the expression of REST/NRSF is negatively correlated with predicted target gene expression. We validated a large number of potential binding sites and experimentally determined a threshold value that enriches the set of predicted binding sites for true positive sites.


Maltase-glucoamylase (MGAM) is responsible for starch breakdown and is thought to be especially important in brain development, when glucose demand is high. Other components of starch metabolism are known to have common variations that affect function, and the high-functioning variants are more common in populations with high-starch diets. Research in our lab revealed a large, common deletion in the MGAM gene with a high allele frequency in Europeans, a population with comparatively increased starch intake. I mapped the endpoints of this deletion and determined that the likely mechanism by which it occurred was recombination between two highly similar repeats within the MGAM gene. I then demonstrated that the deletion has no gross impact on mRNA structure or expression in heterozygous form and is therefore likely to be a neutral mutation.

The protocadherins are a diverse set of neuronally expressed cell-adhesion-like molecules hypothesized to provide the molecular code necessary for the identification of individual neurons for their incorporation into appropriate neural circuits. I demonstrate that gene conversion events occur and have likely functional effects on protocadherin paralogs, a hypothesis that had previously been supported only by evolutionary evidence.

Previous research suggested that endogenous cleavage products of some protocadherins localized to the nucleus and affected gene expression. I verified nuclear localization of the protocadherin paralogs previously published and demonstrated that this nuclear localization is a common feature of all protocadherin paralogs assayed. I found no evidence that the protocadherin fragment was capable of binding DNA directly or of affecting protocadherin expression, but I was unable to determine whether this reflected a lack of these functions in vivo or a failure of my experimental system to detect them.


I conclude that nuclear localization of protocadherin internal fragments is a process common to disparate members of the protocadherin cluster but the functional consequences of this nuclear localization remain unknown.


Acknowledgments

I would like to thank Rick for all of his invaluable help and guidance. I am grateful to the Myers lab, especially Loan Nguyen, for their support and advice in and out of the lab. Thanks to my wonderful daughter Emery for allowing me to write even when Brown Bear, Brown Bear really needed to be read one more time. Finally, I am lucky to have had the love, support, patience and thesis editing skills of my husband, John Garcia.


Table of Contents

ABSTRACT
ACKNOWLEDGMENTS
LIST OF FIGURES
METHODS
CHAPTER 1: INTRODUCTION
    PROTOCADHERIN DIVERSITY
    CHOICE OF PROTOCADHERIN PARALOG EXPRESSION BY INDIVIDUAL NEURONS
    PROTOCADHERIN STRUCTURE AND HOMOPHILIC INTERACTIONS
    CADHERINS IN SYNAPTOGENESIS
    PROTOCADHERIN EXPRESSION AND SUBCELLULAR LOCALIZATION
    LESSONS ON PROTOCADHERIN FUNCTION FROM MODEL ORGANISMS
    PROTOCADHERIN BINDING PARTNERS
    INVESTIGATION INTO PROTOCADHERIN FUNCTION
    REST/NRSF AND NEURONAL FATE DETERMINATION AND MAINTENANCE
    MALTASE-GLUCOAMYLASE POLYMORPHISM AND STARCH DIGESTION
CHAPTER 2: IDENTIFICATION OF CODING POLYMORPHISMS IN THE PROTOCADHERIN CLUSTER
    PROTOCADHERIN-α SEQUENCING REVEALS POLYMORPHISM AND CONVERSION
    POLYMORPHISMS IN THE PCDHβ CLUSTER
    EXPANDED PROTOCADHERIN-α SEQUENCING AND POLYMORPHISM ANALYSIS
CHAPTER 3: THE PROTOCADHERIN CYTOPLASMIC DOMAIN AND FRAGMENT LOCALIZATION
    PROTOCADHERIN CTF2 FRAGMENTS LOCALIZE TO THE NUCLEUS
    STABLE CELL LINES EXPRESSING PROTOCADHERIN CTF2 FRAGMENTS
    EXPLORATION OF TRANSCRIPTIONAL ROLES FOR PROTOCADHERIN FRAGMENTS
    STABLE CELL LINES ARE NON-HOMOGENOUS
    CHIP-SEQ REVEALS NO EVIDENCE OF CTF2-DNA BINDING
CHAPTER 4: PREDICTION AND VALIDATION OF REST/NRSF BINDING SITES
    DEVELOPMENT OF A TRANSCRIPTION FACTOR BINDING SITE PREDICTION TOOL: CISTEMATIC
    PREDICTION OF RE1/NRSE SITES
    VALIDATION OF PREDICTED RE1/NRSE SITES
    CHARACTERIZATION OF PREDICTED RE1/NRSE SITES
CHAPTER 5: A COMMON DELETION IN THE MALTASE GLUCOAMYLASE GENE
    DETERMINATION OF DELETION ENDPOINTS
    MGAM GENOTYPE DOES NOT PREDICT MGAM EXPRESSION LEVELS
CHAPTER 6: CONCLUSIONS
    GENE CONVERSION AND THE EVOLUTION OF PROTOCADHERIN DIVERSITY
    NUCLEAR LOCALIZATION OF PROTOCADHERIN FRAGMENTS
    REST/NRSF BINDING SITE PREDICTION
    MALTASE DELETION MAPPING
WORKS CITED


List of figures

Figure 1: Genomic organization of protocadherin clusters in multiple species
Figure 2: Paralog sequence diversity is negatively correlated with third-position GC content
Figure 3: Hypothetical gene conversion event visualization
Table 1: Table of confirmed Pcdhα single nucleotide polymorphisms
Figure 4: Identified gene conversion event may transfer functional domains between human protocadherin paralogs
Figure 5: Tracking identified SNPs onto known promoter-defined Pcdhα haplotypes
Figure 6: Schematic of protocadherin constructs
Figure 7: Protocadherin CTF2 fragments are expressed by transiently transfected HEK293 cells
Figure 8: Protocadherin CTF2 fragments are not excluded from the nucleus of transiently transfected HT1080 cells
Figure 9: Stable HEK293 cell lines express protocadherin CTF2 fragments
Figure 10: No significant increase in protocadherin paralog expression is observed in the presence of CTF2 expression
Figure 11: Protocadherin promoter CSEs are not enriched by IP of CTF2 fragments
Figure 12: Probe Z-scores are not skewed in the positive direction
Table 2: Top twenty positively enriched regions identified by chIP-chip
Figure 13: Selected positively enriched regions identified by chIP-chip are validated by qPCR in amplified samples
Figure 14: Initially derived stable cell lines show non-homogenous expression and localization of CTF2 fragments
Figure 15: Rederived stable cell lines show non-homogenous expression and localization of CTF2 fragments
Figure 16: Confluency has no effect on CTF2 fragment expression or localization
Figure 17: Proteasome inhibition has no effect on CTF2 expression or localization
Table 3: Table of significant peaks by ChIPSeq
Figure 18: Significant peaks called are also present in the control library
Figure 19: Selection of a threshold and PSFM match score is negatively correlated with promoter activity of REST/NRSF repressed genes
Figure 20: PSFM match score is correlated with fold enrichment by chIP/qPCR
Figure 21: Brain-specific expression of putative REST/NRSF target genes
Figure 22: Deletion endpoint schematic
Figure 23: PCR genotyping correctly predicts array genotyping
Figure 24: Genotype and MGAM expression in the cortex and medulla
Figure 25: Expression of MGAM exons in the cortex


Methods

Samples for SNP identification. We used DNA isolated from lymphoblastoid cell lines derived from members of families having at least two children affected with autism 1, 2, 3. All samples were collected with approval from appropriate institutional review boards and with informed consent from participants. For our expanded sample sets we used the HD50CAU (European), HD50AA (African-American), HD32 (Chinese), HD07 (Japanese), HD13 (Southeast Asian) and HD12 (African) panels from the Coriell Cell Repository.

Protocadherin PCR for sequencing SNPs. PCR reactions were performed in 10 µl with 1 µM of each primer, 1-2 units of AmpliTaq Gold DNA Polymerase (Applied Biosystems), 1X GeneAmp PCR Gold Buffer (Applied Biosystems), 2 mM MgCl2, 5% DMSO, 2.5 mM of each dNTP and 25 ng genomic DNA. Reactions were performed on GeneAmp 9700 thermal cyclers (Applied Biosystems) by denaturing for 10 minutes at 95ºC, followed by 35 cycles of 94ºC for 30 seconds, 58ºC for 3 minutes and 72ºC for 30 seconds, followed by a final extension step of 10 minutes at 72ºC. Primers were designed using Primer3 (http://frodo.wi.mit.edu/primer3/) with the modification of several search parameters4. Primers were synthesized at Invitrogen and were resuspended at a concentration of 10 µM.
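The arithmetic implied by this reaction setup can be sketched in a few lines of Python. This is only an illustrative check, not part of the original protocol: the function names are invented here, the primer calculation assumes simple dilution from the 10 µM stock to 1 µM in a 10 µl reaction, and the run-time estimate counts programmed hold times only, ignoring temperature ramps.

```python
def stock_volume_ul(final_conc_um, final_volume_ul, stock_conc_um):
    """Stock volume satisfying C_stock * V_stock = C_final * V_final."""
    return final_conc_um * final_volume_ul / stock_conc_um


def programmed_minutes(initial_s, cycle_steps_s, cycles, final_s):
    """Total programmed hold time in minutes (temperature ramps ignored)."""
    return (initial_s + cycles * sum(cycle_steps_s) + final_s) / 60


# Values transcribed from the protocol: 1 µM primer in 10 µl from a 10 µM stock.
primer_stock_ul = stock_volume_ul(1.0, 10.0, 10.0)

# 10 min initial denaturation; 35 cycles of 30 s, 3 min, 30 s; 10 min final extension.
hold_minutes = programmed_minutes(10 * 60, [30, 3 * 60, 30], 35, 10 * 60)

print(f"primer stock per reaction: {primer_stock_ul} µl")
print(f"programmed hold time: {hold_minutes} min")
```

The same two helpers apply to the other reaction setups in this section by substituting the stated volumes, concentrations and cycle counts.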

Each protocadherin was amplified using paralog-specific primers. The forward and reverse primers were located in the 5’ and 3’ UTR sequences respectively (see table below for sequences). For the Pcdhβ cluster we performed initial PCR amplification using the 5’ and 3’ UTR sequences and then used a nested PCR strategy to further amplify the 3’ regions. The forward primers for the nested reactions were designed downstream of EC6 (1800 primer). The reverse primer was the same 3’ UTR primer used in the first round of amplification. We could not design 1800 primers for all Pcdhβ paralogs; instead, a universal 1800 primer was designed to amplify those paralogs for which a specific primer could not be designed, with specificity being provided by the 3’ UTR reverse primer. PCR conditions were the same for these amplifications except that 2.5 µl of the original PCR reaction was used as the DNA input.

α1 5’ UTR Forward Primer GGCATGATTTGAAGTAAGAGGAGA
α2 5’ UTR Forward Primer GAAGCAGCAGGACTTTAACAGAGA
α3 5’ UTR Forward Primer TTCAGATACTGCTTTGCTTCATCC
α4 5’ UTR Forward Primer GCACTTGACTGACCGATTAAAAGA
α5 5’ UTR Forward Primer CATTGTGTGGTGATGCAATAGAAA
α6 5’ UTR Forward Primer TGGAAATAAAACCAGAGGTATTTGAC
α7 5’ UTR Forward Primer ACATCGAGATTGAAATGAAGGGAT
α8 5’ UTR Forward Primer GCAGCGGAATTGGATTAAAAGAC
α9 5’ UTR Forward Primer CTTCTAATTTGGAGGCAATTTTCA
α10 5’ UTR Forward Primer AGAAAATGTCAGATCGTATGTGCG
α11 5’ UTR Forward Primer TATTTTGGAAGCCAATTTCGTATG
α12 5’ UTR Forward Primer CAGAAAAGGGTGACTGCTCATAAA
α13 5’ UTR Forward Primer CACTAGGAAGCCATAAAAATTGGG
αC1 5’ UTR Forward Primer GGTGTAGCGTGTTGGTGGAAC
α1 3’ UTR Reverse Primer CCATTCAAGACAAAACTGCCATTA
α2 3’ UTR Reverse Primer ATGATGGAAGTGCCAAAATGTAAA
α3 3’ UTR Reverse Primer CGTTTTAACCACAAAACCACAAAA
α4 3’ UTR Reverse Primer AAATTTTAACCCAGGGAAATTTGA
α5 3’ UTR Reverse Primer GCAGAAAAGCCTCATAAAAAGCAT
α6 3’ UTR Reverse Primer GCAAAGAGATTAAAATAAGAAACACAA
α7 3’ UTR Reverse Primer TACGCATCCCAAAACAAAAGTGTA
α8 3’ UTR Reverse Primer CAAGATAGACCCAACTTGTGGAGA
α9 3’ UTR Reverse Primer ACTAGAAGTCAAGACAAAGTCGGC
α10 3’ UTR Reverse Primer AAGTACCAATCAAAGGCTGACACA
α11 3’ UTR Reverse Primer GGACATTTATTGGTTAAAGAAGCCA
α12 3’ UTR Reverse Primer CGGAAGTTCTTCAAGGAAACAAAT
α13 3’ UTR Reverse Primer AAACAACTGCAAGGAAGGCTAAAG
αC1 3’ UTR Reverse Primer CACAGGAAGTGCTAAGAAGGAGTG
β1 5’ UTR Forward Primer GCAGTAACCTGTTGCAGAAAAGTG
β2 5’ UTR Forward Primer GATACGGGGAGATAGAGTTAGCGA
β3 5’ UTR Forward Primer AAAGACTGTTTCCCAGCTCTGTCT
β4 5’ UTR Forward Primer ATTGAATGTCTCAAGTCTCGTTGC
β5 5’ UTR Forward Primer TATTTATGCAAATCATCTGGGTGG
β6 5’ UTR Forward Primer GCTGAAAAGGATTTTTCTTCCGTA
β7 5’ UTR Forward Primer ATTTCAAAGGATTCCGCTGCT
β8 5’ UTR Forward Primer GCCTCAGATACTGGGGACTTTACA
β8a 5’ UTR Forward Primer AAGCTTCTGTGAACCAACTTTTCA
β9 5’ UTR Forward Primer GATTGCTATTTGTGCTGGGGCAGTG
β10 5’ UTR Forward Primer TGTGGCTGTAACCAACTAGGAAAT
β11 5’ UTR Forward Primer ATCCTTAGACCACAGAGGATTTGG
β12 5’ UTR Forward Primer GATTTGGTAGACAGATCAGAGGCA
β13 5’ UTR Forward Primer GCCTCAGATACTGGGGACTTTACA
β14 5’ UTR Forward Primer TGAGCTTCAGTTTTTCCACAAGAG
β15 5’ UTR Forward Primer AAAGGGGCGTGTACAGAAGTAAAG
β1 1800 Forward Primer GCCACTGACCTTGGGTTATTTTC
β7 1800 Forward Primer AACGCCTGGCTGTCGTACC
β8 1800 Forward Primer TACCAGCTGCTCAAGGCCAC
β8a 1800 Forward Primer CAGAATGCCTGGCTGTCGTA
β universal 1800 Forward Primer TGTCGTACCAGCTGCTCAAGG
β1 3’ UTR Reverse Primer TAGCCTCACTCTAGGTTTCCCATT
β2 3’ UTR Reverse Primer GGACTTTCCACAAATTAAACGAGA
β3 3’ UTR Reverse Primer TAAGCCATCCCTGGATTTGATATT
β4 3’ UTR Reverse Primer GGGACAAATTTAGCTTTATCAGGG
β5 3’ UTR Reverse Primer TTTTTGTCCATGACACTCTCGAAT
β6 3’ UTR Reverse Primer GCATGAACAGGAAATACTAAGATACCA
β7 3’ UTR Reverse Primer TTAAGGGAACCTCGCAATAAGTTC
β8 3’ UTR Reverse Primer GTTGACAGAAAACCATTCCTTTGA
β8a 3’ UTR Reverse Primer AATACCTTATTGGTGGTGGTGAGC
β9 3’ UTR Reverse Primer ATTACTGGGGAGTTCTTGAACCTG
β10 3’ UTR Reverse Primer ATGAAACATTGTAAAACCACAGGG
β11 3’ UTR Reverse Primer AAAAATTGGGGGAGAAAAGAAAAA
β12 3’ UTR Reverse Primer TTACAGGCACCTCACGATAAGTTC
β13 3’ UTR Reverse Primer CACAAATTGGGGGAAATAAACATT
β14 3’ UTR Reverse Primer CATTTCCACGTCATTATAGCCACT
β15 3’ UTR Reverse Primer CAATCAGACAAAGGAAATGAGTAACA


Protocadherin ectodomain sequencing. After DNA amplification, unincorporated dNTPs and primers were digested with 40 units of exonuclease I (New England Biolabs) and 20 units of shrimp alkaline phosphatase (Amersham Life Science). Samples were incubated for 1 hour at 37ºC, followed by inactivation of the enzymes at 75ºC for 15 minutes. PCR products were sequenced in both directions using primers specific for the ectodomains of each protocadherin-α paralog. Sequencing primers for Pcdhα ectodomain 1 consisted of the same forward primer used for amplification, located in the 5’ UTR, and a reverse primer located in the linker sequence between EC1 and EC2. Sequencing primers for Pcdhα ectodomain 6 consisted of the same reverse primer used for amplification, located in the 3’ UTR, and a forward primer located in the linker sequence between EC5 and EC6. All primers were specific for the paralog being assayed. All other ectodomain sequencing primers for Pcdhα were designed to be present in the upstream and downstream linker sequences for the ectodomain being assayed. Sequencing primers for Pcdhβ consisted of one of two forward sequences. The forward primer used for sequencing from the full-length amplified DNA was designed in EC4 (1200 primer). The same forward primer (1800) was used for sequencing all of the 3’ amplified sequences. The reverse primer was the same 3’ UTR primer used in amplification.
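Because the reverse primer for one ectodomain and the forward primer for the next both sit in the shared linker, the two can be compared directly. The short Python sketch below does this for one pair taken from the primer tables in this section (α1 EC1 reverse and α1 EC2 forward). The helper name and variable names are illustrative only, and the check says nothing about pairs other than the one shown.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Sequences copied from the primer tables in this section.
alpha1_ec2_forward = "GATAATCCACCCGTCTTCAGG"
alpha1_ec1_reverse = "CCTGAAGACGGGTGGATTATC"

# True when both primers anneal to the same linker site on opposite strands.
same_site = reverse_complement(alpha1_ec2_forward) == alpha1_ec1_reverse
print(same_site)
```

A comparison like this is also a cheap way to catch mislabeled rows when a primer table is transcribed by hand.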

Sequencing was performed in a 10 µl volume with 2 µl of exonuclease-treated PCR product, 0.2 µM of sequencing primers, 2 µl ABI PRISM BigDye Terminator Ready Reaction Mix containing AmpliTaq DNA Polymerase (Applied Biosystems) and 2 µl 2.5X ABI Sequencing Buffer I (Applied Biosystems). Sequencing reactions were carried out in GeneAmp 9700 Thermocyclers (Applied Biosystems) by denaturing for 5 minutes at 94ºC, followed by 25 cycles of 10 seconds at 96ºC, 5 seconds at 50ºC, and 4 minutes at 60ºC. Excess dye-terminators were removed by isopropanol precipitation followed by ethanol precipitation. Products were resuspended in 2 µl loading buffer (5:1 deionized formamide, 25 mM EDTA, 50 mg/ml Blue Dextran), heated at 95ºC and loaded onto the Applied Biosystems 377 or 3700 DNA sequencers. Sequence traces were evaluated using Phred, Phrap, Polyphred, and Consed 5-8. Polyphred was used to flag heterozygous SNPs. Each SNP was visually verified by inspecting the trace to ensure that the SNP was present on both forward and reverse reads.

α1 EC2 Forward Primer GATAATCCACCCGTCTTCAGG
α1 EC3 Forward Primer ACTGTTTGACCAGGCCGTAT
α1 EC4 Forward Primer AACTGGCGGTCACTTCATT
α1 EC5 Forward Primer GAATGACAACGCGCCTG
α1 EC6 Forward Primer GAGAACGACAACGCGCC
α2 EC2 Forward Primer GCCAATATTTCCAATGACAGTAAAG
α2 EC3 Forward Primer CAACTTTTGCCCAATCAGTTTAC
α2 EC4 Forward Primer CACACCAGAAGTCTCAATAACGTC
α2 EC5 Forward Primer CGTTCGCACAGCCTGAGTA
α2 EC6 Forward Primer GAGAACGACAACGCGCC
α3 EC2 Forward Primer CCAGTTTTTCCAATGGCTGTA
α3 EC3 Forward Primer AGCGTTTGAGAGGACGATCTAT
α3 EC4 Forward Primer CATCAATGATAATGTACCTGAGTTAG
α3 EC5 Forward Primer GTGAACGACAATGCGCC
α3 EC6 Forward Primer GCACTGCTGATGCCTCG
α4 EC2 Forward Primer TGTTCCCAGCAACACAAAAG
α4 EC3 Forward Primer CCAGCTTTTGACAGAACCATTTA
α4 EC4 Forward Primer AACAACGATAATGTCCCAGATTT
α4 EC5 Forward Primer ATGTGAACGACAACGCTCC
α4 EC6 Forward Primer GAAAACGACAACGCGCC
α5 EC2 Forward Primer CAGGTTCTCCAGACAAGAACAA
α5 EC3 Forward Primer ATGCTAATGATAACGCCCC
α5 EC4 Forward Primer AATGATAATACCCCAGAGATGGC
α5 EC5 Forward Primer ACGTGAACGACAACGCTCC
α5 EC6 Forward Primer CTGGTGCCTCGAGTGGG
α6 EC2 Forward Primer CCTTGTTCCCGGTAGAGGA
α6 EC3 Forward Primer TGAATGATAATGCTCCCACTTTC
α6 EC4 Forward Primer TGATAACGTCCCTGAGATAGCAC
α6 EC5 Forward Primer ACATGAATGACAATGCTCCG
α6 EC6 Forward Primer CCTCGGGTGGGTGGTACT
α7 EC2 Forward Primer GTGTTCCCAGCGACACAAA
α7 EC3 Forward Primer GTGTTCGACAGAACCCTGTAT
α7 EC4 Forward Primer CTCCACAGTTGACTCTCACTTCC
α7 EC5 Forward Primer ACGTGAACGACAACGCC
α7 EC6 Forward Primer GAGAACGACAACGCGCC
α8 EC2 Forward Primer AGTGTTCCGGGTAAAAGACCA
α8 EC3 Forward Primer TGAATGATAATGCTCCCACTTTC
α8 EC4 Forward Primer TGATAACGTCCCTGAGATAGCAC
α8 EC5 Forward Primer ACGTGAACGACAATGCTCC
α8 EC6 Forward Primer CACTGCTGGAGCCTCGG
α9 EC2 Forward Primer AGTGTTCCCAGCGACACAAA
α9 EC3 Forward Primer GTGTTCGACAGAACCCTGTAT
α9 EC4 Forward Primer TGCTCCACAGTTGACTATCAAAA
α9 EC5 Forward Primer GTGAACGACAACGCACCA
α9 EC6 Forward Primer CTGCTGACACCTCGGATGA
α10 EC2 Forward Primer GCCCAGGTTCTCCGTAACA
α10 EC3 Forward Primer ATGCCAATGATAACGCCCCTAT
α10 EC4 Forward Primer TGAAAATGATAATTCACCTGAGG
α10 EC5 Forward Primer GTGAACGACAACGCGCC
α10 EC6 Forward Primer ACGAGAACGACAACGCTCC
α11 EC2 Forward Primer ACATTAACGACAACCCGCC
α11 EC3 Forward Primer CGACAATGATCCAGAGTTTGATA
α11 EC4 Forward Primer ACACCAACGATAACTCTCCTGAA
α11 EC5 Forward Primer GTTCGCACAGCCCGAGTA
α11 EC6 Forward Primer GGGAGGCGCAGTTAACAAG
α12 EC2 Forward Primer GGTGTTCAGAGAAAGGGAACAA
α12 EC3 Forward Primer GCGTTTGATAAGCCCAGCTA
α12 EC4 Forward Primer ACAATGTCCCTGAAGTAATGGTT
α12 EC5 Forward Primer ACAATGCGCCTGCGTTC
α12 EC6 Forward Primer GAGAACGACAACGCGCC
α13 EC2 Forward Primer CCATATTCCCTGAAAGCAAGAA
α13 EC3 Forward Primer CGGAATTTTACCAATCCGTTTAT
α13 EC4 Forward Primer CCAGAGGTTACCATCACTTCTTT
α13 EC5 Forward Primer GTGAACGACAACGCGCC
α13 EC6 Forward Primer ACGAGAACGACAACGCTCC
αc1 EC2 Forward Primer ACACCAATGACAACTCACCTCTC
αc1 EC3 Forward Primer ACAACGCGCCTGTATTTGAG


αc1 EC4 Forward Primer CGAACTGGACTTCCTGACTCTT
αc1 EC5 Forward Primer ATACACCAAACTTTCCTCAACCC
αc1 EC6 Forward Primer TATCCGGTTATCTTGTTTCCCTT
αC2 EC2 Forward Primer ACATCAACGACAACTCACCG
αC2 EC3 Forward Primer ACTAACGACAACTCTCCTGCCT
αC2 EC4 Forward Primer GAGGTGGTGCTCACGGAC
αC2 EC5 Forward Primer ACATCAATGACAATCCACCAAG
αC2 EC6 Forward Primer CATTCTGTACCCTACCTCAACCA
α1 EC1 Reverse Primer CCTGAAGACGGGTGGATTATC
α1 EC2 Reverse Primer AAACAGTGGGGCGTTATCATT
α1 EC3 Reverse Primer CCAGTTCTGGAGCATTATCATTT
α1 EC4 Reverse Primer CAGGCGCGTTGTCATTC
α1 EC5 Reverse Primer CACTCGAGGCGCCAGCA
α1 EC6 Reverse Primer ATCAGGTACACGTTGACATCCA
α2 EC1 Reverse Primer CTTTACTGTCATTGGAAATATTGGC
α2 EC2 Reverse Primer ATTGGGCAAAAGTTGGTTCAT
α2 EC3 Reverse Primer GACGTTATTGAGACTTCTGGTGTG
α2 EC4 Reverse Primer GTACTCAGGCTGTGCGAACG
α2 EC5 Reverse Primer CGCGTTGTCGTTCTCGT
α2 EC6 Reverse Primer ATCAGGTACACGTTGACATCCA
α3 EC1 Reverse Primer TACAGCCATTGGAAAAACTGG
α3 EC2 Reverse Primer ATAGATCGTCCTCTCAAACGCT
α3 EC3 Reverse Primer TGATTGAATAACTAACTCAGGTACA
α3 EC4 Reverse Primer AATGCCGGCGCATTGTCGTTCA
α3 EC5 Reverse Primer ACCCGAGGCATCAGCAGT
α3 EC6 Reverse Primer GATCAAGTACACGTTGACATCCA
α4 EC1 Reverse Primer CTTTTGTGTTGCTGGGAACA
α4 EC2 Reverse Primer TAAATGGTTCTGTCAAAAGCTGG
α4 EC3 Reverse Primer AATCTGGGACATTATCGTTGTTG
α4 EC4 Reverse Primer GAGCGTTGTCGTTCACAT
α4 EC5 Reverse Primer CCCGAGGCGCTAGCAGT
α4 EC6 Reverse Primer AGGTATACGTTGACATCCACCAG
α5 EC1 Reverse Primer TTGTTCTTGTCTGGAGAACCTG
α5 EC2 Reverse Primer AATGGATTTATCAAATTCTGGGG
α5 EC3 Reverse Primer TGGCCATCTCTGGGGTA
α5 EC4 Reverse Primer CGGAGCGTTGTCGTTCAC
α5 EC5 Reverse Primer CACTCGAGGCACCAGCA
α5 EC6 Reverse Primer ATCAGGTACACGTTGACATCCA
α6 EC1 Reverse Primer TCCTCTACCGGGAACAAGG
α6 EC2 Reverse Primer AGTGGGAGCATTATCATTCACAT
α6 EC3 Reverse Primer CAGTGCTATCTCAGGGACGTTAT
α6 EC4 Reverse Primer CGGAGCATTGTCATTCATGT
α6 EC5 Reverse Primer AGTACCACCCACCCGAGG
α6 EC6 Reverse Primer ATCAGGTACACGTTGACATCCA
α7 EC1 Reverse Primer TTTGTGTCGCTGGGAACAC
α7 EC2 Reverse Primer ACTGGGGCATTGTCATTGTT
α7 EC3 Reverse Primer CAACTGTGGAGCATTGTCATTTA
α7 EC4 Reverse Primer GGGGCGTTGTCGTTCAC
α7 EC5 Reverse Primer AGTGCCACCCACCCGAG
α7 EC6 Reverse Primer CAGAGTGGGCTTTACCAAACTAC
α8 EC1 Reverse Primer TGGTCTTTTACCCGGAACACT
α8 EC2 Reverse Primer AGTGGGAGCATTATCATTCACAT
α8 EC3 Reverse Primer CAGTGCTATCTCAGGGACGTTAT


α8 EC4 Reverse Primer CGGAGCATTGTCGTTCAC
α8 EC5 Reverse Primer CACCAGTGCCACCCACC
α8 EC6 Reverse Primer ATCAGGTACACGTTGACATCCA
α9 EC1 Reverse Primer TTTGTGTCGCTGGGAACACT
α9 EC2 Reverse Primer ACTGGGGCATTGTCATTGTT
α9 EC3 Reverse Primer CAACTGTGGAGCATTGTCATTTA
α9 EC4 Reverse Primer TGGTGCGTTGTCGTTCAC
α9 EC5 Reverse Primer CTCATCCGAGGTGTCAGCAG
α9 EC6 Reverse Primer GTGAGAACCAACAGGCTAGACAC
α10 EC1 Reverse Primer TGTTACGGAGAACCTGGGC
α10 EC2 Reverse Primer AGATAGGGGCGTTATCATTGG
α10 EC3 Reverse Primer GAAGTGACAATCACCTCAGGTG
α10 EC4 Reverse Primer TACTCGGACTGCGCGAACGCA
α10 EC5 Reverse Primer ACTGACTGCACCGCCCG
α10 EC6 Reverse Primer ATCAGGTACACGTTGACATCCA
α11 EC1 Reverse Primer GGCGGGTTGTCGTTAATGT
α11 EC2 Reverse Primer TGATTTATCAAACTCTGGATCATTGT
α11 EC3 Reverse Primer TTCAGGAGAGTTATCGTTGGTGT
α11 EC4 Reverse Primer GTACTCGGGCTGTGCGAA
α11 EC5 Reverse Primer CTTGTTAACTGCGCCTCCC
α11 EC6 Reverse Primer ATCAGGTACACGTTGACATCCA
α12 EC1 Reverse Primer TTGTTCCCTTTCTCTGAACACC
α12 EC2 Reverse Primer CTTATCAAACGCCGGACCAT
α12 EC3 Reverse Primer GTAACCATTACTTCAGGGACATT
α12 EC4 Reverse Primer GGCGCATTGTCGTTCAC
α12 EC5 Reverse Primer CGCGTTGTCGTTCTCGT
α12 EC6 Reverse Primer GCGATGATGAGGTACACGTTAAT
α13 EC1 Reverse Primer TTCTTGCTTTCAGGGAATATGG
α13 EC2 Reverse Primer ATTGGTAAAATTCCGGGGC
α13 EC3 Reverse Primer AGAAGTGATGGTAACCTCTGGG
α13 EC4 Reverse Primer GGCGCGTTGTCGTTCAC
α13 EC5 Reverse Primer GCGTCAGCAGCGCCGGA
α13 EC6 Reverse Primer CAAGTAAACATTGACATCCACCA
αc1 EC1 Reverse Primer GAGAGGTGAGTTGTCATTGGTGT
αc1 EC2 Reverse Primer AGCGCTCAAATACAGGCG
αc1 EC3 Reverse Primer AAGAGTCAGGAAGTCCAGTTCG
αc1 EC4 Reverse Primer GGGTTGAGGAAAGTTTGGTGTAT
αc1 EC5 Reverse Primer GATAACCGGATAATTGTCATTCC
αc1 EC6 Reverse Primer ACACACGAAGAAAAGTAAGCACC
αC2 EC1 Reverse Primer CGGTGAGTTGTCGTTGATGT
αC2 EC2 Reverse Primer ATAAGTGGACTGGTCAAAGGCAG
αC2 EC3 Reverse Primer ACCTCTGGGGCATTGTCAT
αC2 EC4 Reverse Primer CTTGGTGGATTGTCATTGATGT
αC2 EC5 Reverse Primer TGGTTGAGGTAGGGTACAGAATG
αC2 EC6 Reverse Primer ACCTTTCCCTTACTCCACAGAAG
β1 1200 Forward Primer CATGGATACTGGACCACCTAGCTT
β2 1200 Forward Primer AGACCAGATCCGAATACAACATCA
β3 1200 Forward Primer CAGAGAGACCAGATCCGAGTACAA
β4 1200 Forward Primer GTAACAGAGAGACCACTGGACCG
β5 1200 Forward Primer CAGAGAACACTGGACAGAGAGAGC
β6 1200 Forward Primer AGAGAGAGCAGAGCCGAGTACAAC
β7 1200 Forward Primer GGTAACAGAGAAACCTTTGGATCG
β8 1200 Forward Primer CAGAGCCGAGTACAACGTCACTAT


β8a 1200 Forward Primer CGACAGAGAAGCAAGAGCTGAATA
β9 1200 Forward Primer GGACAGAGAGAGCAAAGCTGAGTA
β10 1200 Forward Primer GCCGAGTACAACATCACTATCACC
β12 1200 Forward Primer TACACTTTGGAAACAGAGAGACCG
β13 1200 Forward Primer ACACCCTACTAACGGAGAGACCAC
β14 1200 Forward Primer TGAAAAAGCACTGGACAGAGAGAG
β15 1200 Forward Primer AGAGAGACCAGAGCCGAGTACAAC

Protocadherin CTF2 fragment cloning. Full-length clones of Pcdh-α6, -β15, -γb7, -γc3 and ActG1 were ordered from Open Biosystems (Clone IDs 5263226, 5767743, 4941167, 4153558, and 3865767). Clones were grown under ampicillin selection under standard conditions and DNA was isolated using Invitrogen Maxi-Prep kits according to the manufacturer’s directions. Putative ICD fragments were generated by PCR amplification with protocadherin-specific primers that added the optimal Kozak sequence and ATG to the 5’ end (GCCGCCATG). Primers were designed using Primer3 (http://frodo.wi.mit.edu/primer3/) with the modification of several search parameters4. Primers were synthesized at Invitrogen and were resuspended at a concentration of 10 µM. The forward primers were designed to begin amplification at the junction between the putative transmembrane domain, as predicted by the SMART domain predictor (http://smart.embl-heidelberg.de/), and the cytoplasmic region. The 3’ primer was designed just upstream of the stop codon.

Pcdh α6 Forward GCCGCCATGACAGCGCTGCGGTGCTCG
Pcdh αconstant Forward GCCGCCATGCCACGACAGCCCAACCCT
Pcdh β15 Forward GCCGCCATGGTGGCAGTGCGGCTGTGC
Pcdh γconstant Forward GCCGCCATGCAAGCCCCGCCCAACACGGACTGG
Pcdh γb7 Forward GCCGCCATGCGCCTGCGACAGTCTTTC
Pcdh γc3 Forward GCCGCCATGAAAGTTTACAAGTGGAAG
Pcdh α Reverse CTGGTCACTGTTGTCAGT
Pcdh β Reverse TTCTAGAAATTCTGATTT
Pcdh γ Reverse CTTCTTCTCCTTCTTGCC
ActG1 Forward GCCGCCATGGAAGAAGAGATCGCCGCG
ActG1 Reverse GAAGCATTTGCGGTGGACGATGGAGGG

Primers were tested on genomic DNA and were all shown to produce bands of the predicted sizes.
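The structure of the cloning forward primers (a fixed GCCGCCATG prefix followed by gene-specific sequence) can be checked programmatically. The Python sketch below is illustrative only: the dictionary keys and the helper `split_primer` are names invented here, and the three sequences are copied from the forward primers listed above.

```python
# Kozak sequence plus start codon added to the 5' end of each forward primer.
KOZAK_ATG = "GCCGCCATG"

# A subset of the cloning forward primers listed above (keys are illustrative labels).
forward_primers = {
    "Pcdh_alpha6": "GCCGCCATGACAGCGCTGCGGTGCTCG",
    "Pcdh_beta15": "GCCGCCATGGTGGCAGTGCGGCTGTGC",
    "ActG1": "GCCGCCATGGAAGAAGAGATCGCCGCG",
}


def split_primer(primer, prefix=KOZAK_ATG):
    """Split a primer into the added prefix and the remaining gene-specific part."""
    if not primer.startswith(prefix):
        raise ValueError(f"primer does not start with {prefix}: {primer}")
    return prefix, primer[len(prefix):]


gene_specific = {name: split_primer(seq)[1] for name, seq in forward_primers.items()}

for name, tail in gene_specific.items():
    print(name, tail, len(tail))
```

Raising an error on a missing prefix, rather than silently returning the whole primer, makes a transcription mistake in the table visible immediately.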

PCR reactions were performed in 20 µl with 1 µM of each primer, 1-2 units of AmpliTaq Gold DNA Polymerase (Applied Biosystems), 1X GeneAmp PCR Gold Buffer (Applied Biosystems), 2 mM MgCl2, 5% DMSO, 2.5 mM of each dNTP and 5 ng isolated full length DNA. Reactions were performed on GeneAmp 9700 thermal cyclers (Applied Biosystems) by denaturing for 5 minutes at 94ºC, followed by 30 cycles of 94ºC for 30 seconds, 55ºC for 30 seconds and 72ºC for 30 seconds, followed by a final extension step of 10 minutes at 72ºC.


PCR products were gel purified using the QIAquick Gel Extraction Kit (Qiagen) according to the manufacturer’s directions. DNA was cloned into the reading frame of the pcDNA3.1/V5-His TOPO expression vector (Invitrogen) to give COOH-terminal fusions of the protocadherin fragments with the V5 and His tags, according to the manufacturer’s directions, using One Shot chemically competent E. coli (Invitrogen). Constructs were checked for correct integration by PCR insert check using T7 and BGH primers (T7_F: TAATACGACTCACTATAGGG, BGH_R: TAGAAGGCACAGTCGAGG). Colonies positive by insert check were then sequenced to ensure correct integration. FoxP2 in the pcDNA3.1/V5-His TOPO expression vector was generously provided by Simone Marticke.

Cell lines. BE(2)-C cells were grown in a 1:1 mixture of Gibco Advanced Minimum Essential Medium and Gibco F12 Nutrient Mixture. HEK293, HT-1080 and HTB-11 cells were grown in Gibco Dulbecco’s Modified Eagle Medium. Jurkat cells were grown in Gibco Advanced RPMI 1640 media. All media were additionally supplemented with 10% fetal bovine serum, 2 mM Gibco GlutaMAX-1, and 1% Gibco Penicillin-Streptomycin. For stable cell lines whose quantities were limited, plates were coated with 0.2% gelatin to help adhesion.

Fragment transfection. Cells were trypsinized from one plate of approximately 80% confluent cells using 0.25% Trypsin/EDTA and were harvested by centrifugation. The cells were resuspended in 5 ml culture media and cell density was counted using Trypan Blue. 2 x 10^6 cells were plated per plate in 10-cm tissue culture plates. Cells were allowed to grow for 48 hours at 37°C in 5% CO2 to approximately 80% confluency. 3:1 FuGene 6 Reagent:DNA complex reactions were prepared using 15 µl FuGene (Roche):5 µg DNA in a total reaction volume of 600 µl OPTI-MEM lacking serum (Gibco). Preparation of complexes followed the standard protocol. After a 30–40 minute incubation, the complex mixture was added drop-wise into each plate and gently swirled to mix. Media was changed daily after transfection. For each cell line transfected, one plate was transfected with green fluorescent protein to confirm that transfection had been successful.
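The 3:1 reagent-to-DNA ratio scales linearly, so the component volumes for a given DNA mass can be worked out as in the Python sketch below. This is a hedged illustration rather than the protocol itself: the function name is invented, and the DNA stock concentration of 1 µg/µl is an assumption made only so that a medium volume can be computed; the original text does not state it.

```python
def complex_mix(dna_ug, reagent_ul_per_ug=3.0, total_ul=600.0, dna_stock_ug_per_ul=1.0):
    """Volumes for a reagent:DNA complex at a fixed µl-per-µg ratio.

    dna_stock_ug_per_ul is an assumed value, not taken from the protocol.
    """
    reagent_ul = reagent_ul_per_ug * dna_ug
    dna_ul = dna_ug / dna_stock_ug_per_ul
    medium_ul = total_ul - reagent_ul - dna_ul
    if medium_ul < 0:
        raise ValueError("reagent and DNA volumes exceed the total reaction volume")
    return {"reagent_ul": reagent_ul, "dna_ul": dna_ul, "medium_ul": medium_ul}


# 5 µg DNA per 10-cm plate, as in the protocol above.
mix = complex_mix(5.0)
print(mix)
```

With a different DNA stock concentration only the DNA and medium volumes change; the reagent volume depends solely on the DNA mass and the ratio.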

Fragment expression. RNA from transiently transfected cells was collected using the RNeasy mini kit using on-plate lysis and on-column DNase digestion 24 and 48 hours post-transfection (Invitrogen). 5 µg of RNA was used to generate first strand cDNA using the SuperScript III First-Strand Synthesis System for RT-PCR (Invitrogen) with oligo-dT primers. GAPDH expression was assessed as a background expression control; eGFP and protocadherin primers were used to assess expression efficiency. GAPDH, eGFP and protocadherin primers were designed using Primer3 with the modification of several search parameters4. Primers were synthesized at Invitrogen or Operon and were resuspended at a concentration of 10 µM. They were tested on known concentrations of genomic DNA to ensure they could be used reliably for quantification. qPCR was carried out as described later in this section.


Pcdh α6 RT Forward  CTAGGCTGCCAGATTCTGTGTTT
Pcdh αconstant RT Forward  ACCTAGAGGAGGCTGGCATTCTA
Pcdh β15 RT Forward  CGATCATTCTCCTGAGTTTCCTG
Pcdh γconstant RT Forward  CTCCCAAAATGGCGATGAC
Pcdh γb7 RT Forward  CAAAAGACCGAGGATCTCTCTCA
Pcdh γc3 RT Forward  TCTGGAGTTGGTAGTGGAGAACC
Pcdh α3 RT Forward  CAGCGTTTGAGAGGACGATCTAT
Pcdh β3 RT Forward  GCTATTCTGTGGCTGAGGAAAAA
Pcdh γa5 RT Forward  GGCTGACCTAGGCAGTATCAAGA
Pcdh γb5 RT Forward  CTGGCACTGGAGAAAACCTTAGA
Pcdh α6 RT Reverse  AGATGGAATTTGAGCCAACATCT
Pcdh αconstant RT Reverse  GATACTGTTGGCCACTGCTGAT
Pcdh β15 RT Reverse  CAAGGGAGCTAGTTTCTGGGATT
Pcdh γconstant RT Reverse  TTGCAGCATCTCTGTGTCAAACT
Pcdh γb7 RT Reverse  TATTTCTGGGCTGTTGTCGTTTT
Pcdh γc3 RT Reverse  AGGATTGTTGTCGTTGATGTCCT
Pcdh α3 RT Reverse  CACTAGGGTACCATTTGGTGCAT
Pcdh β3 RT Reverse  CAGATCCTTTGCTAGGTTGGCTA
Pcdh γa5 RT Reverse  CACTGCCACCACAAGATAGAGTG
Pcdh γb5 RT Reverse  AAGGCAGTCAGGACTAAACGATG

Protein was extracted from cell lines 48 hours post-transfection. Bradford protein assay (Bio-Rad) was used to quantitate protein for equal loading. Protein samples were dissolved in SDS-PAGE sample buffer with heat treatment. Western blotting was performed on Tris-glycine SDS-polyacrylamide gels with a 5% stacking gel and a 12% separating gel. Positope samples were run on each gel to ensure effective antibody staining. Protein was transferred to polyvinylidene difluoride (PVDF) membranes (Bio-Rad), and probed with antibody. Antibodies used included Clontech rabbit anti-GFP antibody, Invitrogen mouse anti-V5, Clontech 6xHis mAb/HRP conjugate, Chemicon rabbit anti-V5 antibody, Sigma mouse anti-V5, Chemicon polyclonal rabbit anti-NFκB p65 subunit, and Chemicon mouse monoclonal anti-TAF250, all at manufacturer-recommended dilutions with appropriate secondary antibodies. Immunoblots were developed using an ECL system (GE Healthcare).

Immunofluorescence microscopy. Chamber slides were coated with 700 µl Poly-d-Lysine (Invitrogen) diluted to 0.1 mg/ml and incubated for 24 hours at room temperature. Poly-d-Lysine was removed and slides were washed three times with tissue culture-grade distilled water. HEK293 cells adhered better to chambers that were not coated with Poly-d-Lysine; therefore, this step was omitted for these cells. Cells were trypsinized from one plate of approximately 80% confluent cells using 0.25% Trypsin/EDTA and were harvested by centrifugation. The cells were resuspended in 5 ml culture media and counted using Trypan Blue, and 2.5 x 10^4 to 5 x 10^4 cells were plated per well in four-well chamber slides. Additional media was added for a total of 800 µl, and cells were grown at 37°C in 5% CO2 overnight. The following day, transfection was performed using either FuGene (Roche) or Lipofectamine LTX (Invitrogen). FuGene transfection took place as described previously with 2.25 µl FuGene (Roche) and PBS for negative control, 750 ng eGFP for positive control, 750 ng FoxP for nuclear control, 750 ng ActG1 for cytoplasmic control, or 750 ng Pcdh-ICD in a total reaction volume of 100 µl OPTI-MEM lacking serum (Gibco). Lipofectamine LTX transfection was performed with an initial incubation step of 750 ng of DNA, 150 µl OPTI-MEM lacking serum and 0.75 µl PLUS reagent for 5 minutes. 1.25 µl Lipofectamine LTX reagent was then added to the slides and incubated at room temperature for 30 minutes. Slides were then returned to the incubator at 37°C in 5% CO2. Cells were incubated for 24–72 hours before fluorescence was captured.

Cells were visualized under the fluorescent microscope (Zeiss AxioSkop2) at 400x magnification. To fix cells, chamber slides were washed three times with room temperature PBS and incubated in 750 µl 4% paraformaldehyde for 30 minutes at room temperature in the dark. Paraformaldehyde was removed, cells were washed once with PBS, and cells were permeabilized with 0.5% Triton X-100 (Roche). The slides were washed three times with PBS, and cells were blocked with 5-10% horse serum for 1 hour. Slides were then incubated with the Invitrogen anti-V5 monoclonal antibody at a dilution of 1:200 in PBS. Primary antibody was removed and cells were washed 4x with PBS for five minutes per wash. Incubation with Alexa Fluor 555 goat anti-mouse IgG secondary antibody (Molecular Probes) at a dilution of 1:400 in PBS was performed for 1 hour in the dark. The secondary antibody was removed and the cells were washed 4x with PBS for five minutes per wash in the dark. Cells were incubated with DAPI diluted 1:1000 in PBS for 5 min. Vectashield mounting media (Vector Labs) and cover slips were applied. Antibody and protocol suitability were initially tested by performing no-primary and no-secondary controls on all transfected cells. Background fluorescence was minimal and the no-antibody controls showed good antibody specificity (data not shown).

Establishment of stable cell lines. Untransfected cell lines were exposed to increasing concentrations of Geneticin (Gibco) ranging from 0 µg/ml to 1100 µg/ml. The lowest concentration that supported no cell growth was 300 µg/ml for BE(2)-C cells and 600 µg/ml for HEK293 cells. The pcDNA3.1/V5-His-TOPO vector was linearized with SspI. 1 µl of SspI enzyme, 2.5 µl SspI buffer and 1 µg of DNA were incubated in a 25 µl reaction volume for 1 hour at 37°C. Four restriction digests, for a total of 4 µg of input DNA, were performed for each insert. DNA was phenol/chloroform extracted, ethanol precipitated, dried and resuspended in 15 µl of water. DNA was quantified on a NanoDrop (Thermo Scientific) and 1 µg of DNA was transfected using FuGene at a 3:1 ratio in a 500 µl OPTI-MEM volume as described previously. Mock-transfected plates were also established to ensure optimal antibody concentration. After 24 hours Geneticin was applied at appropriate concentrations. Selection media was changed every 3 days.

Once colonies had formed on plates, cells were used to generate individual lines. In our original stable cell line derivations, glass cloning cylinders were used to pick colonies.


During re-derivation, cells were counted and diluted for plating at 0.5 cells per well in 96-well format under Geneticin selection. Each well was individually checked and all wells containing more than one cell were discarded. A sample of undiluted/unpicked cells was re-plated, expanded and frozen for use as a pooled stable sample. Once the single colonies reached 80% confluence, wells were individually trypsinized and expanded onto 12-well plates under Geneticin selection. Stable cell lines were initially checked for vector expression by Western blot only. During our second round of stable cell line generation, expression was checked by immunofluorescence microscopy only. The two highest expressing colonies for each construct were chosen to proceed.
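As a rough illustration of why each well had to be inspected after plating at 0.5 cells per well, the sketch below assumes that the number of cells landing in a well follows a Poisson distribution. That model is an assumption made here for illustration and is not stated in the protocol; the function name is hypothetical.

```python
import math

def well_occupancy(mean_cells_per_well: float) -> dict:
    """Poisson estimate of the fraction of wells holding 0, 1, or more than 1 cell.

    Assumes cells are distributed independently and uniformly across wells.
    """
    p_empty = math.exp(-mean_cells_per_well)
    p_single = mean_cells_per_well * p_empty
    p_multiple = 1.0 - p_empty - p_single
    return {"empty": p_empty, "single": p_single, "multiple": p_multiple}

# Example: the 0.5 cells per well dilution used during re-derivation.
fractions = well_occupancy(0.5)
for label, fraction in fractions.items():
    print(f"{label}: {fraction:.3f}")
```

Under this assumption roughly 30% of wells would receive exactly one cell and about 9% would receive more than one, which is why visual screening of every well remains necessary even at this dilution.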

Nuclear extraction. Nuclear extracts were prepared using the NE-PER Nuclear and Cytoplasmic Extraction Reagents (Pierce) according to manufacturer’s directions. Briefly, stable CTF2-expressing cells were grown to approximately 80% confluence and collected with trypsin-EDTA. Cells were washed in ice-cold PBS and pelleted. The pellet was resuspended in cold cytoplasmic extraction reagent I containing protease inhibitors. The mixture was vortexed vigorously and incubated on ice for 10 minutes. Cold cytoplasmic extraction reagent II with protease inhibitors was then added, and the mixture was vortexed and incubated cold for 1 minute. The mixture was briefly vortexed and then centrifuged, and the supernatant was removed as the cytoplasmic fraction. The remaining pellet was resuspended in cold nuclear extraction reagent with protease inhibitors, vortexed and incubated at 4°C for 40 minutes with additional vortexing at 10-minute intervals. The samples were centrifuged and the supernatant collected as the nuclear fraction.

Chromatin preparation. Methods were adapted from the Farnham Lab and are outlined in Trinklein et al. 2004 and Cooper et al. 2007, with several adaptations 9, 10. Formaldehyde was added to a final concentration of 1% to cross-link proteins and DNA for 10 minutes. Glycine was added to a final concentration of 0.125 M to stop the cross-linking reaction. Plates were scraped and cells were transferred to tubes with 2 x 10^7 to 5 x 10^7 cells per tube for centrifugation; the pellet was washed once with 1x phosphate-buffered saline (PBS). The pellet was resuspended and lysed in Farnham lysis buffer and incubated on ice for 10-90 minutes. This was centrifuged to collect the crude nuclear extract, which was then resuspended in RIPA buffer. This chromatin solution was sonicated at 4°C.

Chromatin immunoprecipitation. Methods are outlined in Trinklein et al. 2004 and Cooper et al. 2007, with adaptations 9, 10. Sonicated chromatin was incubated with 10 µg of mouse monoclonal V5 antibody (Invitrogen) coupled to sheep anti-mouse IgG magnetic beads (Dynal Biotech). MockIP samples for protocadherin experimentation were generated differently than in Trinklein et al. 2004 and Cooper et al. 2007. For protocadherins, sonicated chromatin was incubated with sheep anti-mouse IgG magnetic beads that had been subjected to the coupling protocol, but without the addition of antibody. MockIP samples for NRSF experimentation were generated as in previously published protocols. The magnetic beads were washed five times with wash buffer and once with TE buffer. DNA was eluted by incubating with elution buffer for 1-2 hours. Cross-links were then reversed overnight at 65°C. Phenol-chloroform extraction was performed and the aqueous phase desalted and purified using the QIAquick PCR purification kit (Qiagen).

Immunoprecipitation and transformation. One plate of each Pcdhγc3 stable cell line was trypsinized, centrifuged and split onto five 15-cm plates. Plates were allowed to grow for 72 hours to approximately 80% confluence. Two plates were transfected with 1 µg empty pGL3 vector (Promega), and two more plates were transfected with 1 µg pGL3 vector containing the Pcdhγc3 promoter (a generous gift from Patrick Collins) using the Lipofectamine LTX transfection method described previously. Transfected cells were allowed to grow for 48 hours and then chromatin was prepared without sonication. Lysis was varied between 10 and 90 minutes. The chromatin extract was incubated overnight at 4°C with magnetic beads (Dynal Biotech) that had been coupled with 10 µg of anti-V5 monoclonal antibody (Invitrogen). The DNA was eluted from the beads and cross-links were reversed by incubation at 65°C overnight. Unbound DNA was also saved and reverse cross-linked as the mockIP sample. DNA was purified by phenol-chloroform extraction and ethanol precipitation with the addition of glycogen to aid visualization of the DNA pellet. This DNA was used to transform chemically competent cells, which were grown under ampicillin selection, using pUC19 as a positive transformation control.

Column precipitation and DNA visualization. Antibodies were coupled to the column using the AminoLink Immobilization Kit (Pierce). We used a 1:1 dilution of Invitrogen anti-V5 mouse monoclonal antibody provided at a concentration of 1 mg/ml with coupling buffer at a total volume of 600 µl per column. Immobilization was performed according to manufacturer’s instructions. Briefly, the column was equilibrated by washing with coupling buffer. Subsequently the antibody/coupling buffer solution was added, along with 50 mM sodium cyanoborohydride and was incubated overnight at 4ºC with end-over-end rocking. The column was then drained and washed with coupling buffer and then quenching buffer. Quenching buffer with 50 mM sodium cyanoborohydride was then incubated for 30 minutes with end-over-end rocking. The column was washed with wash solution, then degassed buffer with 0.05% Tween-20. Columns were stored with at least 1 mL of degassed sodium azide buffer at 4ºC. All column flow-through was kept to calculate coupling efficiency.

We used Centricon columns for buffer exchange. Columns were washed once with 1% ethanolamine to block and three times with PBS, and were then stored at 4°C in fresh PBS. Chromatin samples were prepared as detailed previously with incubation in Farnham lysis buffer for 1 hour at 4°C; samples with and without sonication were used. Samples were applied to the columns, washed three times with coupling buffer and eluted into 1 mL of coupling buffer.


Coupled columns were washed with binding/wash buffer. 300 µl of buffer-exchanged chromatin was applied to antibody-coupled columns and incubated at 4°C overnight. Columns were then centrifuged and the flow-through saved. An additional 300 µl of chromatin was applied to the same column and incubated at room temperature for two hours. The column was then washed with binding/wash buffer, and the bound sample was eluted in elution buffer in three eluate fractions. The input chromatin sample before buffer exchange, the buffer exchange flow-through, the buffer exchange eluate, the flow-through after overnight incubation, the binding/wash buffer flow-through and the three eluate fractions were all amplified using random hexamers (Invitrogen) and run on an agarose gel to look for evidence of DNA binding to V5-fragments.

ChIP-chip. We used six samples of Pcdh-γc3-CTF2 as our starting chromatin, as well as two mockIP chromatin samples. Each chromatin sample contained 2 x 10^7 cells. Our six pooled samples were prepared from two separate growths of cells with the chromatin and the immunoprecipitations prepared separately. Our two mockIP samples were also from these two separate preparations. Immunoprecipitations were carried out as described previously. Extracted DNA samples were pooled. Samples were amplified using ligation-mediated PCR. First, DNA was blunted using T4 polymerase. Equal parts blunting mix (2X NEB buffer 2, 5 µg/µl BSA, 0.5 mM dNTP, 1.5 units T4 DNA polymerase) and immunoprecipitated DNA were incubated at 12°C for 20 minutes. Sodium acetate was added to 0.3 M along with 10 µg of glycogen. DNA was phenol/chloroform extracted and ethanol precipitated, then resuspended in 25 µl water. Equal parts ligase mix (2.5X ligase buffer, 7.5 µM linkers, 1.5 units T4 DNA ligase) and extracted DNA were incubated overnight at 16°C. Sodium acetate was added to 0.3 M along with 130 µl ethanol, and the mixture was incubated at -80°C for 30 minutes. DNA was pelleted and washed as in ethanol precipitation. A 25 µl sample was combined with 15 µl of PCR mix A (3X NEB Thermopol buffer, 1 mM dNTP mix, 3 mM oligo oJM102) and run on GeneAmp 9700 thermal cyclers (Applied Biosystems) under the following conditions: 4 minutes at 55°C, 5 minutes at 72°C, 2 minutes at 95°C, then 32 cycles of 30 seconds at 95°C, 30 seconds at 60°C and 1 minute at 72°C, followed by 4 minutes at 72°C and a hold at 4°C. Midway through the 4 minutes at 55°C step, 1X Thermopol buffer and 2.5 units Taq polymerase were added. Samples were then run through the Qiagen PCR cleanup column according to manufacturer’s directions with elution into 100 µl water.

Both IP and mockIP samples were quantitated and 1 µg was used for labeling. Labeling was carried out using the Invitrogen Bioprime Plus Array CGH Labeling Module according to manufacturer’s directions using Cy3 and Cy5 dyes (Amersham). Labeling was checked to calculate percent incorporation and yield. Dye swap was not performed due to low DNA yield. Array hybridization and scanning were carried out as described by Cooper et al. 2007 9. We used the Agilent 244K human promoter 2-chip array. These arrays contain approximately 17,000 well-defined RefSeq transcripts. The region from 5.5 kilobases upstream to 2.5 kilobases downstream of the transcriptional start site of each gene is covered by multiple probes.

Array analysis. Scripts used for analysis were generously provided by Simone Marticke. We first defined probe regions, where a region is defined as a set of at least three probes with adjacent probes being less than 400 base pairs apart. We then looked for regions with significant enrichment in the immunoprecipitated sample. A region was defined as significant if the moving average z-score was at least 2 (corresponding to a p-value of 0.02).

qPCR Validation (Protocadherins, REST/NRSF and MGAM). We performed real-time PCR to measure enrichment at promoters as described in Trinklein et al. 2004 and Mortazavi et al. 2006 10, 11. Briefly, Primer3 software was used to design primers using 500 base pairs upstream of the predicted site or the center of the positive probe, and 500 base pairs downstream of the site/probe. Primers for protocadherin CSEs were generously provided by Patrick Collins. Amplicons were designed with an ideal size range of 60-100 base pairs, but some amplicons were up to 217 base pairs. For NRSF, each amplicon was measured on a standard curve of 50 nanograms, 5 nanograms, 500 picograms and 50 picograms mockIP DNA. For protocadherins and MGAM, the lowest concentration was dropped. Bio-Rad iCyclers were used to measure incorporation of SYBR-green. The threshold cycles of target DNA were fit to the standard curve for each amplicon to determine DNA quantity. For NRSF, this quantity was divided by the amplicon size. Fold enrichment was calculated by dividing the quantity of DNA determined by standard curve fitting by the average quantity of DNA calculated from two to five negative control primer sets. Negative control primers target non-genic, non-conserved DNA regions.
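The region-calling rule described under Array analysis can be sketched roughly as follows. This is not the analysis script that was actually used, which is not reproduced here. The sketch assumes that per-probe z-scores have already been computed for one chromosome, that the moving average uses a window of three probes, and that a region is flagged when any window reaches the cutoff; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Probe = Tuple[int, float]  # (position in bp, z-score of enrichment)

def define_regions(probes: List[Probe], max_gap: int = 400,
                   min_probes: int = 3) -> List[List[Probe]]:
    """Group probes into regions whose adjacent probes are < max_gap bp apart."""
    regions, current = [], []
    for probe in sorted(probes):
        if current and probe[0] - current[-1][0] >= max_gap:
            if len(current) >= min_probes:
                regions.append(current)
            current = []
        current.append(probe)
    if len(current) >= min_probes:
        regions.append(current)
    return regions

def moving_average(values: List[float], window: int = 3) -> List[float]:
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def significant_regions(probes: List[Probe], z_cutoff: float = 2.0,
                        window: int = 3) -> List[Tuple[int, int, float]]:
    """Return (start, end, best moving-average z) for regions reaching the cutoff."""
    hits = []
    for region in define_regions(probes):
        averages = moving_average([z for _, z in region], window)
        if averages and max(averages) >= z_cutoff:
            hits.append((region[0][0], region[-1][0], max(averages)))
    return hits

def one_sided_p(z: float) -> float:
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For a standard normal distribution, `one_sided_p(2.0)` evaluates to about 0.023, in line with the p-value of 0.02 quoted for a z-score of 2.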

Negative control primer 1 Forward  TGTTTATCCTTCCAAGCAGCAGT
Negative control primer 2 Forward  TTTGCTTTGGTTTGGTTTGATTT
Negative control primer 3 Forward  GTCAAGCATTCCTGTAGGCTGAA
Negative control primer 4 Forward  ATTTCAACTCCCAGAAAACTGGC
Negative control primer 5 Forward  TAGCCAAAAGAAGGAAGCAACAG
Negative control primer 1 Reverse  AAGGTTCCTGCCTATTCTCCAAC
Negative control primer 2 Reverse  ATAAAGGCTAAGATGGGCTCCAA
Negative control primer 3 Reverse  AAAAATTGCTCATTGGAGACCT
Negative control primer 4 Reverse  TATAGAAGGGGGATAGGGGAACA
Negative control primer 5 Reverse  CTAAAGGTAGGGCTGGAAGCAAT

Pcdhα1 CSE RT Forward  TGCAGACGGATATAGATTGCATTT
Pcdhα3 CSE RT Forward  TGAAAGAAAAGTTGTGAACTCATGG
Pcdhα4 CSE RT Forward  CACGTGAAATTCTGTGGTGGTAA
Pcdhα8 CSE RT Forward  AAGTGGCTAAACCGAAAAGAACC
Pcdhα10 CSE RT Forward  CATCAATGGAAAATATGAAGACTGA
Pcdhα11 CSE RT Forward  GGAAAGTTCAAACGACATACAAGG
Pcdhα12 CSE RT Forward  ACAGTAGAGTGTGTGGGGGTTTC
Pcdhα13 CSE RT Forward  TAGAAGAACCCAGATATTGCGGA
Pcdhβ1 CSE RT Forward  CCTGAAAGGGAATTAACAGGTGA
Pcdhβ3 CSE RT Forward  TACACACACCCAAATGGGTACAA
Pcdhβ4 CSE RT Forward  TGGAATTTAGGTAGCTTTCACCAA
Pcdhβ5 CSE RT Forward  TCCGGATGTATTTATGAATTACTTTTT
Pcdhβ6 CSE RT Forward  GCAAAATAAAAACACCTTCTCCAA
Pcdhβ7 CSE RT Forward  AACTTCCCAAATACGTTGTCTTGA
Pcdhβ9 CSE RT Forward  AGAATTTAGGATGAAATGGCAAA
Pcdhβ10 CSE RT Forward  TGAAATGCTATTGACCTGAAAACA
Pcdhβ11 CSE RT Forward  TCCGCCATCTTTTTGTTCTAGTC
Pcdhβ14 CSE RT Forward  ATTTTGCCTCCAATATTCAACCA
Pcdhγa1 CSE RT Forward  TGAGAAGGTTTTCAGTATTTGCTAGG
Pcdhγa2 CSE RT Forward  TTTACTTTCCATTGCATGTATCACTT
Pcdhγa6 CSE RT Forward  TGAGCGTAATCATTTCTTCTGGA
Pcdhγa7 CSE RT Forward  CGAAATATCCTTTCTGGGAGTTCA
Pcdhγa9 CSE RT Forward  TCAACAAAGGCAAACAAACAAGA
Pcdhγa10 CSE RT Forward  GCAGAAATAAAATCCTCTGTGTGA
Pcdhγa11 CSE RT Forward  TCAGTGTATTGTGTGCATCAATGT
Pcdhγa12 CSE RT Forward  GTAGTGTTTTCTTTATCAGCCATCTG
Pcdhγb1 CSE RT Forward  TAGATTCAAGAGGTTGCGCTTTG
Pcdhγb2 CSE RT Forward  TTCTTAGCAAATATTGACAAGCCA
Pcdhγb3 CSE RT Forward  TTTGGAGCTGATTTAGTCACCATT
Pcdhγb6 CSE RT Forward  GGAGACCGAATTCAAAATGAAAA
Pcdhγb7 CSE RT Forward  TCCTTGAAAGAGGTAGAGAAAAGTCA
Pcdhγc4 CSE RT Forward  ACTCTCCAATGGCTACTCTCCCT
Pcdhγc5 CSE RT Forward  CCCCAGCTCCACTCAAATTC
Pcdhα1 CSE RT Reverse  TGATCAGTAGCTCCCTTCTATTGAC
Pcdhα3 CSE RT Reverse  AGTCCTTGAGAAACTGGCTGATG
Pcdhα4 CSE RT Reverse  CCTCACTGTTCAGACCAACCTTT
Pcdhα8 CSE RT Reverse  AATGTGGGTGCGGAGAAAAA
Pcdhα10 CSE RT Reverse  TTCTGCAGGAAGAAAGAAACGG
Pcdhα11 CSE RT Reverse  CGCTGTTATTTCTTTTGTGGAGAA
Pcdhα12 CSE RT Reverse  TCCCTTCTTCCTCCTTCTCTCAT
Pcdhα13 CSE RT Reverse  AAGACTTTGAACATTGACCAACCA
Pcdhβ1 CSE RT Reverse  AACTCCCCAGTCTAGCGAATGTA
Pcdhβ3 CSE RT Reverse  ATAGCATCCTCAGAGCAATTCCC
Pcdhβ4 CSE RT Reverse  ATATCAGTGTTCACAACGCCATC
Pcdhβ5 CSE RT Reverse  ACCGTGCCAGCCTATGTACTG
Pcdhβ6 CSE RT Reverse  ATGCACAATGCCTGTGGTTTAAT
Pcdhβ7 CSE RT Reverse  GCGAGCTTTTTAGGGCTCATTA
Pcdhβ9 CSE RT Reverse  CTTAATGGCGTCTGAGTTATCCG
Pcdhβ10 CSE RT Reverse  ACCGAGTCTGCAGAGCTTTCATT
Pcdhβ11 CSE RT Reverse  TCTGTAACATCCTTTCTTTGTGAAAT
Pcdhβ14 CSE RT Reverse  GGAGCTGCGTAGGCAGTTGTATT
Pcdhγa1 CSE RT Reverse  CCTCCTCCTGACCTGTGTGTACT
Pcdhγa2 CSE RT Reverse  GAGCTCTGGCTGCGTTTTCT
Pcdhγa6 CSE RT Reverse  GAGGGAGGGACTCAATGGC
Pcdhγa7 CSE RT Reverse  AGCAGGATCTCCGCTTTTCTC
Pcdhγa9 CSE RT Reverse  ATTGGATGTCTCGCTGCCTTTTA
Pcdhγa10 CSE RT Reverse  GGATGTAACTTGAGTTTCCGTTGA
Pcdhγa11 CSE RT Reverse  GTAAAACTCGGTGGTTTCTGTGG
Pcdhγa12 CSE RT Reverse  TCCGGATGGAGTCCTGATTTT
Pcdhγb1 CSE RT Reverse  CTCCTGCAGAATTGAGAAGCC
Pcdhγb2 CSE RT Reverse  CTAAGAGGCTCCCAAACGCTTCT
Pcdhγb3 CSE RT Reverse  ACTTTTCTTCCGAGCTGGCTG
Pcdhγb6 CSE RT Reverse  AGGCCAGAGGCTGACGGAT
Pcdhγb7 CSE RT Reverse  CAGGCTAGAGGCTGAGGGAT
Pcdhγc4 CSE RT Reverse  TTTTCCTAACACACCGACTGACC
Pcdhγc5 CSE RT Reverse  CCTCTTAAAAACTGCTCGGAGGT
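The fold-enrichment calculation described under qPCR Validation can be sketched as below. This is a minimal illustration and not the analysis code that was actually used. It assumes a log-linear standard curve fitted by ordinary least squares, and the threshold cycle values in the example are invented for demonstration. Function names are hypothetical. For NRSF, each quantity would additionally be divided by its amplicon size before the ratio is taken; applying that step to the negative controls as well is an assumption.

```python
import math
from statistics import mean
from typing import List, Tuple

def fit_standard_curve(quantities_ng: List[float],
                       threshold_cycles: List[float]) -> Tuple[float, float]:
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities_ng]
    x_bar, y_bar = mean(xs), mean(threshold_cycles)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, threshold_cycles))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def quantity_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert the standard curve to recover a DNA quantity in ng."""
    return 10 ** ((ct - intercept) / slope)

def fold_enrichment(target_quantity: float,
                    negative_control_quantities: List[float]) -> float:
    """Target quantity divided by the mean negative-control quantity."""
    return target_quantity / mean(negative_control_quantities)

# Invented example: standards at 50, 5, 0.5 and 0.05 ng of mockIP DNA.
standards_ng = [50.0, 5.0, 0.5, 0.05]
standard_cts = [20.00, 23.32, 26.64, 29.96]
slope, intercept = fit_standard_curve(standards_ng, standard_cts)
target_ng = quantity_from_ct(22.0, slope, intercept)
```

Because each amplicon has its own standard curve, the fit would be repeated per primer pair before quantities are compared.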

Proteasome inhibition. 2.5 x 10^4 Pcdhα6-CTF2 rederived stable cells were plated per chamber. After 24 hours of growth, cells were incubated with MG132 (Z-Leu-Leu-Leu-aldehyde) (Sigma) at 10 µM for 24 hours or 50 µM for 12 hours. Colonies were also separately incubated with Lactacystin (Sigma) at 5 µM for 24 hours or 25 µM for 12 hours. Both proteasome inhibitors were diluted in DMSO. Control colonies were plated at the same time and mock treated with equivalent amounts of DMSO for equal times. In both cases, the higher concentration applied for the shorter period preserved cellular morphology and prevented cell death as compared to the lower concentration applied for the longer period. Cells were then visualized by immunofluorescence.

ChIP-Seq. Chromatin immunoprecipitation was carried out as described previously. Library construction was performed as detailed in Johnson et al. 2007 12. All reagents are found in the Illumina DNA Library Construction Kit (Illumina). Libraries were constructed for immunoprecipitated and total chromatin from Pcdhα6-CTF2 stable cell lines. Briefly, end-repair and the addition of dA were performed. Adapters were then ligated to the fragments and the library was amplified by PCR. Size selection of the library was carried out by gel electrophoresis, with excision of DNA in the 150-300 base pair range, and purification using the QIAEX II agarose gel purification kit (Qiagen). A second round of PCR amplification was then performed. The DNA was quantified and then used for Solexa/Illumina sequencing.

ChIP-Seq peak calling and identification of enriched regions. Immunoprecipitated and control reads were analyzed together to identify regions with significant enrichment of reads in the immunoprecipitated sample versus the control sample. Peak calling was performed with the QuEST ChIP-Seq data analysis method outlined in Valouev et al. 2008 13. Briefly, separate profiles are generated for forward and reverse tags, and the experiment-specific peak shift value is determined by calculating the distance between forward and reverse read pile-ups around predicted transcription-factor binding sites. The forward and reverse profiles are then shifted by the peak shift and combined into a single profile of enriched peaks. Peak calling is then performed by looking for enrichment of peak profiles in the immunoprecipitated data compared to the control data. QuEST reports enrichment scores and genome coordinates for each peak identified. False positive rates are estimated by splitting the control data into pseudo-IP and pseudo-control samples. Peak calling is then performed on this sample set, and any peak predicted is deemed a false positive.
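The idea behind the peak shift can be illustrated with a deliberately simplified sketch. It is not QuEST's implementation, which works on kernel density estimates of the tag profiles. Here, per-base tag counts stand in for the profiles, the strand offset is taken as the distance that best aligns the forward and reverse pile-ups, and the peak shift is half that offset. All names are hypothetical.

```python
from typing import List

def estimate_strand_offset(forward_counts: List[int],
                           reverse_counts: List[int],
                           max_offset: int) -> int:
    """Offset d (bp) maximising sum(forward[i] * reverse[i + d])."""
    n = len(forward_counts)
    best_offset, best_score = 0, float("-inf")
    for d in range(max_offset + 1):
        score = sum(forward_counts[i] * reverse_counts[i + d]
                    for i in range(n - d))
        if score > best_score:
            best_offset, best_score = d, score
    return best_offset

def combined_profile(forward_counts: List[int],
                     reverse_counts: List[int],
                     shift: int) -> List[int]:
    """Move forward tags downstream and reverse tags upstream by `shift`, then sum."""
    n = len(forward_counts)
    combined = [0] * n
    for i in range(n):
        if 0 <= i + shift < n:
            combined[i + shift] += forward_counts[i]
        if 0 <= i - shift < n:
            combined[i - shift] += reverse_counts[i]
    return combined

# Toy example: forward pile-up near position 40, reverse pile-up near 100.
forward = [0] * 150
reverse = [0] * 150
forward[39:42] = [5, 10, 5]
reverse[99:102] = [5, 10, 5]
offset = estimate_strand_offset(forward, reverse, max_offset=100)
profile = combined_profile(forward, reverse, shift=offset // 2)
```

In this toy case the two pile-ups are 60 bp apart, so shifting each strand by 30 bp places the combined peak midway between them, at the presumed binding site.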


Cistematic. Information about Cistematic and the scripts used in the generation of REST/NRSF data can be found in Mortazavi et al. and at http://cistematic.caltech.edu and http://cistematic.caltech.edu/~alim/cispaper 11.

MGAM genotyping. PCR optimization was carried out on samples from Coriell’s International HapMap project, including samples from each of the four represented populations: Yoruba, CEPH European, Japanese and Chinese. Kidney DNA was graciously provided by Heather Wheeler. The collection and description of these samples can be found in a manuscript draft available publicly online 14. Kidney genotyping was performed in independent triplicate to ensure reproducibility. Genotypes were called independently by four individuals. Only unambiguous genotypes, those called by each person and obvious in 2/3 of the PCR runs, were used for correlation with expression. Only 2 of the 70 samples genotyped were excluded due to ambiguity.
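One way to read the acceptance rule above is sketched below. The interpretation is an assumption: a PCR run counts as "obvious" when all four callers gave the same call, and a sample is accepted when at least two of the three runs are obvious and agree with one another. The genotype labels and the function name are hypothetical.

```python
from typing import List, Optional

def consensus_genotype(calls_by_run: List[List[Optional[str]]],
                       min_runs: int = 2) -> Optional[str]:
    """Return the agreed genotype, or None if the sample is ambiguous.

    calls_by_run[r] holds one call per caller for PCR run r; None means
    that the caller could not make a call for that run.
    """
    unanimous = []
    for run_calls in calls_by_run:
        # A run counts only if every caller made the same call.
        if run_calls and None not in run_calls and len(set(run_calls)) == 1:
            unanimous.append(run_calls[0])
    # Reject if unanimous runs disagree with each other or are too few.
    if len(set(unanimous)) != 1 or len(unanimous) < min_runs:
        return None
    return unanimous[0]

# Example: two clear runs and one run with a missing call.
sample = [
    ["del/del"] * 4,
    ["del/del"] * 4,
    ["del/del", None, "del/del", "del/del"],
]
result = consensus_genotype(sample)
```

Samples for which the function returns `None` would correspond to those excluded as ambiguous.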

Final PCR reactions were performed in 10 µl with 1 µM of each primer, 1 unit of Platinum Taq DNA Polymerase (Invitrogen), 1X PCR Buffer (Invitrogen), 2 mM MgCl2, 2.5 mM of each dNTP and 10 ng genomic DNA. Reactions were performed on GeneAmp 9700 thermal cyclers (Applied Biosystems) by denaturing for 5 minutes at 94°C, followed by 30 cycles of 94°C for 30 seconds, 65°C for 30 seconds and 72°C for 3 minutes, followed by a final extension step of 7 minutes at 72°C. Primers were designed using Primer3 (http://frodo.wi.mit.edu/primer3/) with the modification of several search parameters 4. Primers were synthesized at Operon and were resuspended at a concentration of 10 µM.

MGAM deleted allele forward  CTTTACTCCCTGGCTGTTGC
MGAM deleted allele reverse  GAGACCCATGGGCTTAACAA
MGAM non-deleted allele forward  CAGTACCCACAGGCAGGATT
MGAM non-deleted allele reverse  AGGAGGCCAGAGACTGACAA

MGAM qPCR. cDNA synthesis and quantitative PCR were carried out as described previously for the protocadherins.

MGAM exon 16 qPCR forward  AATCCTGGATGGGTACCTGTTCT
MGAM exon 40 qPCR forward  TTAATGCAAGAGGAGAGTGGAAGA
MGAM exon 48 qPCR forward  AAGCATCGATGTGACTGACAGAA
MGAM exon 16 qPCR reverse  ATGTCATACTGCTTGCCCCAGT
MGAM exon 40 qPCR reverse  CGGACATGAAGATTAATGTGGTC
MGAM exon 48 qPCR reverse  CACAGAGTGCTTATCCACGTCAA


Chapter 1: Introduction

The central nervous system in mammals is an immensely complex structure. It contains billions of neurons with trillions of specific synaptic connections. While much is known about the various subtypes of neurons present, very little is known about how neurons recognize their targets to make appropriate synapses. It has long been proposed that the genome must code for an extraordinarily diverse set of cell-surface molecules, whereby each neuron expresses only a subset of molecules and makes synapses with target neurons through recognition of these cell-surface molecules in a “lock and key” manner 15. The cadherin superfamily, due to its diversity and cell adhesion properties, has been hypothesized to be the lock and key that neurons need to form complex, specific interactions (see, for example, 16-19).

The cadherin superfamily comprises transmembrane cell-surface proteins that function in calcium-dependent cell-cell interactions in the nervous system (reviewed in 16, 20, 21).

Cadherins are divided into classic and nonclassic cadherin subgroups, with the classic cadherins being further grouped into type I and II molecules (reviewed in 22, 23). Several dozen cadherins are expressed in the vertebrate brain, and variation in their cytoplasmic domains is thought to confer varying functions which are utilized for diversification in the brain 16.

The classic cadherins are thought to participate in the compartmentalization of the CNS as well as to mediate synapse target recognition during the formation of neural circuits 24-26. The classic cadherins N- and E-cadherin are localized to synaptic junctional complexes in a mutually exclusive pattern, although many synapses appear to contain neither 27. While the pattern of N- and E-cadherin expression may help to delineate some classes of neurons, it is clear that a more diverse set of proteins would be required for specification of individual synaptic connections.

A large group of clustered protocadherins may contain the molecular diversity necessary for this task. The clustered protocadherins are the largest subgroup of nonclassic cadherins and are expressed predominantly in the brain 28, 29. Additional single protocadherins exist throughout the genome, but protocadherin gene clusters are present at three chromosomal locations (5q31, 13q21 and Xq21) 30, 31, 32, 33, 34, 35. The largest of the clustered protocadherin groups is the 5q31 group, which includes more than 50 members and spans approximately 700 kb 31. In this manuscript we will use the word ‘protocadherin’ to refer to the clustered protocadherins on 5q31, and the word ‘cadherin’ to refer to the classical cadherins, unless otherwise indicated. The 5q31 protocadherins are arranged as a tandem group of closely linked clusters of paralogs and can be sub-categorized in humans into three groups based on spatial organization and sequence similarity: Pcdh-α (also sometimes called the cadherin-related neuronal receptors [CNRs]), Pcdh-β and Pcdh-γ 31, 36. While they share characteristic features with the broader group of protocadherins, evolutionary analysis reveals that they comprise their own group of paralogous proteins which cluster separately from other human protocadherins 37.

Due to their relative divergence from other protocadherins in both sequence and functional behavior, the 5q31 protocadherins are generally assumed to play a different role than other protocadherins. Their function remains unknown, but multiple lines of evidence suggest they are the lock and key molecules necessary for the establishment of specific neural circuits during synaptogenesis 17, 18, 36. They are structurally similar to the classical cadherins known to be involved in adhesion, are diverse, are expressed predominantly in neurons, and are differentially expressed by neurons of the same population 38, 39, 40, 41, 42, 43, 44, 45. Structure, diversity and expression of the protocadherins and how these factors relate to their proposed roles in synaptogenesis are detailed below.

Protocadherin diversity

Analysis of the genomic organization and expression of the protocadherin cluster reveals that the protocadherins comprise a diverse set of tightly regulated molecules. The human, mouse and rat have a conserved genomic organization of Pcdh-α, -β, and -γ clusters 31, 40. Elephant sharks have paralogs distributed among four subclusters (Pcdh-δ, -ε, -µ, and -ν), with no obvious orthologous relationship to any of the originally identified vertebrate clusters (Pcdh-α, -β, or -γ) 46. Coelacanths have three protocadherin subclusters, Pcdh-α, -β, and -γ, as well as one paralog orthologous to the Pcdh-δ cluster in elephant sharks 46, 47. Due to the teleost whole genome duplication, fish like fugu and zebrafish contain two unlinked protocadherin clusters with orthologs for the Pcdh-α, Pcdh-δ and Pcdh-γ clusters only 46, 47, 48, 49, 50, 51. Thus, the Pcdh-δ cluster was lost in mammals, the Pcdh-β cluster was lost in ray-finned fish, the Pcdh-α, -β, and -γ clusters were lost in cartilaginous fish, and the Pcdh-ε, -µ, and -ν clusters were lost in bony vertebrates (Figure 1) 46. This raises the possibility that the ancestral vertebrate genome contained many protocadherin clusters that have experienced great loss and expansion in a lineage-specific manner 46.


The Pcdh-α and Pcdh-γ clusters are composed of a tandem array of variable exons and a set of three constant exons at the 3’ end of the variable exons for that cluster (Figure 1) 52. Each variable region exon codes for the extracellular domain of the protocadherin protein, the transmembrane domain and a short cytoplasmic segment. These unusually large variable exons are uninterrupted by introns, a characteristic feature of protocadherins 37. The portion of the cytoplasmic tail coded for by the variable exon is conserved between Pcdh-γ members of the same subclass type -a, -b, and -c. The constant region provides a common cytoplasmic tail for all members of the cluster. The Pcdh-β cluster contains only the tandem variable exons. The Pcdh-α and Pcdh-γ clusters additionally contain c-type variable exons which are more similar to each other than to other members of their own clusters 52. They are located downstream of the variable exons, nearer to the constant exons 31. They lack conserved motifs and their expression pattern is temporally different from the other protocadherins 18, 43, 53.

Unlike other protocadherins, the 5q31 protocadherins appear to be a vertebrate-specific innovation50. The clustered protocadherin orthologs are absent from invertebrate genomes54, 55, 56, 57. Sequencing of multiple vertebrate species has revealed that the protocadherin cluster is present in all vertebrates and, due to its tandem organization and predicted redundancy, is subject to dynamic lineage-specific gene loss, tandem duplication and gene conversion31, 47, 48, 49, 50, 51, 52, 58, 59, 60. Even closely related species such as chimpanzee/human and mouse/rat have different numbers of protocadherin genes50. The promoter and variable exon together appear to be the fundamental unit of duplication, and each copy is free to diversify after duplication. The high level of conservation observed among all mammalian protocadherin promoters indicates that protocadherin regulation must be highly conserved49.

In humans, there are approximately 53 tandemly arrayed variable exons in total; because of the prevalence of predicted pseudogenes, it is difficult to determine the exact number of functional proteins31, 58. Each variable exon has its own 5’ promoter, with all but one isoform (Pcdh-αc2) containing a 22-nucleotide conserved sequence element (CSE) within the promoter37, 45, 61. Deletion of the CSE significantly reduces expression of the downstream protocadherin paralog45. The presence of this CSE may explain the overlapping pattern of expression of most cluster members.

In addition to the diversity provided by paralog number, the protocadherins may further increase overall diversity by alternative splicing. In the Pcdh-α and Pcdh-γ clusters each isoform is generated by splicing the first variable exon in the transcript to the constant exons45, 61. Most protocadherins are generated by cis-splicing, although mRNAs generated by trans-splicing have been noted in 0.5-10% of transcripts45, 62. About 0.2% of transcripts are chimeric, consisting of either Pcdh-α variable exons spliced to Pcdh-γ constant exons or Pcdh-γ variable exons spliced to Pcdh-α constant exons; these are also generated by trans-splicing45. Due to their relatively low abundance, trans-spliced transcripts are not thought to significantly increase protocadherin diversity.

Alternative splicing, however, does appear to confer diversity. Pcdh-α and Pcdh-γ mRNAs have been discovered that lack the constant region exons and instead contain the variable exon and some intronic sequence; these are referred to as “O-type”, but it is unclear whether they are translated to make functional proteins31. “B-type” α-protocadherins have also been reported, which are derived from five exons instead of the normal four and contain a shorter portion of the constant region than the classical “A-type” protocadherins58. The B-type α-protocadherins are likely to be functional, as they are detected across multiple species, are translated, and bind to a different set of cytoplasmic binding partners than their A-type counterparts63, 64. The B-type isoform appears to be absent when full-length α-protocadherins are expressed in human cells, although amplification from human brain cDNA reveals the presence of both A- and B-type transcripts58, 65. These different splice variants of the protocadherins provide abundant diversity and, combined with their combinatorial expression pattern, make it clear that millions of combinations of synaptic barcodes are possible.

Figure 1: Genomic organization of protocadherin clusters in multiple species

Comparison of human, chimpanzee, rat, mouse, chicken, coelacanth, fugu and elephant shark protocadherin clusters. Paralogs are indicated by colored boxes. Orthologs are indicated by colored bars connecting the paralogs. Variable and constant region exons of the human protocadherin clusters are labeled at top. Black boxes represent relic or pseudogene sequences.

Choice of protocadherin paralog expression by individual neurons

It is not clear how promoters are chosen during neural development, and thus it is unknown how neurons destined to synapse with each other “know” to express the same subset of protocadherins. However, because the genes are clustered genomically, one does not have to envision complex interactions of transcription factors regulating each gene; rather, epigenetic changes such as DNA methylation or histone acetylation could explain their varied expression patterns. It has been demonstrated in human cell lines and in mouse cell lines and brain tissue that the active promoters and 5’ regions of expressed protocadherins are hypomethylated and their silent counterparts extensively methylated61, 66, 67, 68.

In vivo, DNA methylation occurs mosaically in non-C-type α-Pcdhs, while the C-type α-Pcdhs are constitutively unmethylated67. Demethylation of cells upregulates α-Pcdh expression and hypermethylation of α-Pcdh promoters represses their activity, consistent with a role for methylation in the regulation of at least some protocadherin paralogs67. Interestingly, cell lines subcloned from demethylated cell lines express different combinations of Pcdh-α isoforms than the parental lines, indicating that promoter choice and methylation are dynamically regulated67. The CpG present in the CSE shows differential methylation among paralogs, but its methylation state appears to be correlated with expression levels only in some isoforms, indicating that methylation at this position is not required for downregulation of expression and instead may be read by the cell only in the context of the general methylation state of the entire promoter66. So, while methylation may not be the only mechanism governing protocadherin expression, protocadherin expression is sensitive to the local methylation pattern.

Individual neurons may increase the diversity of the protocadherin repertoire by a unique method of regulation. One group of researchers has shown by single cell PCR that members of the Pcdh-α and -γ family are expressed monoallelically, with multiple members expressed in each cell stochastically41, 42. There was no obvious co-expression preference (no member was co-expressed with another member more frequently than expected), but the number of cells analyzed was small42. From their single cell co-expression data they estimated that approximately two Pcdh-α variants, one Pcdh-γA variant and one Pcdh-γB variant are expressed per cell42. Cultured neurons and adult mouse brains are known to co-express Pcdh-α and Pcdh-γ variants, but not in every cell69. However, to our knowledge, the monoallelic expression result has not been independently replicated and thus its significance remains unknown.
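To make the per-cell estimate concrete, the number of distinct repertoires a single cell could draw can be counted with binomial coefficients. This is a minimal sketch under stated assumptions: the choices of α, γA and γB variants are treated as independent, and the paralog counts passed in the example are hypothetical placeholders, not values taken from the text. It leaves out allelic choice, the Pcdh-β cluster, the c-type exons and splice variants, so it should not be read as an estimate of total diversity.

```python
from math import comb


def per_cell_combinations(n_alpha, n_gamma_a, n_gamma_b,
                          k_alpha=2, k_gamma_a=1, k_gamma_b=1):
    """Count distinct repertoires when k variants are drawn from each group.

    The defaults follow the estimate quoted above (about two alpha, one
    gamma-A and one gamma-B variant per cell). Groups are assumed to be
    chosen independently of one another.
    """
    return (comb(n_alpha, k_alpha)
            * comb(n_gamma_a, k_gamma_a)
            * comb(n_gamma_b, k_gamma_b))


# Hypothetical paralog counts, for illustration only.
example = per_cell_combinations(n_alpha=12, n_gamma_a=12, n_gamma_b=7)
print(example)
```

Passing different counts or different numbers of variants per cell shows how quickly the repertoire grows as either quantity increases.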

There are strain-specific differences in expression of the various α-Pcdh paralogs in mice that appear to be the result of strain-specific differences in the protocadherin cluster, such as polymorphisms and methylation state, and not differences in the overall genetic background66. A long-range cis-regulatory element, HS5-1, composed of a cluster of five DNase I hypersensitive sites, is required for wildtype expression in the CNS of Pcdh-α1-α12 and -αc1, but not Pcdh-αc270. A second enhancer, HS7, appears to regulate protocadherin-α expression in the sensory organs70. Both HS5-1 and HS7 contain a CSE similar to the CSE present in the protocadherin promoters, consistent with their roles as enhancers, since cis-regulatory elements and their targets often share binding sites70. Thus, monoallelic expression may be attained by competition of each promoter for the one HS5-1 element on each chromosome70.

Protocadherin protein structure and homophilic interactions

In order to fulfill their “lock and key” role the protocadherins must be capable of recognizing protocadherins, or other molecules, on opposing neurons. It is proposed that synapse adhesion is achieved via homophilic and/or heterophilic interactions of the expressed protocadherins between two neurons, such that only neurons expressing the same subset (under a strictly homophilic interaction hypothesis) or complementary subsets (allowing for heterophilic interactions) of protocadherins would maintain synaptic connections17, 31, 38. The cadherins display a strong preference for homophilic interaction, which is achieved by assembling into cis dimers that then bind in trans71, 72. Mutation of residues required for cis-dimerization abolishes E- and N-cadherin mediated cell adhesion73. Interestingly, cells expressing different amounts of the same cadherin also segregate from each other, indicating a differential adhesiveness that would grant extra diversity in the “lock and key” model74. Weaker heterophilic interactions between some cadherins have been reported in vitro75. Cadherins undergoing heterophilic interaction also form functional cis dimers which interact in trans with other cis heterodimers across synapses75.

Similarly, protocadherin homophilic and heterophilic interactions before insertion into the membrane have been reported41, 42, 45, 61, 69. The protocadherin-α and -γ proteins may form heterodimers that are required for proper localization of the complex to the cell membrane69, 76. Other evidence suggests that the protocadherins bind in a homophilic fashion in cis63. The protocadherin family in general appears to mediate only weak trans homophilic adhesion, and the adhesive capabilities of the 5q31 protocadherins are particularly questionable (reviewed in 28).

Knowledge of classic cadherin structure and how structural characteristics contribute to binding specificities may provide clues as to how protocadherin binding may be achieved. Classic cadherins contain five extracellular ectodomain (EC) repeats, while the protocadherins contain six or more. The first extracellular ectodomain repeat (EC1) is responsible for binding specificity in cadherins in trans, while EC2 has more recently been shown to participate in the cis interaction71, 72. The structure of Pcdhα EC1 has been determined by NMR and reveals that it is composed of two β sheets that are stacked face to face, similar to the classical cadherins77. The loop structures connecting the β-strands are very different between the classical cadherins and the protocadherins, and it is these structures that are known to mediate homophilic interaction in the cadherins77. Therefore, it is not surprising that the homophilic interactions so obvious between the cadherins have not been demonstrated for the protocadherins.

Standard aggregation assays have also failed to convincingly display homophilic protocadherin interactions. Transfected protocadherins are not inserted well into the plasma membrane of cultured cells43, 69, 76. In cells that do localize protocadherins to the surface, aggregation in the presence of calcium is observed, but not to the degree observed for classical cadherins29, 43, 63, 76. However, in vitro bead aggregation assays fail to demonstrate any homophilic interaction between the protocadherins76. It remains possible that what little aggregation is observed in cellular systems is not due to homophilic interactions at all, or that strong homophilic interactions are only initiated under specific conditions or in specific cell types.

One possible condition that may induce homophilic interaction is protocadherin cleavage. The γ-protocadherins are known to undergo presenilin-dependent intramembrane cleavage in vitro and in vivo78, 79, 80. Deletion of the Pcdh-γ constant or intracellular domain on transfected constructs induces a dramatic localization of the transfected extracellular domain to cell-cell junctions compared to transfected full length constructs, suggesting that intracellular cleavage is necessary for extracellular adhesion81. This finding, however, is in direct contradiction to previous experimentation demonstrating that protocadherins deleted for the cytoplasmic domain do not localize well to the cell surface, and are unable to mediate homophilic interactions69. If cleavage is necessary for proper localization, then the inability of cells to localize uncleaved products to the membrane likely explains why cell aggregation assays have failed to convincingly demonstrate homophilic interaction29, 43, 77, 82. Alternatively, the localization failure could be explained by a requirement for cis-heterodimerization. It has been shown that membrane localization is enhanced by co-expression of Pcdh-α and Pcdh-γ, as they form heteromers that more readily localize not only to the plasma membrane, but to cell-cell junctions69. In vivo, multiple combinations of Pcdh-α and Pcdh-γ co-immunoprecipitate regardless of the presence of calcium, with both the intracellular and extracellular protocadherin domains appearing to be required for this association69. The finding that some, but not all, cells co-express Pcdh-α and Pcdh-γ variants makes it seem unlikely that cis-heterodimerization is necessary in all cells for protocadherin function69.

Cadherins in synaptogenesis

To simplify the many adhesive changes that occur during synaptogenesis, synaptogenesis can be thought of as a three-stage process requiring different adhesion molecules at different times44. Indeed, multiple adhesion molecules have been shown to be involved in various aspects of synaptogenesis, including the integrins, neurexins and neuroligins, nectins, synaptic cell-adhesion molecules (SynCAMs) and EphB receptors (reviewed in 44). Initially, non-specific adhesion molecules distributed evenly along axons and dendrites might be used to promote general adhesion resulting in the formation of early synapses. Synaptogenesis occurs promiscuously both in vitro and in vivo, indicating that specific lock-and-key interactions, such as those thought to be provided by the protocadherins, are not required for this initial phase of synapse formation.

In the second stage, appropriate early synapses would stabilize by recruiting molecules necessary for the assembly of the synaptic junction. Cadherins are known to be recruited to synapses by mobile transport packets during the early stages of synaptogenesis83. Since neurons in the same general region are known to express similar repertoires of cadherins and cadherins are known to make strong homophilic intercellular connections, only synapses where both the pre- and post-synaptic neuron express the same cadherin(s) are thought to be maintained.

In the final stage, after the structural adhesion molecules have established the synaptic junction, further refinement would occur through recruitment of “molecular barcode” adhesion proteins whose more complex, specific expression patterns are capable of specifying more complex patterning44, 84. It is during this stage that the protocadherins are hypothesized to be active. N-cadherin and Pcdh-γ molecules have been shown to co-localize at synapses, demonstrating that these two adhesive systems can operate simultaneously at the same synapse84. Once all the adhesive proteins are in place, synapse maturation is thought to take place, perhaps acting by changing the distribution of the necessary adhesion molecules. This change in distribution may be necessary for changes in synapse strength as a result of activity at the synapse.

Activity at the synapse is known to affect the strength of the synapse, a phenomenon referred to as ‘activity-dependent synaptic plasticity’. The cadherin superfamily appears to be necessary for activity-based refinement, and this activity-dependence is known to operate not only during synapse maturation, but also during synapse formation. This is another area in which the protocadherins could act to modulate synapses. One of the best studied forms of activity-dependent synaptic plasticity is long-term potentiation (LTP), an experimentally induced form of synaptic plasticity where a stimulus is applied and leads to a sustained increase in synaptic strength (reviewed in 85). LTP can be used to measure both synapse formation in the mature brain, in the case of long-term synaptic plasticity, and synapse maturation during short-term synaptic plasticity. Cell-adhesion molecules appear to be involved in both processes86.

Early long-term potentiation (E-LTP), a form of short-term synaptic plasticity that acts on pre-existing synapses, is known to be dependent on several cadherins86. When antibodies or blocking peptides against the protocadherin arcadlin, N-cadherin or E-cadherin are applied to hippocampal slices, E-LTP is blocked or attenuated87, 88. It is believed this is due to the modulation of the adhesive affinity of these adhesion proteins by synaptic activity. N-cadherin localization, stability and adhesion are known to be modulated by activity at the synapse. Depolarization induces N-cadherin stabilization, homodimerization, redistribution from synaptic puncta to dispersal along the plasma membrane, and resistance to proteases, all indicative of strong adhesive forces forming in response to activity89. The protocadherins are known to be cleaved in response to neural activity, resulting in changes in cell adhesion82. It is therefore possible that the protocadherins are required for the cellular changes needed to increase synapse strength in response to neural activity.

N-cadherin also appears to be involved in the longer-lasting late-phase LTP (L-LTP), which is associated with the formation of new synapses86. N-cadherin is synthesized and recruited to new synapses during L-LTP90. Antibodies blocking N-cadherin prevent late-phase LTP altogether90. While the potential role of the protocadherins in long-term potentiation has not, to our knowledge, been explored, this remains one mechanism by which the protocadherins could be used to couple extracellular signals to intracellular changes needed to initiate the formation of new synapses.

Under this three-stage synaptogenesis hypothesis, the lock and key molecules would not have to provide strong adhesive forces, as the synaptic junction would already be held in place by molecules like the cadherins. As the protocadherin trans-cellular interactions are weak, if present at all, this point is an important one. If the primary role for the protocadherins is to act as a ‘molecular barcode’ for a neuron, the synaptic junction must necessarily be held together by stronger adhesive forces, with the protocadherins acting only to recognize their opposing target protocadherin, no matter the strength of that interaction.

Protocadherin expression and subcellular localization

Paramount to the proposed role of the protocadherins in specifying neuronal connections is their differential expression in different brain regions, neuronal classes or neuronal circuits at times that are concomitant with synaptogenesis. Protocadherin expression is complex, and a unified view of expression is complicated by the fact that different groups use different organisms, developmental stages and localization techniques, and look for expression of different protocadherin subsets. α-protocadherins are expressed primarily in the nervous system, while γ-protocadherins exhibit more ubiquitous expression38, 40, 43, 45, 78, 91, 92. In multiple species, both the α- and γ-protocadherins are localized to areas of the brain with a high degree of organization: the cerebral cortex, hippocampus, cerebellum and olfactory bulb36, 45, 65, 93. Temporally, Pcdh-γ expression is strong in mouse central and peripheral nervous systems from embryonic day 14 through birth, with weaker expression persisting into adulthood in most regions of the brain and spinal cord39, 91. Neurons most strongly express the protocadherins at the end of the first postnatal week, a time when appropriate synaptic connections are being strengthened43, 94. Chick protocadherin expression is also highest at the time when synapse maturation occurs65. The expression of the protocadherins in the brain, especially in highly organized regions, during synaptogenesis and into adulthood puts them in the right place at the right time to specify and/or maintain neuronal complexity.

The expression patterns of the cadherins clearly mark them as important determinants of the organized structures of the brain. Neurons integrated into a neuronal circuit express the same cadherin subclasses and different brain regions express different complements of cadherins (reviewed in 16, 20, 21, 24), 95. In some cases expression of the cadherins is required to delineate functional brain regions25. Expression of a transgenic cadherin in migrating neurons is sufficient to change the axonal targets of that neuron such that the axons only target to fiber tracts expressing the same cadherin26. Although individual neurons are known to express only a subset of the protocadherin paralogs, the spatial expression pattern differences are much more subtle than for the classic cadherins, with no easily distinguishable expression differences between neuronal circuits or neuronal populations22, 38, 39, 40, 41, 42, 43. In the mouse and rat, the Pcdh-α and -γ members are expressed in subpopulations of cells of the same types39, 40, 43. Their differential expression pattern supports the idea that the protocadherins are necessary to provide specific identities to otherwise identical neurons, but their distribution does not distinguish between known subclasses or circuits of neurons.

Subcellular localization of the protocadherins is consistent with a role in synaptogenesis as well. In mice and human cell lines Pcdh-α and Pcdh-γ proteins are axonal and enriched in synaptic material, but are present at only some synapses36, 38, 39, 43, 65, 84, 96, 97. Immunoelectron microscopy and subcellular fractionation techniques show protocadherins are present at synaptosomes and the postsynaptic density fraction, suggesting that they are specifically associated with synapses, though some protein remains non-synaptic39. Non-synaptic γ-protocadherins are present in tubulovesicular structures within axon terminals and dendritic branches81, 84. Consistent with a role for protocadherins in the strengthening of already established synapses, it is hypothesized that these pools are used for insertion into the plasma membrane, modifying synapses that are adhered by other adhesive molecules only if the appropriate homophilic interactions are made44, 81.

The protocadherins appear to localize to presynaptic and postsynaptic densities, the synaptic cleft, the non-synaptic plasma membrane, as well as to transport and synaptic vesicles in both axons and dendrites in culture84. Because the protocadherins are localized to transport vesicles, it is hypothesized that they are dynamically inserted and recycled from the plasma membrane and thus act to maintain appropriate synapses84.

Cultured hippocampal neurons ubiquitously express γ-protocadherins in a punctate pattern before synaptogenesis, with very little expression seen at the growth cones, indicating that protocadherins are unlikely to play a role in neurite guidance or extension84. Once mature synapses have formed, all interneurons display significant Pcdh-γ expression, while about half of pyramidal cells show low-level expression of Pcdh-γ proteins84. Although aggregation assays fail to demonstrate homophilic interactions, when both neurons of a synapse are Pcdh-γ-positive in vitro, Pcdh-γ localization is significantly more likely to be synaptic84. This may indicate that Pcdh-γ proteins are actively recruited to synapses from internal pools in order to mediate homophilic interaction between the two membranes.

Lessons on protocadherin function from model organisms

Genetic disruptions of the protocadherin cluster have bolstered the hypothesis derived from localization and expression results that the protocadherins are not required for synapse establishment, but may be critical for synapse survival or maturation. Mice deleted for the entire Pcdh-γ cluster die shortly after birth and, before their death, exhibit various motor defects, presumably due to the observed massive spinal cord interneuron apoptosis39, 45, 98. Mutant spinal cords are smaller than wildtype due to degeneration of already formed synapses subsequent to spinal interneuron apoptosis39. Other populations of neurons in the brain also exhibit neurodegeneration and apoptosis, but none as dramatically as the spinal interneurons39. It is important to note that without these protocadherins neurogenesis, neuronal migration, axon outgrowth and synapse formation take place appropriately in neurons which normally express Pcdh-γ39. Synaptogenesis also takes place in spinal neurons cultured from deleted mice, with synaptic junctions, synaptic vesicles, and presynaptic and postsynaptic densities forming abundantly and correctly, but massive degeneration and apoptosis occur after several days in culture39, 45. Neurons from the hippocampus, where protocadherin expression is high in vivo, appear unaffected by protocadherin loss both in the animal and in culture39. This result indicates that protocadherins are directly necessary in only some neurons for synapse maturation or survival, but are not required for synapse formation.

It has recently been shown that the neurodegeneration observed in knockout mice is actually an exacerbation of the normal ventral to dorsal developmental apoptosis in the spinal cord99. Using conditional knockout of the γ-Pcdhs in different interneuron populations it was found that the protocadherins promote spinal interneuron survival in a non-cell-autonomous manner99. If homophilic adhesion across interneuron synapses by protocadherins were required for the survival of all synapses, it would be expected that the protocadherins would promote neuron survival cell-autonomously, which is not the case. It is instead hypothesized that only some interneuron synapses use the protocadherins as part of their adhesive milieu, such that mutant neurons could be rescued from apoptosis if surrounded by enough normal neurons because they would be integrated into a circuit with enough synaptic activation provided by other molecules99.

Mice with conditional knockouts of the Pcdh-γ cluster are also beginning to illustrate a differential role for the protocadherins in different subsets of neurons. Protocadherins appear to be required for the survival of some spinal interneurons, interneurons in the postnatal retina, and olfactory neurons93, 99, 100. Interestingly, not all neurons of any of the aforementioned classes are completely lost in conditional knockout mice. Mice that lack the Pcdh-γ cluster in the retina and also carry a genetic loss of apoptosis do not experience a decrease in synapse density, while spinal interneurons lacking both Pcdh-γ and apoptosis show significantly decreased synapse densities98, 100. Thus, the protocadherins appear to regulate neuronal survival and synapse maturation differently in these two populations. In mice completely lacking the Pcdh-γ cluster, there are classes of neurons that highly express the protocadherins but do not experience obvious synaptic deficits39. Whether the protocadherins are truly dispensable in these neurons, or whether their phenotype is just masked by the severe phenotype caused by the degeneration of spinal interneurons, is unknown.

Mice with defects of the Pcdh-α cluster show a much more subtle phenotype than their Pcdh-γ knockout counterparts. Hypomorphic Pcdh-α mice show defects in fear-conditioning learning and working memory, indicating a role for these proteins in the hippocampus64. Mice lacking the constant region exons of the Pcdhα cluster are viable, but show defects in axonal coalescence of olfactory sensory neurons in the olfactory bulb. Groups of neurons that should have been eliminated appear to escape apoptosis, perhaps because they are unable to form appropriate synapses; in the olfactory bulb such neurons are erroneously maintained, rather than erroneously eliminated as is the case for spinal interneurons93. This finding also supports the hypothesis that protocadherins are necessary for synapse maintenance once synaptogenesis has linked together two appropriate cells. Whereas in spinal interneurons the lack of appropriate protocadherin interaction induces apoptosis, olfactory sensory neurons with inappropriate synapses appear to be unable to induce the normal program of apoptosis without the protocadherins. More extensive study into the effect of Pcdh-α loss in mice on other classes of neurons has not been published.

In contrast to the Pcdhγ deleted mice where neuronal apoptosis was observed after synaptogenesis, zebrafish unable to express full length Pcdh1α proteins experience early neuronal apoptosis79. This is also in contrast to the finding that mice deleted for the constant Pcdhα exons are viable without any grossly visible neuronal apoptosis. This may indicate both that the protocadherin α and γ clusters have distinct functions from each other and that these clusters may have evolved to function differently in different species.

The sum of these data from model organisms suggests that polymorphisms in the protocadherins might be attractive candidates for association with psychological disorders. Several polymorphisms which would be predicted to eliminate paralog function or significantly alter expression have been identified, but no striking associations have been reported4, 101. The human 5q31 region is a possible susceptibility locus for schizophrenia102. A deletion eliminating Pcdh-α8, -α9, and -α10 is present at 11% frequency in Europeans, but has been shown not to be significantly associated with autism, schizophrenia or bipolar disorder4, 103. There may be an association between homozygosity for a rare SNP in a putative Pcdh-α enhancer and bipolar disorder104. However, it is possible that, due to the incredible diversity of the protocadherins, loss of a handful of paralogs is effectively neutral. Some studies in mice indicate that the phenotypes that result from perturbations of protocadherin paralogs are subtle. The BLG2 and BFM/2 wild mouse strains both have frameshift mutations in Pcdh-α9, which, by nonsense mediated decay, result in lower expression levels than in other mouse strains analyzed66. Both strains are hypoactive, and like the hypomorphic Pcdh-α mutants, BLG2 mice exhibit impairments in avoidance learning64, 105. It is possible that the lack of functional -α9 expression is related to their behavioral deficits. Human Pcdh-α10 is the ortholog of mouse Pcdh-α9. Thus, it remains unclear whether the loss of Pcdh-α8, -α9, and -α10 in the common deletion has no effect, or only subtle effects, on behavior in humans.

Protocadherin binding partners

Due to the substantial differences in intracellular domains between the Pcdhα and Pcdhγ proteins, cytoplasmic binding partners are hypothesized to differ between the two families31. The α-protocadherins were originally identified as binding partners of Fyn, a member of the Src family of protein tyrosine kinases38. Fyn knockout mice show defects in hippocampal and olfactory bulb development, suggesting a role for Fyn in brain patterning106, 107. The structural defects are manifested as behavioral deficits in spatial learning, suckling and fear response106, 107, 108. These behavioral deficiencies are remarkably similar to the phenotype of hypomorphic Pcdh-α mice, as would be expected if the two proteins were part of the same signaling cascade64.

One possible link between Fyn and the protocadherins is reelin. The mouse α-protocadherins, but not the Pcdh-γ cluster, Pcdh-αc1 or -αc2, may be receptors for reelin, a protein involved in organization of the brain cortical layer109. α-protocadherins are expressed adjacent to reelin-positive motoneurons, and antibodies that inhibit the interaction of Pcdhα with reelin inhibit aggregation of Pcdhα cells92, 109. mDAB1 is known to be the downstream effector of reelin, and is also capable of binding to Fyn110.

Antibodies against the putative reelin binding domain of the protocadherins block signaling through mDAB1 and disrupt cortical neuron arrangement109. It has been hypothesized that the extracellular domain of the protocadherins interacts with reelin, which activates Fyn kinase, which activates mDAB1, which can activate p35-CDK5 kinase, which associates with N-cadherin, preventing its association with β-catenin, thereby suppressing adhesion111. How homophilic protocadherin interaction and the known role of protocadherins in synapse maturation would fit into this signaling cascade has not been outlined. However, this interaction is of questionable biological relevance, as the Pcdhα/reelin interaction is not always reproducible53, 65, 112. So, while the interaction of reelin with the protocadherins, one potential signaling link between the Pcdhs and Fyn, is questionable, the interaction of Fyn with the Pcdhs has never, to our knowledge, been challenged. Further research into how these two proteins might be integrated into a single signaling cascade has not been published53, 65, 111, 112.

Analysis of protocadherin cytoplasmic binding partners strongly suggests that the protocadherins couple extracellular signals to intracellular functions necessary for axonal function via the cytoskeleton. The cytoplasmic tail of the protocadherin-α A-type molecules interacts with neurofilament M (NFM), while the hypothetical protocadherin-α B-type molecules can interact with fascin63. Neurofilament M is a component of the major filament in mature neurons113. These filaments play a structural and scaffolding role and are known to complex with N-methyl-D-aspartate (NMDA) receptors, which regulate synaptic plasticity and learning and memory in the hippocampus113. The learning deficits observed in mice lacking the Pcdh-α cytoplasmic domain may therefore be explained by the perturbation of neurofilament interaction with NMDA receptors.

Although protocadherins are not known to be required for this interaction, it is possible that complex signaling events involving Pcdh-α, NFM and NMDA receptors are affected by the loss of the protocadherin cytoplasmic domain.

Fascin is an actin-bundling protein and is required for some forms of actin-based cell movement114. When transfected into HEK293 cells, the Pcdh-α constant region co-localizes with actin65. It is known that synaptic activity causes conformational changes in N-cadherin that lead to changes in cell adhesion89. It is possible that the protocadherins are also affected by synaptic activity and affect adhesion through their ties to the actin cytoskeleton. However, since the finding of the interaction between Pcdh-α and NFM and fascin, no further research has been published to explain how this interaction contributes to protocadherin function.

Investigation into protocadherin function

My project has focused on trying to glean information about protocadherin function by exploring two of the known features of protocadherin biology: the hypothesis that they are subject to frequent gene conversion events, and the observation that fragments of the protocadherin protein generated by presenilin-dependent intramembrane proteolysis localize to the nucleus49, 50, 51, 60, 78, 79, 80, 101, 115. I demonstrated by direct sequencing that gene conversion events are evident within the protocadherin cluster and that these events change functional elements of the protocadherins, which may contribute to protocadherin diversity. I also show that the nuclear localization of protocadherin cleavage fragments is a common feature of the Pcdh-α and Pcdh-γ clusters. I found no evidence of DNA binding of these fragments, but was unable to determine whether this was the result of an actual lack of binding in the biological framework, or the result of a failure of our system to detect binding.

REST/NRSF and neuronal fate determination and maintenance

The central nervous system relies on the cell-specific expression of a set of genes that are expressed only in neurons. These genes are necessary for the specialized functions of the neuron and are regulated by activator and repressor systems that ensure the timing and specificity of their expression116. One of the best characterized members of the repressive systems is the RE1-silencing transcription factor (REST), also called neuron-restrictive silencer factor (NRSF). REST/NRSF is a member of the GLI-Krüppel family of proteins that was originally identified as the “master negative regulator of the neuronal phenotype”116, 117. As a master regulator its sole function was believed to be as a repressor of neuron-specific genes in nonneuronal cells. While this is certainly part of REST/NRSF's function, it is not the whole story. REST/NRSF is also expressed in neural progenitor cells, neurons and neuronal cell lines118, 119, 120. Several genes known to be regulated by REST/NRSF are not neuron-specific. In addition, REST/NRSF has been shown to function as a repressor in neurons, mediates transcriptional responses important for neural plasticity and is important in controlling the spatial and temporal expression of neuronal genes in neural progenitors121, 122, 123.

REST/NRSF displays a complex system of repression that is sensitive to developmental stage, cell type, co-factors and concentration. REST/NRSF represses transcription in a location- and orientation-independent manner by binding to its highly conserved 21 to 23-base pair binding site (RE1 or NRSE)124, 125. REST/NRSF is composed of independent repressor domains at both the N and C termini, with its DNA binding domain contained within a cluster of eight zinc fingers117, 125, 126. Each repressor domain interacts with different repressive co-factors including CoREST, N-CoR, mSin3A and SCPs119, 127, 128, 129, 130. These co-repressors recruit histone deacetylases to mediate repression on the chromatin level, although REST/NRSF is capable of repressing reporter genes without native chromatin structure117, 129. It is unknown why REST/NRSF would need two functional repressor domains, since both domains are required for full repression, but each domain is also sufficient for repression126. The N-terminal domain interacts with the mSin3A/HDAC1 repressor complex and may function primarily to mediate transient repression129, 130. The C-terminal repression domain interacts with CoREST, which appears to be involved in more stable repression129. In differentiated non-neuronal cells CoREST recruits additional silencing machinery including MeCP2 and histone methyltransferases that silence across a larger chromosomal interval, allowing repression of genes without RE1/NRSE sites127, 131, 132, 133.

REST/NRSF displays differential regulation of target genes depending on the developmental stage, and its expression level in differentiating cells appears to be a major determinant of differentiation. In non-neuronal tissues repression is maintained by histone methyltransferases119. Histone methylation is present at binding sites in embryonic stem cells, but these binding sites are more reliant on histone deacetylation for repression119. The chromatin state of ES cells allows basal transcription of target genes and apparently allows the binding sites to be quickly modified for activation during the differentiation cascade119.

Upon transition from ES cells to neural progenitors REST/NRSF is downregulated posttranslationally, which allows further differentiation to proceed rapidly as REST/NRSF is quickly released from its binding sites without having to wait for transcriptional repression to take place119. Of course not all target genes respond identically to the reduction of REST/NRSF, and there appears to be variation in the affinity of binding sites for the repressor134. Thus, the posttranslational downregulation of REST/NRSF in progenitors might maintain repression at genes with high REST/NRSF affinity while allowing activation of other genes with lower affinities119.

Small noncoding RNAs that match the RE1/NRSE are hypothesized to further help activate REST/NRSF-regulated target genes in neural progenitor cells by binding to REST/NRSF and preventing association with co-repressor proteins135. The expression of RE1/NRSE dsRNAs is sufficient to activate REST/NRSF target genes and induce differentiation135. The induction of RE1/NRSE dsRNA expression may therefore act in conjunction with the posttranslational modification of REST/NRSF to push neural stem cells into differentiation136.

Once progenitors have differentiated into mature neurons the REST/NRSF repression complex has been released from binding sites of neuronal genes and REST/NRSF becomes transcriptionally repressed, allowing the continued expression of some REST/NRSF target genes119. Consistent with a dual role for REST/NRSF repression domains, REST/NRSF targets are repressed via two different mechanisms. Some targets, like the archetypal REST/NRSF target sodium type II channel genes, are repressed independently of histone deacetylation or DNA methylation, while others are maintained by chromatin silencing not dependent on continual occupation by REST/NRSF137. The continual occupation of some promoters by CoREST and MeCP2 at methylated sites adjacent to NRSE/RE1 sites causes additional repression. CoREST itself functions as a repressor and upon differentiation CoREST and MeCP2 remain associated at NRSE/RE1 adjacent sites to maintain repression119, 131. With specific stimulation, such as membrane depolarization, MeCP2, but not CoREST, is liberated119. It is thought that the persistence of MeCP2 and CoREST after REST/NRSF release may contribute to neuronal plasticity by poising a subset of promoters for repressor complex assembly and disassembly119.

REST/NRSF is also known to repress a family of microRNA genes, but this repression is lifted by the time differentiation into mature neurons has occurred138. These microRNAs posttranscriptionally downregulate non-neuronal transcripts in mature neurons, which may help the neurons preferentially express neuron-specific gene sets, which may be required for the maintenance of the neuronal phenotype138.

Because REST/NRSF is a transcription factor that helps coordinate the complexity of tissue-specific gene expression programs, dysregulation of REST/NRSF or its repression complex partners causes dramatic and widespread phenotypic changes. REST/NRSF is implicated in the pathogenesis of multiple diseases including Down syndrome, X-linked mental retardation, cardiomyopathy, cancer and Huntington’s disease139, 140, 141, 142. In Down syndrome REST/NRSF expression is dramatically altered, with reduced levels seen in embryonic neurons and increased levels in adult neurons; both perturbations are apparently due to the presence of trisomic levels of DYRK1A, which deregulates the stoichiometry of the REST/NRSF repression complex139, 143. This downregulation of REST/NRSF results in dysregulation of its target genes and appears to significantly contribute to the neural changes that characterize Down syndrome143.

REST/NRSF, in addition to regulating neuronal genes, also regulates cardiac fetal gene expression and appears to be required for the maintenance of normal cardiac structure and function141. When hearts are stressed, REST/NRSF appears to leave its targets, leading to an aberrant reactivation of the fetal cardiac gene expression program which can lead to cardiac dysfunction and heart failure141. REST/NRSF interacts with huntingtin, inhibiting REST/NRSF function through cytoplasmic sequestering and allowing target transcription in neurons142. Interestingly, mutant huntingtin is unable to sequester REST/NRSF and may contribute to the pathology of Huntington’s disease through the loss of transcription of neuronal genes in neurons142.

The identification of the entire repression network controlled by REST/NRSF would not only illuminate the biology of this interesting transcription factor, but may lead to the identification of target genes for the treatment of various cancers and disturbed neural phenotypes. Initially, some REST/NRSF targets were identified based on the presence of RE1/NRSE sites in candidate neuronal genes followed by validation of binding and/or repression capabilities122, 144, 145, 146, 147, 148, 149, 150. Subsequently, more than 1000 putative targets of REST/NRSF have been predicted by bioinformatic analyses124, 127, 134. Many of these targets had obvious neuronal functions: neuronal receptors, synaptic vesicle proteins, adhesion molecules, ion channels, neurotransmitter receptors134. However, these bioinformatic analyses were limited by the use of a strict consensus RE1/NRSE binding site sequence which may not be biologically relevant.


My project aimed to better predict the target sites for REST/NRSF by utilizing a more sophisticated binding site prediction model and using experimental validation to set biologically relevant threshold scores for inclusion of putative binding sites. To do this we collaborated with Ali Mortazavi to use and test his binding site prediction software Cistematic. We were able to demonstrate that binding site prediction using position-specific frequency matrices based on known positive binding sites is effective. Not only does this approach integrate more binding site information, but it outputs a score for each predicted site, allowing the determination of a threshold of biological relevance for inclusion. This approach allowed us to identify 660 likely target genes for REST/NRSF, many of them novel. We demonstrate that the Cistematic software performs better than previously published consensus site model predictions by capturing more true positives and fewer false positives. We uncovered possible microRNA feedback loops relevant to REST/NRSF regulation and target gene expression and corroborated evidence of REST/NRSF function in brain, pancreas and cardiac gene expression programs.
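As a rough illustration of the scoring idea described above, the following Python sketch builds a log-odds position-specific matrix from a few aligned example sites and scans a sequence for windows that score at or above a threshold. It is not Cistematic and does not reproduce its interface. The function names, the pseudocount, the uniform background, the toy four-base sites and the threshold value are all assumptions chosen only for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds matrix from aligned, equal-length example sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(pwm, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy data only: these are not real RE1/NRSE sequences.
toy_sites = ["ACGT", "ACGA", "ACGT"]
toy_pwm = build_pwm(toy_sites)
toy_hits = scan(toy_pwm, "TTACGTTT", threshold=3.0)
```

Because every window receives a numeric score, the threshold can be tuned against experimentally validated sites instead of being fixed by a strict consensus match, which is the property the paragraph above emphasizes.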

Maltase-glucoamylase polymorphism and starch digestion

Carbohydrates are found in a variety of foods that form the human diet and comprise a major part of our energy source151, 152. With the exception of lactose and some glycogen, our carbohydrates are derived primarily from plant sources151. These digestible plant carbohydrates exist as sucrose, a relatively easily digested disaccharide, and starch, complex glucose polymers found in plant semicrystalline storage structures which must undergo extensive modification and digestion before they can be assimilated as glucose152. The salivary and pancreatic α-amylases hydrolyze starches to short, soluble glucose oligomers but produce little free glucose152. The short oligomers are not absorbable into the bloodstream and require further hydrolysis. Two small intestinal mucosal α-glucosidases, sucrase-isomaltase (SI) and maltase-glucoamylase (MGAM), are required to produce large amounts of free glucose from the oligomers produced by α-amylases152, 153.

MGAM and SI are anchored to the brush border epithelial cells of the small intestine153. Their substrates overlap somewhat and the activities of both are primed by the action of the α-amylases152, 153. SI is responsible for approximately 80% of maltase activity, all sucrase activity and nearly all isomaltase activity while MGAM accounts for the remaining 20% of maltase activity, a small amount of isomaltase activity and all glucoamylase activity154. It has been hypothesized that MGAM also exists as an alternate pathway for starch digestion when the α-amylases are inhibited due to immaturity and malnutrition154. While it was long thought that MGAM could only act on oligomers released by the α-amylases, it has been demonstrated that MGAM can act on native starch granules as well, though not as efficiently153. So, while the α-amylases increase digestion rate, they are not expressly required for the digestion of all starches.

MGAM activity is highest under low oligomer concentrations, such as those at the beginning of a meal, but then becomes inhibited at higher concentrations152. SI activity is not affected by oligomer concentration, but operates at approximately 5% of the activity of uninhibited MGAM152. There is a relatively small proportion of MGAM molecules, less than 1/20th the number of SI molecules in the intestinal mucosa, even though MGAM is more effective at releasing glucose from oligomers152. This coordinated system of action, whereby a low number of MGAM molecules display high activity under low substrate conditions while a large number of SI molecules display low activity throughout digestion, is thought to provide a steady increase in available glucose while preventing dangerously rapid increases in blood glucose concentrations152.

The study of sugar metabolism is paramount to many aspects of nutrition and disease intervention. Sugar metabolism is important in human infant brain development, diabetes, cardiovascular disease and obesity. The brain depends almost exclusively on glucose as an energy source and its energy consumption, relative to its size, is much larger than that of any other organ155. Furthermore, the proportion of total body energy use devoted to the brain is much higher in humans than in any other species assayed155. At birth glucose utilization is lower than that seen in young adults but climbs dramatically until age four, when utilization rates are more than twofold higher than in adults156. This increased utilization is maintained through age 10, after which there is a gradual decline in utilization until adulthood156. The areas of the brain with the highest degrees of glucose metabolism change rapidly over the first year of life and correspond with the emergence of various behaviors such as spatial and sensorimotor integration, extinguishment of reflex behaviors, development of stranger anxiety and delayed response performance156.

Thus, brain function during childhood depends intimately on the digestion of food starches in order to produce the high levels of glucose needed for brain development. It is believed that the oxidative needs of children’s brains can only be met by the use of the highly active MGAM.

Contemporary diets are rich in energy-dense carbohydrates and may contribute to the adult degenerative diseases diabetes, cardiovascular disease and obesity157. The α-glucosidases are responsible for producing about two thirds of the blood’s glucose concentration157. In fact, α-glucosidase inhibitors are one class of drugs used to help treat type 2 diabetes. It is likely that the substrate reactive nature of MGAM evolved in the context of low carbohydrate diets where it would be active constitutively during a low carbohydrate meal, quickly releasing what little glucose was present. Now that our diets contain a much higher concentration of starches, MGAM becomes deactivated much more quickly than it would under a low carbohydrate diet. Thus, SI may act as a constraint for glucose-associated degenerative diseases associated with high starch diets by limiting the amount of free glucose available, especially once MGAM activity has dropped152.

Deficiency of MGAM and/or SI has been reported and causes a malabsorption disorder characterized by chronic diarrhea in children that responds well to dietary carbohydrate elimination158, 159, 160. Increases in copy number of the salivary amylase gene (AMY1) are more common in populations with high-starch diets161. The activity of MGAM does not seem to categorically differ between populations, but polymorphisms in this gene that are not disease-causing are not well explored152. We became interested in MGAM when a large, common deletion was identified as part of an effort by Jun Li and Devin Absher at the Stanford Center to screen for patterns of nucleotide polymorphism variation in almost 1000 diverse individuals162 (data not published). The deletion removed approximately 30 kb of the largest MGAM intron and showed the highest prevalence, approximately 5%, in European populations. Analogous to the finding that amylase copy number is correlated with starch ingestion, we hypothesized that the MGAM intron deletion downregulated MGAM expression, and that this loss of MGAM function was only tolerated in populations in which starch ingestion was high. In those populations with low dietary starch a loss of MGAM would be predicted to significantly decrease the production of glucose, while those with high dietary starch intakes may not need as many functioning MGAM molecules, and in fact are likely to shut them down quickly after the start of a meal. We mapped the endpoints of the deletion, ascertained the mechanism of the deletion, and demonstrated that the deletion had no effect on RNA structure or expression levels. We therefore concluded that this deletion is likely effectively neutral.


Chapter 2: Identification of Coding Polymorphisms in the Protocadherin Cluster

Gene conversion is a dual functioning process of non-reciprocal recombination that can act to homogenize genes or promote diversity163. Double stranded breaks initiate gene conversion events and strand invasion by a homologous sequence repairs the break. Strand invasion can occur with sequence from the same locus (allelic gene conversion) or with sequences at other loci on the same or different chromosomes (ectopic gene conversion). Intrachromosomal conversion events are more common than interchromosomal events, and the majority of intrachromosomal events are between genes less than 10 kb away from each other163. Gene conversion is more likely to occur between sequences with high similarity, and the length of the gene conversion event is very strongly correlated with the percent sequence identity163. Since gene conversion is more common between similar, closely linked genes it is not surprising that conversion is known to occur in clustered genes such as β-globin and the olfactory genes163, 164, 165. Indeed, gene conversion events appear to be common in the protocadherin clusters of human, mouse, rat, chicken, fugu and zebrafish as well49, 50, 51, 60, 101, 115. These events occur only in coding sequences and are restricted to particular ectodomains. Hypotheses about the function of the protocadherins have been put forth based on the patterns of presumed gene conversion events alone.

The hypothesis that protocadherin paralogs are subject to gene conversion is supported by several lines of evolutionary evidence: orthology relationships, distribution of diverse sequences, and the pattern of neutral diversity and third position GC content.


First, orthology between protocadherin genes in different species cannot always be assigned. The human, chimpanzee, rat, and mouse protocadherins display obvious orthologous relationships, while the chicken, zebrafish and coelacanth protocadherins do not47, 48, 49, 50, 52, 59, 60. For example, the mouse and rat Pcdhα1s are more similar to each other than to human Pcdhα1, and all three Pcdhα1s are more similar to each other than they are to any other Pcdhα paralog49, 50, 52. In contrast, the Pcdhα1 allele in chicken, zebrafish or coelacanth is more similar to all other Pcdhα paralogs in their own species than they are to Pcdhα1s across species47, 48, 60. Analysis of the variable region exons reveals that there are long stretches of DNA that are nearly identical among paralogs of the same species49. Gene conversion could explain this phenomenon where conversion events would “copy” identical pieces of DNA onto paralogs within a species. Gene conversion, however, is not the only satisfactory explanation for this observation. Loss of orthology could also be due to rapid duplication and loss of protocadherin paralogs followed by rapid divergence of duplicated paralogs.

Second, the amount of diversity present between paralogs varies widely by ectodomain. Phylogenetic analysis of the relationship between paralog ectodomains of different species reveals which ectodomains are apparently subject to homogenization and which maintain diversity. With rapid diversification and paralog duplication and loss, one would expect that orthologous relationships would be lost among highly diversified ectodomains while more homogenized ectodomains, due to conservation, would maintain orthology. However, human, mouse and rat Pcdhα EC2 and EC3 display orthologous relationships between the species even though they are on the whole more diverse, while EC5 and EC6 are homogenized but lose their orthologous relationships and display only paralogous relationships between species58. The third ectodomain in each protocadherin subgroup (-α, -β, -γ) in zebrafish, human, mouse and rat is never homogenized, providing much of the signal that distinguishes orthologs between species49. This pattern is more consistent with gene conversion where species-specific homogenization events erase the signature of orthology and make paralogs more identical to each other than to their orthologs.

As predicted by gene conversion, the homogenization of ectodomains is lineage-specific, such that paralogs in the same species have almost identical ectodomains that are diverged from the independently homogenized ectodomains of another species49.

While analysis of full-length sequences of human, mouse and rat protocadherins easily reveals their orthology, analysis of some individual ectodomains in these same species reveals a complete breakdown of orthology. Again, this apparent homogenization within species does not appear to be the result of functional constraint on the protein sequences as regions of homology are not conserved between species49.

Finally, different ectodomains show different patterns of neutral diversity, with reduced neutral diversity occurring in more homogenized domains (Figure 2)49. This pattern is consistent with gene conversion, but not functional constraint, as synonymous sites should be free to mutate without constraint. Species-specific homogenization also cannot be due to residual similarity between recently duplicated paralogs, as the regions of homology are clustered at the 3’ ends, not over their entire length, as would be expected49. Gene conversion events are known to cluster at the 3’ end of genes163. Gene conversion events are also known, by unknown mechanisms, to lead to increased GC content at third codon positions166, 167, 168. This GC-content bias is only apparent if conversion is very frequent168. Across subgroups and species, those ectodomains which show decreased neutral diversity also have increased third-position GC content, indicating the homogenization is likely due to gene conversion49. (Figure 2)
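To make the third-position GC content (GC3) measure concrete, here is a minimal Python sketch that computes it for an in-frame coding sequence. The function name and the toy sequence are assumptions made for illustration, and this is not the analysis pipeline used in the cited work. It assumes the sequence starts at the first base of a codon and contains only A, C, G and T.

```python
def gc3_content(coding_sequence):
    """Fraction of third codon positions that are G or C.

    Assumes the sequence is in frame, starting at codon position 1.
    Trailing bases that do not complete a codon are ignored.
    """
    sequence = coding_sequence.upper()
    # Index 3k + 2 exists only when codon k is complete, so this slice
    # picks exactly the third base of every complete codon.
    third_positions = sequence[2::3]
    if not third_positions:
        raise ValueError("sequence contains no complete codon")
    gc_count = sum(base in "GC" for base in third_positions)
    return gc_count / len(third_positions)


# Toy sequence: codons ATG, GCC, AAA have third bases G, C, A.
toy_gc3 = gc3_content("ATGGCCAAA")
```

Computing this value per ectodomain and comparing it with the synonymous substitution rate (dS) for the same ectodomain gives the kind of paired measurements that Figure 2 summarizes.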

What do gene conversion events tell us about protocadherin function? Gene conversion events may be necessary for homogenizing parts of the protein that have species-specific, but identical, functions among paralogs49. EC6 may be necessary for adhesion, and thus must be identical between molecules, while EC3 may be necessary for homophilic interactions, and must be diverse in order to provide specificity. EC2 shows significant diversity as well and is known to participate in cis-interactions in classical cadherins40, 72. Almost all of the positively selected codons in the protocadherins are contained on the surface of EC2 and EC3, indicating that these diverse surface residues may take part in cell-cell interactions50. The first cadherin ectodomain is unusually well conserved among Pcdhα family members36, 77. The EC1 domain is required for homophilic interactions in the classical cadherins and may therefore be involved in other aspects of binding in the protocadherins71.

All of the evidence supporting the role of gene conversion in the protocadherin cluster came from evolutionary and sequence analysis that revealed signatures of gene conversion. We wanted to prove that gene conversion events occurred in the cluster by identifying obvious conversion events. We reasoned that sequencing of multiple individuals would reveal conversion events that are polymorphic in our set of samples.

We could visualize these conversion events by identifying single nucleotide polymorphisms (SNPs) or clusters of SNPs among otherwise identical stretches of DNA that could be attributable to strand invasion by another paralog (Figure 3). Since strand invasion is known to occur most often between more similar paralogs, we hypothesized that most conversion events would appear as single mutations163. Gene conversion tracts are known to stop once local sequence similarity drops below a certain threshold, so we believed that gene conversion events making many changes would be unlikely to occur169. We would not be able to say with certainty that these SNPs were attributable to gene conversion events and not just mutation, but an overabundance of such events would be indicative of gene conversion. In order to maximize the number of SNPs between samples we relied on the known haplotype structure of the protocadherin α cluster.
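The detection logic described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure actually used in this work: the three sequences are assumed to be pre-aligned and of equal length, the function names and the `max_gap` parameter are hypothetical, and the toy sequences are invented.

```python
def candidate_conversion_sites(reference, sample, donor):
    """Positions where the sample differs from the reference paralog
    but matches the putative donor paralog at the same aligned position.

    Assumes all three sequences are pre-aligned and of equal length.
    """
    if not (len(reference) == len(sample) == len(donor)):
        raise ValueError("sequences must be aligned to the same length")
    return [
        index
        for index, (ref, obs, don) in enumerate(zip(reference, sample, donor))
        if obs != ref and obs == don
    ]


def cluster_sites(positions, max_gap=50):
    """Group sorted positions into clusters whose neighbours are at most
    max_gap bases apart (max_gap is an illustrative assumption)."""
    clusters = []
    for position in positions:
        if clusters and position - clusters[-1][-1] <= max_gap:
            clusters[-1].append(position)
        else:
            clusters.append([position])
    return clusters


# Toy aligned sequences, invented for illustration.
toy_reference = "AAAAAAAAAA"   # paralog 1, individual without the event
toy_donor = "AAGAATAAAC"       # paralog 2
toy_sample = "AAGAATAAAA"      # paralog 1, individual with the event
toy_sites = candidate_conversion_sites(toy_reference, toy_sample, toy_donor)
toy_clusters = cluster_sites(toy_sites, max_gap=5)
```

A match to the donor at a variant position is only suggestive, since an independent point mutation can produce the same base. That is why the text treats an overabundance of such sites, and not any single site, as the evidence for gene conversion.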

In humans the protocadherin α cluster is in extensive linkage disequilibrium (LD), with many promoter polymorphisms, and one large deletion identified separating haplotypes4, 101. Two 48-kb haplotypes extending from Pcdh-α1-α7 are present at about equal frequency worldwide4. The two haplotypes differ at every common Pcdh-α promoter SNP and the SNPs comprising the haplotypes are in significant linkage disequilibrium4. The two major haplotypes are very old and are found in every human population group assayed but are missing in chimpanzees, indicating that they emerged after the human-chimpanzee divergence4. The binary haplotype structure must have emerged very early as all minor haplotypes appear to be derived from one of the two major haplotypes4. Balancing selection appears to have acted in the past to maintain the two haplotype structures, perhaps as a means to maintain protocadherin diversity4.

Linkage disequilibrium extends at least through Pcdh-α11, but is not as significant; recombination events 3’ to α7 are obvious in all populations studied4. LD is not a common feature of the protocadherins, since the Pcdh-β and -γ clusters show no apparent linkage disequilibrium4.
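For readers less familiar with how linkage disequilibrium is quantified, the following Python sketch computes the common r² statistic between two biallelic SNPs from phased haplotypes. The cited study may have used a different measure or software; the function name, the 0/1 allele coding and the toy haplotypes are assumptions for illustration only.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples,
    with alleles coded 0 or 1 (an illustrative convention).
    """
    count = len(haplotypes)
    if count == 0:
        raise ValueError("no haplotypes supplied")
    freq_a = sum(a for a, _ in haplotypes) / count
    freq_b = sum(b for _, b in haplotypes) / count
    freq_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / count
    # D is the deviation of the observed two-locus haplotype frequency
    # from the product expected under independence.
    d = freq_ab - freq_a * freq_b
    denominator = freq_a * (1 - freq_a) * freq_b * (1 - freq_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denominator


# Toy data: two SNPs whose alleles always travel together,
# and two SNPs whose alleles are independent.
toy_perfect_ld = ld_r_squared([(0, 0), (0, 0), (1, 1), (1, 1)])
toy_no_ld = ld_r_squared([(0, 0), (0, 1), (1, 0), (1, 1)])
```

Two haplotypes that differ at every common SNP, as described for the Pcdh-α1 to -α7 region, correspond to the first toy case, while a cluster with no apparent LD would give values near the second.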

No coding polymorphisms had been identified that were known to track with the haplotypes, but because of the age of the haplotypes we believed that many coding polymorphisms were likely to exist between the haplotypes. Because our gene conversion identification strategy relies on polymorphisms, we believed that sequencing only individuals known to be homozygous for one of the two major haplotypes would maximize the number of gene conversion events we would be able to detect while simultaneously contributing to the body of research on the protocadherin haplotype structure by identifying coding polymorphisms between them.


Figure 2: Paralog sequence diversity is negatively correlated with third-position GC content

From Noonan, J.P., Grimwood, J., Schmutz, J., Dickson, M. & Myers, R.M. Gene conversion and the evolution of protocadherin gene cluster diversity. Genome Res 14, 354-66 (2004).

Neutral paralog sequence diversity and third-position GC content for ectodomains 1-6 and the cytoplasmic domain from various species and protocadherin paralog subgroups. (A) Human, mouse, and rat Pcdhα. (B) Human, mouse, and rat Pcdhβ. (C) Human, mouse, and rat PcdhγA. (D) Human, mouse and rat PcdhγB. (E) Zebrafish Pcdh1γ. (F) Zebrafish Pcdh1α and Pcdh2α. Neutral paralog sequence diversity, shown as the number of synonymous substitutions per site (dS), is negatively correlated with third-position GC content (GC3), indicative of gene conversion.


Figure 3: Hypothetical gene conversion event visualization


a) Two closely related paralogs differ at four locations (differences indicated by #). b) Strand invasion between the two paralogs occurs. c) Resolution of the strand mismatches, in this case, results in gene conversion of paralog 1 to paralog 1* (containing paralog 2 sequence), but no change in paralog 2. d) Sequencing of paralog 1 in an individual without the conversion event (A) and one with the conversion event (B) would reveal a cluster of SNPs that are attributable to conversion with paralog 2.

Protocadherin-α sequencing reveals polymorphism and conversion

Before beginning our search for gene conversion events we wanted to further define the protocadherin-α haplotypes. Because the haplotypes had been identified using promoter polymorphisms only, we aimed to determine whether the two common haplotypes had accumulated mutations in their coding regions and, if so, to further define sub-haplotypes based on coding sequence polymorphisms. To simplify the phasing of identified polymorphisms, we sequenced only individuals known to be homozygous for one of the two major haplotypes. This strategy would also help us to find gene conversion events: the two haplotypes do not recombine and are thus predicted to have accumulated different common SNPs, so gene conversion events that occurred on only one of the haplotypes would be more easily visible4.

Initially, we sequenced select ectodomains in the protocadherin-α cluster, focusing on ectodomains 2-4 due to their higher levels of paralog sequence diversity, in four Caucasian individuals homozygous for one of the two major haplotypes, two individuals for each haplotype.

Ectodomains 2-4 were sequenced in these four individuals in all 13 Pcdhα paralogs. We identified twelve possible SNPs as well as a cluster of seven SNPs. These nineteen SNPs were distributed among only four protocadherin paralogs. Because we only used four individuals, and one sample failed in all sequencing reactions, we were unsure of the validity of some of our SNPs due to low sequence quality in some individuals in some regions.

In order to verify our SNPs and determine their worldwide distribution we expanded our sample set to 86 individuals and reduced our sequencing to just the four paralogs in which we identified polymorphisms during our first round of sequencing: Pcdhα2, Pcdhα3, Pcdhα4, and Pcdhα10. Again, we limited our sequencing to the diverse ectodomains 2-4. SNPs identified in Pcdhα2-4 fall into the “core” haplotype region of the Pcdhα haplotype blocks and are therefore important for identifying coding SNPs that track with the haplotypes and those that sub-define haplotypes. We again sequenced only individuals known to be homozygous for one of the two major Pcdhα haplotypes 1, 2, 3.

We verified seven of our original twelve SNPs, as well as the cluster of seven SNPs, and identified six more high quality SNPs in our expanded sample.

Surprisingly, of the thirteen SNPs identified in this round of sequencing, nine were non-synonymous (Table 1). Six SNPs track perfectly with the previously established promoter haplotypes, three subdefine the haplotypes, and four are singletons present in only one of the individuals sequenced. We also found one individual, previously identified as homozygous for haplotype 1, who has apparently undergone recombination with haplotype 2 and is heterozygous for many of the SNPs that otherwise track with the haplotypes.

The series of seven polymorphisms occurred in the Pcdhα4 ectodomain 1 coding sequence and tracked exclusively with Pcdhα haplotype 2. Hypothesizing that this series of SNPs could be the result of a gene conversion event, we performed a BlastN search comparing the Pcdhα4 variant sequence to a database of human alpha protocadherin sequences derived primarily from the reference sequence, which happens to be a haplotype 1 homozygote. Indeed, the Pcdhα4 variant sequence appears to be derived from the first ectodomain of Pcdhα8 (Figure 4). The polymorphisms are present in a region that is otherwise identical between the two paralogs and clearly seem to be the result of a gene conversion event. The conversion tract was present in all individuals of haplotype 2, in Caucasian, African and East Asian populations, and is thus very old.
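The tract can be checked directly against the short sequences printed in Figure 4b. The sketch below is illustrative only: it uses a simple position-by-position comparison of already aligned sequences rather than the BlastN search described above, and the helper name `mismatches` is hypothetical.

```python
# Sequences as printed in Figure 4b (identical flanking bases in lower case).
ALPHA4_HAP1 = "tccaaGGGCCGCGGAGGcctt"
ALPHA4_HAP2 = "tccaaAAGACACCGGGAcctt"
ALPHA8_REF = "tccaaAAGACACCGGGAcctt"


def mismatches(seq_a, seq_b):
    """Return (index, base_a, base_b) for each position where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


tract = mismatches(ALPHA4_HAP1, ALPHA4_HAP2)
print(len(tract))                           # number of clustered differences
print(mismatches(ALPHA4_HAP2, ALPHA8_REF))  # empty list if the two are identical here
```

With these sequences the two Pcdhα4 haplotypes differ at seven positions within a 12 bp stretch, while the haplotype 2 Pcdhα4 sequence is identical to Pcdhα8 across the whole window.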

Polymorphisms in the Pcdhβ cluster

We focused on sequencing Pcdhβ because we hypothesized that 3’ gene conversion events might be more common in this cluster in order to homogenize the cytoplasmic tail. In the Pcdhα and Pcdhγ clusters the downstream constant exons provide an identical tail between paralogs; the Pcdhβ cluster, however, lacks constant exons and thus each paralog has a unique cytoplasmic tail.

For Pcdhβ, we began with a larger sample set of eight Caucasian individuals, since this cluster is not in extensive linkage disequilibrium and we therefore cannot rely on haplotype structure to help us diversify our sample set. We identified six high quality SNPs in five paralogs: Pcdh-β3, -β7, -β8a, -β13, and -β14. Of these SNPs, five are synonymous and only one is non-synonymous; likewise, five are transitions and only one is a transversion. Although five of the six SNPs could possibly be explained by conversion from another paralog, we did not believe there was sufficient evidence of gene conversion to warrant an expanded sequencing effort.

Expanded protocadherin-α sequencing and polymorphism analysis

We then began an extensive sequencing project to sequence all ectodomains of all thirteen Pcdhα paralogs in twelve individuals homozygous for the Pcdhα haplotypes, six individuals for each haplotype. We began with ectodomains 1 and 6, as these are the ectodomains that show the most evidence of gene conversion49. We hoped to identify more possible gene conversion events in this greatly expanded sequencing set.

We did not finish sequencing all ectodomains in every individual because the major finding of our initial sequencing efforts was published by Miki et al. while we were performing our expanded sequencing. These results will be discussed later in this chapter. We obtained good sequence data for EC1 and EC6 in every protocadherin-α paralog, EC2 and EC5 for most protocadherin-α paralogs, and EC3 and EC4 in only a handful of protocadherin-α paralogs. From these sequencing efforts we verified many of our previously identified SNPs and the gene conversion event, and identified 40 new SNPs. Because ectodomains 2-4 were not extensively covered in this sequencing effort, we could not verify all thirteen previously identified SNPs. In total we identified 53 SNPs confirmed by at least two sequencing reads exhibiting high sequence quality (Table 1).

We found that 33 of the 53 SNPs are non-synonymous changes, 18 are synonymous changes, and 2 are not translated. On average, non-synonymous substitutions are less common than synonymous substitutions in protein-coding regions170. Thus, the presence of almost twice as many non-synonymous substitutions as synonymous ones is surprising. This result is consistent with gene conversion and/or positive selection in which non-synonymous, diversifying changes are preferred, whether or not they arise by gene conversion. We believe that gene conversion contributes to this process because the proportion of transitions among our SNPs is lower than the average for human SNPs.
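Whether a coding SNP is synonymous depends only on the codon it falls in and the standard genetic code. The following minimal sketch shows that classification; the function name `classify_snp` is hypothetical, and the codons used are invented examples chosen to give amino acid outcomes of the kind listed in Table 1 (the actual codons at our SNPs are not reproduced here).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon, position_in_codon, alt_base):
    """Classify a single-base change within one codon.

    Returns (label, reference amino acid, variant amino acid), where label
    is "synonymous" or "non-synonymous".
    """
    ref_codon = ref_codon.upper()
    alt_codon = (
        ref_codon[:position_in_codon]
        + alt_base.upper()
        + ref_codon[position_in_codon + 1:]
    )
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, ref_aa, alt_aa


# Hypothetical codons: GAG -> AAG changes glutamic acid (E) to lysine (K);
# GGC -> GGT leaves glycine (G) unchanged.
print(classify_snp("GAG", 0, "A"))
print(classify_snp("GGC", 2, "T"))

# Ratio of non-synonymous to synonymous SNPs reported in the text.
print(33 / 18)
```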

We observed that 53% of our mutations were transitions, while on average between 66% and 71% of human SNPs are transitions171, 172, 173. This difference is significant by χ2 test at both the lower and upper bounds of the average transition proportion (p=0.04 at 66% and p=0.006 at 71%). The frequency of transitions and transversions during gene conversion appears to be the same174. Our data are therefore consistent with many of the SNPs we have identified being the result of gene conversion and not random mutation.
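A comparison of this kind can be sketched as a one-degree-of-freedom χ2 goodness-of-fit test. The example below is an illustration under stated assumptions, not a record of the original calculation: it assumes 28 transitions and 25 transversions (the split that 53% of 53 SNPs rounds to), applies no continuity correction, and uses a hypothetical function name. Because the exact counts and test settings used for the reported p-values are not given here, the output need not match them to the last digit.

```python
import math


def chi_square_two_classes(observed_a, observed_b, expected_fraction_a):
    """Chi-square goodness-of-fit test for two classes (one degree of freedom).

    Returns (statistic, p_value). For one degree of freedom the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(x / 2)).
    """
    total = observed_a + observed_b
    expected_a = total * expected_fraction_a
    expected_b = total - expected_a
    statistic = (
        (observed_a - expected_a) ** 2 / expected_a
        + (observed_b - expected_b) ** 2 / expected_b
    )
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Assumed counts: 28 transitions and 25 transversions among 53 SNPs.
for expected_transition_fraction in (0.66, 0.71):
    stat, p = chi_square_two_classes(28, 25, expected_transition_fraction)
    print(expected_transition_fraction, round(stat, 2), round(p, 4))
```

Under these assumptions the deficit of transitions is significant at the 0.05 level against either expected proportion, in line with the conclusion drawn in the text.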

We next considered whether any of the 53 SNPs could have resulted from gene conversion. We did this by performing BlastN using the reference and polymorphic sequences to search a library of human protocadherin-α haplotype 1 reference sequences. We used a window of 101 base pairs, with 50 base pairs upstream and downstream from the SNP, in the search. For SNPs that tracked with the haplotypes, we considered gene conversion events to be possible if either the reference (haplotype 1) or polymorphic (haplotype 2) sequence matched significantly with another protocadherin paralog. For SNPs occurring in single individuals and those that subdefine haplotypes, we only considered gene conversion events if the polymorphic sequence matched another paralog.
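The window construction and the decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the BlastN search itself is not implemented, and `ref_matches` and `alt_matches` simply stand for the lists of other paralogs that gave significant hits for the reference and polymorphic windows.

```python
def snp_windows(reference, snp_index, alt_base, flank=50):
    """Build the reference and polymorphic query windows around a SNP.

    With flank=50 each window is 101 bases long and the SNP sits at its centre.
    """
    if snp_index < flank or snp_index + flank >= len(reference):
        raise ValueError("SNP is too close to the end of the sequence")
    ref_window = reference[snp_index - flank: snp_index + flank + 1]
    alt_window = ref_window[:flank] + alt_base + ref_window[flank + 1:]
    return ref_window, alt_window


def conversion_possible(tracks_with_haplotypes, ref_matches, alt_matches):
    """Apply the rule from the text to the paralog hits for one SNP.

    Haplotype-tracking SNPs qualify if either window matches another paralog;
    singleton and sub-haplotype SNPs qualify only if the polymorphic window does.
    """
    if tracks_with_haplotypes:
        return bool(ref_matches or alt_matches)
    return bool(alt_matches)


# Toy sequence and SNP, purely for illustration.
reference = "ACGT" * 40
ref_window, alt_window = snp_windows(reference, 80, "G")
print(len(ref_window), ref_window[50], alt_window[50])
print(conversion_possible(True, ["paralog_10"], []))
print(conversion_possible(False, ["paralog_10"], []))
```

The asymmetry in the rule reflects the fact that only the polymorphic allele is known to be derived when a SNP is seen in a single individual or on a sub-haplotype.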

Many of these possible gene conversion events will not be actual conversions, and instead are mutations that just happen to resemble other paralogs. There are several instances (SNPs α2-415, α3-2411, α4-1728, α8-332, α8-1350, α9-1387, α9-1545, α10-1575) where either the reference or polymorphic sequence is identical or nearly identical at the SNP and in the surrounding sequence to many other protocadherin paralogs, while the other sequence has no matches to other paralogs. It seems more likely in these instances that the sequence with many paralog matches is identical to those sequences because of conservation, and that the other sequence represents an independent mutation. It is possible, however, that both the reference and polymorphic sequence were once identical at that SNP, but different from every other paralog, and that gene conversion “repaired” the SNP on only one haplotype. While only 25 of our SNPs are flagged as possible gene conversion events by our analysis, it remains possible that many of the other SNPs were generated by gene conversion. The database that we used to search for similar stretches of DNA is composed only of reference (haplotype 1) sequence. Therefore, only gene conversion events where the donor sequence was haplotype 1 are visible. Furthermore, older conversion events may be undetectable because the donor paralog has now been lost or has significantly diverged.

Consistent with the previous report that linkage disequilibrium breaks down 3’ to Pcdhα7, the common SNPs in Pcdhα9-α13 do not track neatly with the haplotypes, with many individuals displaying chromosomes obviously generated by recombination events between the two haplotypes (Figure 5)4. Surprisingly, only two SNPs within the core haplotype region define sub-haplotypes. SNP α4 264 subdefines haplotype 2. SNP α3 1001 evidently arose on haplotype 1 and is present only in heterozygous form in multiple individuals (Figure 5).

As previously mentioned, Miki et al. sequenced DNA from 104 Japanese individuals across the Pcdh-α and -β clusters101. Consistent with our results, they found that non-synonymous changes were about twice as likely as synonymous ones in the Pcdh-α, but not the -β, cluster101. As they were able to extensively cover the ectodomains, they found that non-synonymous changes were more common in EC1 of the Pcdh-α paralogs101. They identified 418 coding SNPs in total, including 15 polymorphisms differentiating the two major Pcdh-α haplotypes101. They noted that the frequency of coding polymorphisms was negatively correlated with paralogous sequence diversity, consistent with gene conversion101. They identified the Pcdhα4/Pcdhα8 gene conversion event as well as a gene conversion event between Pcdhβ9 and Pcdhβ7 101. Of our 53 identified Pcdhα SNPs, their sequencing revealed 41 of them101. They did not find any of the Pcdhβ SNPs that we identified, presumably because they used an exclusively Japanese population, while we used exclusively Caucasians101. They found an excess of intermediate-frequency variants, indicative of balancing selection101; we could not perform this analysis as we had used only homozygotes. Their analysis did not include any discussion of the functional effect that the Pcdhα4/Pcdhα8 gene conversion event might have, a point that we feel is of particular interest as the conversion event changes a highly conserved motif implicated in protocadherin function.

These sequencing efforts revealed previously unpublished polymorphisms in the protocadherin-α cluster. Coding polymorphisms that differentiate between the two major haplotypes were identified. Analysis of these polymorphisms reveals further evidence of gene conversion as there are more non-synonymous mutations and transversions than would be expected by random mutation. We identified one obvious gene conversion event with possible functional consequences.

Table 1: Table of confirmed Pcdhα single nucleotide polymorphisms

Paralog | Nucleotide Position | Synonymous or Non-Synonymous | Change (Transition or Transversion) | Sample Set Used | Haplotype Distribution | Possible gene conversion?
α1.4/α1.5 Linker | 1444 | Non-Synonymous: Asparagine → Histidine | A/C Transversion | HVP (8) | SNP verified in only one individual | No
α1 3’ of ECs | 1687 | Synonymous | C/T Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, reference sequence matches Pcdhα-2, -5, and -12
α1 3’ of ECs | 2294 | Synonymous | A/G Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | No
α1 3’ of ECs | 2375 | Non-Synonymous: Cysteine → Phenylalanine | G/T Transversion | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, reference sequence matches Pcdhα-10
α2.1 | 172 | Non-Synonymous: Glutamic acid → Lysine | A/G Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-6 and -8. Reference sequence matches Pcdhα-1, -10, and -12
α2.1 | 223 | Non-Synonymous: Glycine → Serine | A/G Transition | HVP (8) | SNP verified in only one individual | No
α2.1 | 415 | Non-Synonymous: Valine → Leucine | C/G Transversion | HVP/AT (86) and HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-3 through -α13
α2 3’ UTR | 2581 | Not Translated | A/G Transition | HVP (8) | SNP verified in only one individual | No
α3.3 | 1001 | Non-Synonymous: Isoleucine → Threonine | C/T Transition | HVP/AT (86), found by Miki et al. | Subdefines haplotype 1 | No
α3.3 | 1087 | Non-Synonymous: Isoleucine → Valine | A/G Transition | HVP/AT (86), found by Miki et al. | Tracks with haplotypes | No
α3.3 | 1117 | Synonymous | C/T Transition | HVP/AT (86) and HVP (8), found by Miki et al. | SNP verified in only one individual | No
α3.4 | 1245 | Synonymous | C/T Transition | HVP/AT (86), found by Miki et al. | SNP verified in only one individual | Yes, polymorphic sequence matches Pcdhα-4
α3.4 | 1427 | Non-Synonymous: Glycine → Asparagine | A/G Transition | HVP/AT (86) and HVP (8), found by Miki et al. | SNP verified in only one individual | No
α3.4 | 1454 | Non-Synonymous: Serine → Isoleucine | G/T Transversion | HVP/AT (86) and HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, reference sequence matches Pcdhα-10 and -13
α3 3’ of ECs | 2411 | Non-Synonymous: Tyrosine → Cysteine | A/G Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, reference sequence matches Pcdhα-2, -4, -5, -7, -8, -9, -11, -12, -13
α4.1 | 264 | Non-Synonymous: Glutamic Acid → Aspartic Acid | G/T Transversion | HVP/AT (86) and HVP (8), found by Miki et al. | Subdefines haplotype 2 | No
α4.1 | 375 | Synonymous | C/G Transversion | HVP/AT (86), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-3, -6 through -11, and -13. Reference sequence matches Pcdhα-2, -4, -5
α4.1 | 429 | Synonymous | A/G Transition | HVP/AT (86), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-2, -3, -5, -6, -8, -10, -11, -12, and -13. Reference sequence matches Pcdhα-9
α4.2 | 649 | Non-Synonymous: Phenylalanine → Serine | C/T Transition | HVP/AT (86), found by Miki et al. | Tracks with haplotypes | No
α4.3 | 991 | Non-Synonymous: Isoleucine → Valine | A/G Transition | HVP/AT (86) | Subdefines haplotype 2 | No
α4.3 | 1019 | Non-Synonymous: Tyrosine → Cysteine | A/G Transition | HVP/AT (86), found by Miki et al. | SNP verified in only one individual | No
α4.5 | 1710 | Synonymous | G/T Transversion | HVP (8), found by Miki et al. | Tracks with haplotypes | No
α4.5 | 1728 | Synonymous | A/G Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-1 through -13
α4.6 | 1902 | Synonymous | C/G Transversion | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-12. Reference sequence matches Pcdhα-1, -8, -9
α4.6 | 1992 | Synonymous | A/C Transversion | HVP (8) | SNP verified in only one individual | Yes, polymorphic sequence matches Pcdhα11
α5 5’ UTR | 159 | Not translated | C/T Transition | HVP (8) | Tracks with haplotypes | No
α5 3’ of ECs | 2171 | Non-Synonymous: Alanine → Valine | C/T Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | No
α5 3’ of ECs | 2412 | Non-Synonymous: Phenylalanine → Leucine | C/G Transversion | HVP (8) | SNP verified in only one individual | No
α6.4 | 1350 | Non-Synonymous: Serine → Arginine | A/C Transversion | HVP (8), found by Miki et al. | SNP verified in only one individual | Yes, polymorphic sequence matches Pcdhα8
α6.6 | 1890 | Synonymous | G/T Transversion | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-1, -2, -8, -9, -10. Reference sequence matches Pcdhα-6, -7
α7.1 | 238 | Non-Synonymous: Arginine → Cysteine | C/T Transition | HVP (8) | SNP verified in only one individual | No
α7.1 | 401 | Non-Synonymous: Glutamic Acid → Glycine | A/G Transition | HVP (8) | SNP verified in only one individual | No
α7.1/α7.2 Linker | 512 | Non-Synonymous: Arginine → Lysine | A/G Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα9
α7.3/α7.4 Linker | 1126 | Non-Synonymous: Valine → Isoleucine | A/G Transition | HVP (8), found by Miki et al. | SNP verified in only one individual | No
α7.5 | 1650 | Synonymous | A/G Transition | HVP (8) | SNP verified in only one individual | No
α8.1 | 332 | Non-Synonymous: Serine → Asparagine | A/G Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-1 through -13
α8.2 | 606 | Synonymous | C/T Transition | HVP (8) | SNP verified in only one individual | No
α8.4 | 1350 | Non-Synonymous: Arginine → Serine | A/C Transversion | HVP (8), found by Miki et al. | Tracks with haplotypes | Yes, polymorphic sequence matches Pcdhα-2 through -6 and -12
α8.5/α8.6 Linker | 1834 | Non-Synonymous: Lysine → Glutamic Acid | A/G Transition | HVP (8), found by Miki et al. | Tracks with haplotypes | No
α9.1 | 183 | Non-Synonymous: Serine → Arginine | A/C Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in all three populations | Yes, polymorphic sequence matches Pcdhα-7. Reference sequence matches Pcdhα-1, -2, -3, -5, -6, -8 through -12
α9.3 | 1105 | Non-Synonymous: Leucine → Valine | C/G Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in Asians and Caucasians | Yes, polymorphic sequence matches Pcdhα-7
α9.4 | 1332 | Non-Synonymous: Arginine → Serine | A/C Transversion | HVP (8) | SNP verified in only one individual | Yes, polymorphic sequence matches Pcdhα-7 and -10
α9.5 | 1387 | Non-Synonymous: Glycine → Arginine | C/G Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in all three populations | Yes, reference sequence matches Pcdhα-1, -2, -3, -5 through -10, -13
α9.5 | 1545 | Synonymous | G/T Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in all three populations | Yes, polymorphic sequence matches Pcdhα-1, -3 through -13
α9 3’ of ECs | 2389 | Non-Synonymous (with 2390): Serine → Arginine | A/C Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in Asians and Caucasians (always tracks with SNP at position 2390) | No (Considered with SNP 2390)
α9.6 3’ of ECs | 2390 | Non-Synonymous (with 2389): Serine → Arginine | A/C Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in Asians and Caucasians (always tracks with SNP at position 2389) | No (Considered with SNP 2390)
α10.4 | 1416 | Non-Synonymous: Serine → Arginine | C/G Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in all three populations | Yes, polymorphic sequence matches Pcdhα-1 and -9. Reference sequence matches Pcdhα-3, -5, -9, -10, -13
α10.4 | 1575 | Synonymous | G/T Transversion | HVP (8), found by Miki et al. | Tracks with haplotypes, except in Asians, where both haplotypes have the haplotype 1 SNP | Yes, polymorphic sequence matches Pcdhα-2 through -7, -9 through -13
α10.6 | 2014 | Non-Synonymous: Threonine → Alanine | A/G Transition | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in all three populations | No
α11.5 | 1663 | Synonymous | C/T Transition | HVP (8), found by Miki et al. | Heterozygotes seen in all three populations on both haplotypes | Yes, polymorphic sequence matches Pcdhα-2 through -6, -8, and -10. Reference sequence matches Pcdhα-7, -9, -11, -13
α11.6 | 1882 | Non-Synonymous: Alanine → Serine | G/T Transversion | HVP (8), found by Miki et al. | SNP verified in only one individual | No
α12.2/α12.3 Linker | 834 | Synonymous | A/C Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in all three populations | No
α13.5 | 1536 | Synonymous | G/T Transversion | HVP (8), found by Miki et al. | Generally tracks with haplotypes but heterozygotes seen in Africans and Caucasians | No
β2 | 254 | Synonymous | C/T Transition | AT (8) | Homozygotes and heterozygotes observed | Yes, polymorphic sequence matches Pcdhβ-16
β7 | 231 | Synonymous | A/G Transition | AT (8) | Homozygotes and heterozygotes observed | No
β8a | 330 | Non-Synonymous: Alanine → Valine | A/G Transition | AT (8) | Homozygotes and heterozygotes observed | Yes, reference sequence matches Pcdhβ-2, -3, -4, -6, -7, -9 through -14, -16 through -18
β13 | 401 | Synonymous | A/G Transition | AT (8) | Multiple heterozygotes observed | Yes, reference sequence matches Pcdhβ-2, -3, -9, -10
β14 | 523 | Synonymous | A/G Transition | AT (8) | Only homozygotes observed | Yes, polymorphic sequence matches Pcdhβ-15. Reference sequence matches Pcdhβ-2, -3, -5 through -13, -17
β14 | 571 | Synonymous | A/C Transversion | AT (8) | Homozygotes and heterozygotes observed | Yes, polymorphic sequence matches Pcdhβ-9, -10. Reference sequence matches Pcdhβ-2 through -7, -11 through -18

All identified SNPs were of high sequence quality and were confirmed in both forward and reverse reads. The ectodomain location, nucleotide position relative to the transcriptional start, whether the SNP results in a synonymous or non-synonymous change, which DNA sample was used to identify and/or verify the SNP, and the distribution of the SNP among haplotypes are shown. HVP/AT (86): the panel of 86 individuals from the human variation panel and autism families used in initial sequencing. HVP (8): the panel of 8 individuals from the human variation panel used in our expanded sequencing. AT (8): the panel of 8 individuals from autism families.

Figure 4: Identified gene conversion event may transfer functional domains between human protocadherin paralogs

a) [Sequence trace image]

b)
Haplotype 1&2 α8   tccaaAAGACACCGGGAcctt
Haplotype 2 α4     tccaaAAGACACCGGGAcctt
Haplotype 1 α4     tccaaGGGCCGCGGAGGcctt

Haplotype 2 α4     SKGRGGL
Haplotype 1 α4     SKRHRDL

c)
α4                       α8
TCCAAGGGCCGCGGAGGCCTT    TCCAAAAGACACCGGGACCTT
AGGTTCCCGGCGCCTCCGGAA    AGGTTTTCTGTGGCCCTGGAA

TCCAAGGGCCGCGGAGGCCTT
AGGTTCCCGGCGCCTCCGGAA
TCCAAAAGACACCGGGACCTT
AGGTTTTCTGTGGCCCTGGAA

α4* (Haplotype 2)        α8
TCCAAAAGACACCGGGACCTT    TCCAAAAGACACCGGGACCTT
AGGTTTTCTGTGGCCCTGGAA    AGGTTTTCTGTGGCCCTGGAA

a) Pcdhα4 ectodomain 1 sequence reads from two individuals homozygous for the Pcdhα haplotype 1 (C07) or haplotype 2 (G07). The gene conversion event appears as a series of polymorphisms flagged by PolyPhred. b) The Pcdhα4 EC1 haplotype 2 sequence is identical to the Pcdhα8 EC1 reference sequence (conversion event highlighted in bold text with nucleotide changes highlighted in bold red text). The conversion event changes the highly conserved RGD sequence (highlighted in bold red text) present in all Pcdhα first ectodomains. c) Schematic representation of the Pcdhα4-Pcdhα8 conversion event.

Figure 5: Tracking identified SNPs onto known promoter-defined Pcdhα haplotypes

SNPs 3’ to Pcdhα8 do not track as neatly with the established haplotypes. Recombination events are obvious in every population. SNPs within the core haplotype region track well with the haplotypes.

Chapter 3: The Protocadherin Cytoplasmic Domain and Fragment Localization

The “molecular barcode” hypothesis focuses on the functional role of the extracellular domain of the protocadherins, and suggests that homophilic interactions may maintain the adhesion of synapses. However, some hypothesize that, due to the protocadherins’ early non-synaptic, axon-only localization and the fact that the cytoplasmic domain alone is sufficient to greatly ameliorate all knockout and knockdown phenotypes, the protocadherins play an axonal or presynaptic role and do not function at all as trans-synaptic homophilic adhesion molecules79. Instead, this hypothesis suggests that there are potential signaling roles for the protocadherins. One might then hypothesize that the diversity of the protocadherins results in a diversity of signaling events rather than a diversity of homophilic interactions.

Classic adhesive molecules such as the selectins, classical cadherins and integrins are known to be involved in cell signaling175. Part of the function of E-cadherin appears to be sequestration of β-catenin at the membrane. When signaling occurs, β-catenin is released into the cytoplasm, from where it localizes to the nucleus and interacts with transcription factors to affect embryogenesis, organogenesis and tumorigenesis176 (reviewed in 177 and 178). In response to cadherin adhesion the classical cadherins act as ligand-activated signaling receptors, signaling through Rac, a member of the Rho family of GTPases. This signaling cascade modulates a membrane-adjacent set of processes which appear to regulate cadherin localization and adhesiveness, as well as cell shape and actin organization (reviewed in 179).

Little is known about the intracellular processes linked to the protocadherin molecules. However, cytoplasmic cleavage products of the protocadherins are known to localize to the nucleus in vitro and in vivo in multiple organisms78, 79, 80, 180. Similar to other cell surface receptors, the γ-protocadherins are subject to presenilin-dependent intramembrane cleavage (PS-IP)78, 80. Presenilin cleavage is primed by an extracellular cleavage event by the matrix metalloprotease ADAM10 82. This initial cleavage generates a soluble extracellular fragment termed the N-terminal fragment (NTF) and a membrane-tethered C-terminal fragment (CTF1). The presenilin cleavage event creates a smaller intracellular domain fragment (ICD), also called CTF2, of about 20 kDa which localizes to the nucleus78, 79, 80. The γ-protocadherin CTF2 (γ-CTF2) contains a bipartite nuclear localization sequence, while α-CTF2 lacks this sequence. However, α-CTF2 does contain a lysine-rich sequence which resembles a nuclear localization signal78. α-Pcdh cleavage by γ-secretase had not been demonstrated at the start of this project. If the nuclear localization of the CTF2 fragment were shown to mediate signaling cascades as the output for protocadherin function, this would allow the coupling of hypotheses about extracellular binding events contributing to neuronal diversity and diverse intracellular signaling events initiated by extracellular cues.

Several lines of evidence suggest that these cleavage events are significant for protocadherin function. First, the γ-CTF2 is not an artifact of experimentation. It is generated and localized to the nucleus in vivo in multiple organisms78, 79, 80. Second, the cytoplasmic domain, including the sequence for γ-CTF2, is necessary for protocadherin function. Mice lacking a functional γ-CTF2 exhibit the neonatal lethality of the complete knockout mice78, 98. Like the cultured neurons of the complete knockout, cultured neurons from mice lacking the γ-Pcdh carboxyl terminus show a reduction in the amplitude of spontaneous synaptic transmissions, indicating that synaptic function is compromised98. Thus, the function of the protocadherins is dependent on the presence of a functional γ-CTF2 either because the γ-CTF2 is responsible for protocadherin functionality, or because the protein is non-functional when not in its full-length form. Third, γ-CTF2 appears to be necessary for protocadherin expression. Mice heterozygous or homozygous for the γ-CTF2 knock-out exhibit severely reduced γ-protocadherin, but not α-protocadherin, mRNA levels78. Similarly, the inhibition of γ-secretase decreases γ-protocadherin locus expression, showing that not only must the γ-CTF2 sequence be present, but it must also be properly cleaved in order to sustain protocadherin expression78. This may be executed by a positive feedback mechanism where the γ-CTF2 binds to protocadherin promoters to positively affect transcription. γ-protocadherin promoter activity is known to be increased by expression of the γ-CTF2 fragment78. We believed these lines of evidence suggested a nuclear role for the γ-CTF2, and possibly α-CTF2, which is central to protocadherin function.

Several lines of evidence suggest that the α-CTF2 fragment, although not initially known to be generated in vivo, is likely to be functional and nuclearly localized. A substantial portion of the endogenous immunoreactivity observed for Pcdhα proteins early in zebrafish development is in the nucleus79. Zebrafish unable to splice the intracellular domain to their Pcdh1-α cluster as a result of morpholino injection experience neuronal apoptosis in the brain and spinal cord, with the apoptosis limited to only some subsets of neurons79. Interestingly, this phenotype can be attenuated by expression of the protocadherin-α cytoplasmic fragment79.

Many cell-surface receptors are processed by PS-IP, including amyloid beta (A4) precursor protein (APP), Notch1, and E- and N-cadherin. The functions of the ICD/CTF2 in these systems may provide tractable hypotheses about the function of the protocadherin fragment. Notch is activated by ligand binding, which induces two extracellular cleavage events necessary for the induction of PS-IP181. The consequence of the γ-secretase cleavage is the generation of the Notch ICD, which translocates to the nucleus, binds directly to transcription factors of the C promoter binding factor family and participates in transcriptional regulation of Notch target genes181, 182 (reviewed in 183). Similar to the protocadherins, only a small portion of the Notch protein is detected at the cell surface and Notch protein is continually turned over, itself having a very short half-life181, 182 (reviewed in 183). APP’s 47 amino-acid ICD also translocates to the nucleus where it becomes associated with other proteins in a transcriptionally active complex which apparently acts to regulate phosphoinositide-mediated calcium signaling (reviewed in 183). Other γ-secretase targets which send ICD fragments to the nucleus include the receptor tyrosine kinase ErbB4, the cell-adhesion molecule CD44 and the low-density lipoprotein receptor LRP (reviewed in 183). CD44 and LRP ICD fragments, like Notch ICD, appear to act as transcriptional activators (reviewed in 183).

PS-IP is induced, in the cadherins, by an initial cleavage carried out by a matrix metalloprotease which results in the shedding of an N-terminal fragment from the membrane. The membrane-bound fragment is then further processed by γ-secretase, as has been demonstrated for the protocadherins. These initial cleavage events, in E-cadherin, are induced by calcium imbalance or apoptosis184. The CTF2 fragment of N-cadherin does not localize to the nucleus, instead binding to the transcriptional coactivator cyclic AMP response element binding protein (CREB) binding protein (CBP) in the cytoplasm and promoting its degradation, in turn inhibiting CBP/CREB-mediated transcription185. PS-IP liberates the CTF2 fragment of E-cadherin from β-catenin, resulting in an increase in soluble β-catenin levels. Solubilized β-catenin then localizes to the nucleus, complexes with transcription factors, and activates expression of target genes184.

γ-secretase targets, in general, appear to mediate signaling not by activation of protein signaling cascades but instead by the localization of the cytoplasmic ICD fragment or ICD fragment binding partners to both nuclear and cytoplasmic transcription factors to control transcription. We therefore hypothesize that protocadherin CTF2 fragment nuclear localization affects transcription in a similar manner. Since its nuclear localization is already known, we believe that the CTF2 fragment affects transcription in the nucleus by itself binding to DNA or by interacting with transcription factors to indirectly affect expression.

Why would adhesion molecules be cleaved? One hypothesis postulates that cleavage is triggered by a lack of binding, which initiates intracellular events that eventually promote expression. A lack of homophilic interaction would induce the production of the Pcdh-CTF2 fragment, which localizes to the nucleus and increases expression of the protocadherins. This increased expression would increase the chance of binding to cells with appropriate partners, thereby strengthening the connection between cells78. We would then have to hypothesize that the protocadherins, and perhaps the protocadherin CTF2, are linked to a second, as yet undetailed, signaling cascade that would then be activated upon appropriate binding. Studies using chemical treatments to determine the conditions under which cleavage is promoted or inhibited indicate that cleavage may be linked to neural activity, but no connection between binding and cleavage has yet been demonstrated.

In general, for the cadherins, adhesion is decreased as a result of ADAM10-dependent cleavage, although the liberated N-terminal fragment appears to be capable of promoting adhesion186, 187. Similarly, promotion of the ADAM10 cleavage event in protocadherin-expressing cells decreases cell aggregation82. This appears to contradict the hypothesis that cleavage serves to increase adhesion, although it is unclear whether this decreased adhesion is a necessary step towards the ultimate goal of increased adhesion. But what conditions promote cleavage in the first place? Treatment of cells with an agent that stimulates calcium influx increases both N-cadherin and protocadherin ADAM10 cleavage82, 186. The application of either potassium chloride or glutamate, to mimic depolarization or glutamate receptor stimulation respectively, induces cleavage in N-cadherins and the protocadherins82, 186. Thus, both calcium influx and the stimulation of neural activity induce cleavage of the protocadherins, which in turn appears to result in decreased adhesive activity. How extracellular binding, these cleavage-inducing factors and the CTF2 all interact to produce protocadherin functionality is still a major unanswered question. If the nuclear targets of CTF2 were known, the connection between all of these data might be revealed.

At the start of this project, nuclear localization of the γ-CTF2 had only been demonstrated by one group and was only visible when the putative γ-CTF2 was overexpressed in tissue culture under proteasome inhibition80. We first aimed to verify that overexpressed γ-CTF2 fragment localized to the nucleus in vitro. We then wanted to determine whether overexpressed α-CTF2 was capable of nuclear localization. Once fragment localization was demonstrated, we began a series of experiments to determine the possible function of fragment localization. These experiments focused on the hypothesis that CTF2 localization directly or indirectly affected transcription.

Protocadherin CTF2 fragments localize to the nucleus

The initial report by Haas et al. detailing the nuclear localization of γ-CTF2 fragments focused on the CTF2 fragment of Pcdh-γc3 (ref. 80). We therefore chose to verify putative γc3-CTF2 localization, as well as that of γb7-CTF2, α6-CTF2 and β15-CTF2. We wished to demonstrate that a non-c-type γ-protocadherin CTF2 (γb7-CTF2) also localized to the nucleus, as c-type protocadherins like Pcdh-γc3 do not share many of the common characteristics of the rest of the Pcdh-γ family52. We also wished to determine whether the intracellular domains of an α-protocadherin (α6-CTF2) and a β-protocadherin (β15-cytoplasmic) also localized to the nucleus, as this had not yet been demonstrated. These particular paralogs (Pcdh-γb7, Pcdh-α6 and Pcdh-β15) were chosen based on the availability of full-length clones from Open Biosystems that fully matched the public consensus sequence.

γ-secretase has no known consensus cleavage site, although cleavage of its substrates tends to occur at one or more specific locations188. Therefore, we decided to generate our putative CTF2 fragments using all sequence downstream of the putative transmembrane region, excluding the stop codon in order to allow expression of 3’ V5 and 6x-His tags. In order to determine whether the paralog-specific sequence of the intracellular domain was necessary for nuclear localization and/or function, we also generated fragments consisting of only the sequence encoded by the constant domain exons of the protocadherin-α and -γ clusters (Figure 6).

Transfection of full-length protocadherins into cultured cells often results in mislocalization of the protein43, 69, 76. HEK293 cells are one of the few cell lines that reliably localize the protocadherins to the membrane, and are also the cell line used in the initial report of protocadherin fragment nuclear localization80. We therefore decided to use HEK293 cells, as well as two neuronal-like cell lines known to transfect well in our laboratory, the HTB-11 and BE(2)-C lines, to investigate protocadherin CTF2 localization. Expression of the transiently transfected CTF2 fragments was verified by RT-PCR and Western blot. While some groups were unable to see discernible levels of exogenously expressed Pcdh-CTF2 fragments without proteasome inhibition, we were easily able to see expression of the Pcdh-CTF2 fragments in our system, with the exception of Pcdhβ15, which was detected by neither RT-PCR nor Western blot (Figure 7 and data not shown). Experimentation on the cytoplasmic region of Pcdhβ15 was therefore discontinued.

We then performed immunofluorescence microscopy on transiently transfected cell lines to assess fragment localization. Cytoplasmic actin (ACTG1) and forkhead box P2 (FOXP2) were used as cytoplasmic and nuclear controls, respectively. HTB-11 cells did not transfect as well with eGFP as the other cell lines and were thus not used for further experimentation. Nearly 100% of HEK293 cells transfected with eGFP expressed it, but under immunofluorescence only 5-10% of cells displayed prominent Pcdh-CTF2 expression. In these cells expression seemed to be localized to the nucleus, but the cytoplasm of fixed HEK293 cells was very thin, making definitive nuclear localization difficult to determine (data not shown). Transfected BE(2)-C cells were even more difficult to analyze. They did not transfect as well with eGFP as the HEK293 cells and displayed even lower expression of transfected Pcdh-CTF2 fragments, with less than 1% of cells showing obvious expression (data not shown). Even after changes to the fixation and staining protocol, a high amount of background staining was observed in BE(2)-C cells, making localization determinations impossible. We therefore chose to use HT1080 cells, which displayed no background fluorescence and transfected as well as HEK293 cells, for visualization of constructs.

All Pcdh-CTF2 constructs transfected into HT1080 cells showed evidence of nuclear localization. In the 5-10% of cells displaying significant Pcdh-CTF2 expression, all showed staining which overlapped with the DAPI nuclear stain and resembled the staining of the known nuclear transcription factor FoxP2 (Figure 8). Faint cytoplasmic staining was also observed for all constructs. The observation that both Pcdh constant domains and the putative Pcdhα-CTF2 fragment localize to the nucleus is novel. While the Pcdhα and Pcdhγ CTF2 fragments display no significant similarity to each other, they appear to share a common mode of action that involves localization to the nucleus. This localization does not require the presence of the variable cytoplasmic region; fragments containing only the constant cytoplasmic domain are capable of localizing to the nucleus on their own. It has subsequently been published that Pcdhα-CTF2 fragments also localize to the nucleus180. Experimentation on Pcdhγ fragments by other groups has further confirmed nuclear localization of various Pcdhγ paralogs78, 79, 80, 180. Confident that our generated fragments displayed their expected localization patterns, we proceeded to make stable cell lines.


Figure 6: Schematic of protocadherin constructs

Schematic of the tagged protocadherin constructs. S = signal peptide, EC = ectodomain, TM = transmembrane, V = variable cytoplasmic domain, CD = constant cytoplasmic domain, V5 His = V5-6x His C-terminal tag.

Figure 7: Protocadherin CTF2 fragments are expressed by transiently transfected HEK293 cells

[Western blot image; size labels on the blot: 31 kDa, 17 kDa.]

The primary antibody used was against the 3’ V5 tag. The expected sizes of the generated ICD fragments are as follows: Pcdh-α6-ICD: 28 kDa, Pcdh-αconstant: 20 kDa, Pcdh-β15-cytoplasmic: 13 kDa, Pcdh-γb7-ICD: 27 kDa, Pcdh-γc3-ICD: 28 kDa, Pcdh-γconstant: 18 kDa. The expected eGFP size is 27 kDa. The higher apparent molecular weight of Pcdhα6-ICD is likely due to glycosylation, as the Pcdhα members are known to be extensively glycosylated38, 39, 43. The non-specific band at approximately 30 kDa could not be eliminated by changing blot conditions or antibodies.

Figure 8: Protocadherin CTF2 fragments are not excluded from the nucleus of transiently transfected HT1080 cells

Transiently transfected HT1080 cells expressing ActG1, FoxP2, Pcdhα6 CTF2, Pcdhα constant region, Pcdhγb7 CTF2, Pcdhγc3 CTF2 or Pcdhγ constant region. Staining is for the V5 tag or DAPI nuclear stain.

Stable cell lines expressing protocadherin CTF2 fragments

BE(2)-C and HEK293 cells were used for the creation of stable cell lines. We hypothesized that stable cell lines would show far more homogenous expression of the Pcdh-CTF2 fragments than we were able to obtain by transient transfection. Multiple stable colonies were obtained for each construct and their expression was assayed by Western blot (Figure 9). The two highest expressing stable lines were expanded and maintained for each construct. One colony of each construct was used for nuclear extraction to ensure correct fragment localization. For each construct, expression was seen in both the cytoplasm and the nucleus, as was indicated in transient immunofluorescence experiments. Therefore, further experimentation using these stable cell lines was initiated.


Figure 9: Stable HEK293 cell lines express protocadherin CTF2 fragments

Stable colonies express the CTF2 fragment. Expression for individual stable colonies, pooled stable colonies and transiently transfected cell lines is shown for comparison. Arrows indicate the expected position of each CTF2 fragment.

Exploration of transcriptional roles for protocadherin fragments

While it appears clear that protocadherin fragments are localized to the nucleus, there are only hints as to how the fragments function. For the reasons discussed in the introduction to this chapter, namely a decrease in expression in heterozygous γ-cluster deleted mice and PS-IP inhibited cell lines, and an increase in expression when CTF2 fragments are overexpressed, we believed that there was significant evidence for a direct or indirect role for the protocadherins in transcription78. We approached this question from three directions. First, we asked whether protocadherin CTF2 fragments are capable of binding to DNA. Next, we explored the effect that CTF2 fragment expression had on endogenous protocadherin expression, including determining whether CTF2 fragments were bound at protocadherin promoters. Finally, we performed chromatin immunoprecipitation followed by hybridization to promoter arrays (chIP-chip) to determine whether CTF2 fragments were found at any endogenous promoters. We chose to focus our experimentation on Pcdh-γc3 as it was the paralog whose nuclear localization was best established in the literature.

First, we used an unbiased approach to explore CTF2 binding to DNA. We coupled an anti-V5 antibody to columns and applied chromatin samples prepared from Pcdh-γc3-CTF2 stable HEK293 lines, FoxP2 transiently transfected HEK293 cells as a positive control, and untransfected HEK293 cells as a negative control. Samples were eluted from the columns, amplified and run on agarose gels. After standard chromatin preparation and immunoprecipitation there is not sufficient DNA present to be visualized on a gel without amplification. If Pcdh-γc3-CTF2 binds to DNA, we would expect the protein to be bound on the column and DNA to elute with the protein off of the column. Even with sonication we were unable to liberate a sufficient amount of DNA from HEK293 nuclei to be visualized on a gel; even our input chromatin sample, taken before buffer exchange, showed no DNA after amplification. Therefore, this line of investigation was abandoned.

In in vitro promoter assays the Pcdh-γ promoter is more active when co-transfected with a γ-CTF2 overexpression plasmid78. We hypothesized that HEK293 and BE(2)-C cells transiently transfected with any of our Pcdh-CTF2 overexpression plasmids would show an increase in RNA levels of protocadherins compared to untransfected controls. While primers for the Pcdhα6 variable region and Pcdhα constant region performed well for standard curve generation, they failed to provide threshold cycles for nearly every cDNA sample tested, including the untransfected control, and were therefore excluded from analysis. Overall, we found no evidence of an increase in endogenous γ-Pcdh expression in lines transfected with any of our CTF2 constructs as compared to untransfected parent lines (Figure 10).

We next aimed to determine whether the protocadherin CTF2 is capable of binding to protocadherin promoters, as is suggested by the finding that CTF2 fragments increase transcription from protocadherin promoters in vitro78. To this end we transfected stable cell lines expressing the γc3-CTF2 with a plasmid carrying the γc3 promoter, hypothesizing that regulation was likely to be a positive feedback loop with paralogs most likely controlling their own expression. Chromatin was isolated from transfected stable cells and subjected to immunoprecipitation, and the resulting precipitated DNA was used to transform chemically competent bacteria. If the γc3-CTF2 fragment were capable of binding to the γc3 promoter, we would expect the γc3 promoter to immunoprecipitate, bringing with it the gene for ampicillin resistance contained on the plasmid, thereby allowing bacterial growth on IP sample transformation plates. If the γc3-CTF2 fragment were not capable of binding to the γc3 promoter, we would expect the γc3 promoter to remain in the unbound, mockIP, fraction of DNA and allow growth on mockIP sample transformation plates.

We observed no colony growth under either IP or mockIP sample transformations. Increasing lysis and adding sonication had no effect: colonies failed to grow even with these experimental changes. We have observed that chromatin from HEK293 cells is particularly difficult to isolate and sonicate to smaller fragments. Under our standard lysis and sonication procedures, HEK293 chromatin failed to achieve the fragment size range averaging 500 base pairs that we saw for other cell lines. Furthermore, the gentle lysis necessary for nuclear preparation appeared unable to fully liberate nuclei in HEK293 samples, and instead we were forced to rely on increased sonication to finish the lysis. Significantly increasing both lysis and sonication times ameliorated the problem, but we were generally unable to sonicate down to fragments shorter than 1000 base pairs (data not shown). Because we did not wish to sonicate in this experiment, fearing that sonication would separate the γc3 promoter from the ampicillin resistance cassette, we chose to use other techniques to further explore the possibility of Pcdh-CTF2 DNA binding.

We next used immunoprecipitated chromatin from Pcdh-γc3-CTF2 cells to determine whether CTF2 fragments were capable of binding to any of the protocadherin promoters. We hypothesized that binding would be most likely to occur around the conserved sequence element (CSE) present in all protocadherin promoters, as deletion of the CSE is known to reduce expression of the downstream protocadherin exon45. If the protocadherins are truly regulated by a positive feedback mechanism whereby cleaved CTF2 fragments directly regulate paralog expression, then binding to the CSE may be one mechanism by which this positive regulation is achieved. We were unable to design unique primers around the Pcdh-γc3 CSE, but were able to test binding to thirty other paralog CSEs. We found no compelling data for Pcdh-γc3-CTF2 binding to protocadherin CSEs (Figure 11). While the enrichment of the Pcdhα3 CSE was significant over the mean of the negatives, given the non-significance of every other CSE tested, we felt that this was likely an experimental artifact and not of biological significance.

It remains possible that the γc3-CTF2 is capable of binding to the γc3-CSE, but we feel this is unlikely. The CSEs are highly conserved among the protocadherins and the γc3-CSE is very similar to those of many other paralogs, which is why we could not design primers that would amplify only the γc3-CSE. Previous experiments demonstrated an overall increase in protocadherin expression when CTF2 is overexpressed that is not limited to the specific paralog that the CTF2 fragment is derived from78. We were also unable to reproduce the finding that CTF2 expression increases locus expression. Therefore, it is possible that the CTF2 fragment can bind to CSEs under some conditions, but it does not appear to do so promiscuously.

Because directed approaches for assessing protocadherin CTF2 binding failed to yield definitive data in either the positive or the negative direction, we decided to take a more agnostic approach. We used immunoprecipitated chromatin from independent growths and precipitations of Pcdh-γc3-CTF2 stable HEK293 cell lines to hybridize to a promoter array. More than 1000 sets of probes, which we grouped into probe regions, were significantly enriched in the immunoprecipitated sample. In previous chIP-chip experiments in our lab, comparable numbers of significant regions were identified for known transcription factors such as SRF, NRSF and FoxP2 (ref. 9 and unpublished data). However, the average Z-scores for the protocadherin significant regions were not as high as for these known transcription factors, and there was no obvious skew of the data towards positive Z-scores, as would be expected for a protein that is bound to DNA (Figure 12). A complete list of all positive regions and their corresponding region lengths and Z-scores can be found in Supplement 1. The top twenty array hits are listed in Table 2. Three protocadherin regions also emerged as positive array hits: Pcdhγa3, Pcdhγb5 and Pcdhγc3. The positive probes for Pcdhγb5 and Pcdhγc3 were localized intergenically, while the Pcdhγa3 positive probes were in the promoter region. The positive probes for the top twenty array hits were likewise distributed both in promoter regions and intergenically.
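The expectation of a positive skew in probe Z-scores can be illustrated numerically. The following is a minimal sketch, not the analysis pipeline used for the arrays: the function names and the log-ratio values are hypothetical, and the population standard deviation is an assumed choice. It shows that a few strongly enriched probes added to a symmetric background pull the skewness of the Z-score distribution above zero.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def skewness(values):
    """Third standardized moment; values above 0 indicate a heavier positive tail."""
    z = z_scores(values)
    return sum(x ** 3 for x in z) / len(z)

# Hypothetical probe log-ratios (IP over mock): background only, then
# the same background plus two strongly enriched probes.
background = [-0.3, -0.1, 0.0, 0.1, 0.2, -0.2, 0.05, -0.05]
with_binding = background + [2.5, 3.0]

print(round(skewness(background), 3))
print(round(skewness(with_binding), 3))
```

A distribution like the one in Figure 12, with no excess of high positive Z-scores, would behave like the background-only case in this toy example.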

To verify enrichment of positive regions we designed primers tiling across the positive regions of the top 17 array hits and the three positive protocadherin regions. We then performed quantitative real-time PCR using these primers on five chromatin IP samples: amplified immunoprecipitated chromatin from Pcdh-γc3-CTF2 stable HEK293 lines (the same sample labeled and hybridized to the array), amplified mock immunoprecipitated chromatin from Pcdh-γc3-CTF2 stable HEK293 lines (the same sample labeled and hybridized to the array), unamplified immunoprecipitated chromatin from Pcdh-α6-CTF2 stable HEK293 lines, unamplified immunoprecipitated chromatin from Pcdh-γb7-CTF2 stable HEK293 lines, and unamplified immunoprecipitated chromatin from Pcdh-γc3-CTF2 stable HEK293 lines (a separate growth and IP from the samples used on the arrays). Eighteen of the 20 positive regions showed significant enrichment in the amplified Pcdh-γc3-CTF2 IP sample but not the mockIP sample; the other two regions failed to validate. None of the unamplified samples showed enrichment (Figure 13).

There are several explanations for why the unamplified samples failed to validate the array and amplified sample results. First, because the immunoprecipitations for these three unamplified samples were all performed at the same time, it is possible that a gross experimental error caused immunoprecipitation failure. Second, without known protocadherin CTF2 targets we are unable to test how successful the chromatin immunoprecipitation procedure is with our particular cell lines and antibodies. It is therefore possible that our antibody, though rated for immunoprecipitations by the manufacturer, is unable to access our protein or does not significantly enrich our protein, and that any enrichment seen in the amplified samples is purely due to the amplification of noise from the immunoprecipitation. Finally, it is possible that protocadherin CTF2 fragments do not bind to DNA at all and that, again, the enrichment is due to noise amplification.

Figure 10: No significant increase in protocadherin paralog expression is observed in the presence of CTF2 expression

[Bar chart: Protocadherin Paralog Expression in CTF2 Transfected Cell Lines. Y-axis: Expression Relative to Untransfected Control (0 to 3). X-axis: Gene Assayed for Expression by RT-PCR (GAPDH, Pcdhα3, Pcdhβ3, Pcdhγa5, Pcdhγb5, Pcdhγb7, Pcdhγc3). Series: HEK293 and BE(2)-C cells, each untransfected or transfected with Pcdhα6-CTF2, Pcdhα-constant, Pcdhγb7-CTF2, Pcdhγc3-CTF2 or Pcdhγ-constant.]

cDNA was prepared from transiently transfected cell lines and qPCR was performed to assay protocadherin locus expression. Enrichment varied substantially between cell lines and paralogs, but no obvious trend towards increased expression is seen.
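As an illustration of how an "expression relative to untransfected control" value of this kind can be computed, the following minimal sketch reads quantities off a standard curve and normalizes to a reference gene. Normalization to GAPDH, the standard curve parameters and all threshold cycles are assumptions made for the example; the text does not state the exact calculation used.

```python
def quantity_from_ct(ct, slope, intercept):
    """Invert a standard curve of the form ct = slope * log10(quantity) + intercept."""
    return 10 ** ((ct - intercept) / slope)

def relative_expression(sample, control, curves, target, reference="GAPDH"):
    """Reference-normalized target quantity in `sample` divided by that in `control`."""
    def normalized(cts):
        target_q = quantity_from_ct(cts[target], *curves[target])
        reference_q = quantity_from_ct(cts[reference], *curves[reference])
        return target_q / reference_q
    return normalized(sample) / normalized(control)

# Hypothetical standard curves (slope, intercept) and threshold cycles.
curves = {"Pcdhγc3": (-3.32, 38.0), "GAPDH": (-3.32, 35.0)}
untransfected = {"Pcdhγc3": 28.0, "GAPDH": 18.0}
transfected = {"Pcdhγc3": 28.0, "GAPDH": 18.0}

# Identical threshold cycles in both samples give a ratio of 1 (no change).
print(relative_expression(transfected, untransfected, curves, "Pcdhγc3"))
```

With a slope of -3.32, a target threshold cycle that is 3.32 cycles lower at unchanged GAPDH corresponds to roughly tenfold higher relative expression.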


Figure 11: Protocadherin promoter CSEs are not enriched by IP of CTF2 fragments

Top and bottom graphs are identical, with bottom graph zoomed in to show detail. Lines on bottom graph indicate two standard deviations above the mean of the negatives, our lower threshold for calling positive enrichment.
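The enrichment threshold described in this legend (two standard deviations above the mean of the negatives) can be sketched as follows. This is an illustrative example only: the function names, region names and values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def enrichment_threshold(negative_values, n_sd=2.0):
    """Cutoff = mean of the negative regions + n_sd sample standard deviations."""
    mean = statistics.fmean(negative_values)
    return mean + n_sd * statistics.stdev(negative_values)

def call_enriched(region_values, negative_values, n_sd=2.0):
    """Map each region name to True if its enrichment exceeds the cutoff."""
    cutoff = enrichment_threshold(negative_values, n_sd)
    return {name: value > cutoff for name, value in region_values.items()}

# Hypothetical fold-enrichment values for negative control regions and two CSEs.
negatives = [0.8, 1.0, 1.2, 0.9, 1.1]
regions = {"CSE_A": 1.1, "CSE_B": 1.6}

print(enrichment_threshold(negatives))
print(call_enriched(regions, negatives))
```

Under this rule a region within the spread of the negatives (CSE_A here) is not called, while one well above it (CSE_B) is.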


Figure 12: Probe Z-scores are not skewed in the positive direction

[Plot of Number of Probes against Z-score.]

Table 2: Top twenty positively enriched regions identified by chIP-chip

Gene | Gene Name | No. of Probes | P-value | GO Term
CRYGD | Gamma-crystallin D | 3 | 3.0 x 10^-5 | visual perception
NUDT12 | Peroxisomal NADH pyrophosphatase NUDT12 | 6 | 5.1 x 10^-5 | nucleus
MBNL1 | Muscleblind-like protein 1 | 5 | 5.7 x 10^-5 | muscle development/dsRNA binding
KIAA1622 | | 5 | 5.9 x 10^-5 | HEAT-like repeat-containing protein
EIF2A | Eukaryotic translation initiation factor 2A | 6 | 6.5 x 10^-5 | translation initiation factor 2 complex
ALCAM | Activated leukocyte cell adhesion molecule | 4 | 9.8 x 10^-5 | cell adhesion
GSPT2 | G1 to S phase transition protein 2 homolog | 5 | 1.4 x 10^-4 | translation release factor activity
ARHGAP29 | Rho GTPase activating protein 29 | 5 | 1.4 x 10^-4 | intracellular signaling cascade
FAM89A | Family with sequence similarity 89, member A | 5 | 1.6 x 10^-4 |
TEP1 | Phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase | 3 | 1.6 x 10^-4 | phosphatidylinositol-3-phosphatase activity
TARSL2 | Threonyl-tRNA synthetase-like 2 | 5 | 1.7 x 10^-4 | tRNA aminoacylation for translation
NID2 | Nidogen 2 | 4 | 1.8 x 10^-4 | cell adhesion/collagen binding
hsa-mir-507 | | 4 | 1.9 x 10^-4 |
FGF14 | Fibroblast growth factor 14 | 5 | 1.9 x 10^-4 | nervous system development
YIPF6 | Yip1 domain family, member 6 | 5 | 2.0 x 10^-4 | endoplasmic reticulum
TLE1 | Transducin-like enhancer protein 1 | 5 | 2.0 x 10^-4 | negative regulation of transcription
ASTE1 | Asteroid homolog 1 | 10 | 2.0 x 10^-4 | DNA repair
THAP1 | THAP domain containing, apoptosis associated protein 1 | 5 | 2.2 x 10^-4 | DNA binding/metal ion binding
OR5V1 | Olfactory receptor, family 5, subfamily V, member 1 | 3 | 2.2 x 10^-4 | G-protein coupled receptor signaling
TAS2R38 | Taste receptor, type 2, member 138 | 5 | 2.4 x 10^-4 | taste receptor activity


Figure 13: Selected positively enriched regions identified by chIP-chip are validated by qPCR in amplified samples

[Bar chart. Series: Amplified Pcdhγc3-CTF2 IP; Amplified Pcdhγc3-CTF2 mockIP. X-axis: Positively Enriched Region Probes (surviving tick label: MBNL1_4).]

DNA from amplified IP and mockIP samples was assayed for the presence of positively enriched regions identified on the promoter array by qPCR. Amplified IP samples validate the array well.

Stable cell lines are non-homogenous

While we had established by Western blot that our stable cell lines expressed our CTF2 fragments well and localized them to the nucleus, we had yet to examine them by immunofluorescence. Before performing replicate chIP-chip experiments to validate our identified positive regions, we wanted to ensure we were using a pure population of cells that expressed the CTF2 fragment and localized it to the nucleus. Immunofluorescence microscopy revealed this was not the case. The two highest expressing cell lines per construct, as determined by Western blot, were evaluated by IF and revealed that, as was seen in the transient transfections, only approximately 10% of cells were expressing the CTF2 fragment at detectable levels. Those cells that were expressing tended to localize the fragment to the nucleus, though non-exclusively, as was seen in transient transfections (Figure 14). The Pcdh-γc3-CTF2 stable cell line used for chIP-chip was our lowest expressing cell line by IF, with apparently less than 1% of the cells showing expression. Either very few cells in the stable colony are actually expressing CTF2, or expression is down-regulated in most cells. Either way, we feel that our chIP-chip positive probe regions are actually the result of noise and do not represent regions of CTF2 binding. In order to continue with our investigation of CTF2 binding we decided to re-derive all of our stable cell lines.

Fearing that our stable cell lines were non-homogenous as the result of contamination by cells with de-activated construct expression, or by cells that had escaped Geneticin selection in some other manner, we re-derived all of our stable cell lines using our non-homogenous initial stable cell lines as starting material and deriving colonies from single cells. Surprisingly, nearly all newly derived stable colonies exhibited the same approximately 10% expression phenotype (Figure 15). We could not find any re-derived Pcdhγc3-CTF2 stable colonies that had high expression. This may be due to the fact that such a small percentage of cells were expressing the fragment in the original batch of cells. Since we knew that each of these colonies was derived from a single cell, we hypothesized either that expression and nuclear localization were cell-cycle dependent or that the proteolysis of CTF2 fragments seen by other groups was also active in our system. To address these questions we used only our best expressing newly derived Pcdhα6-CTF2 stable cell line. As an ad hoc measure of whether expression and localization were regulated in a cell-cycle dependent manner, we plated stable cells at varying concentrations for visualization by immunofluorescence. At every concentration CTF2 expression was the same, whether the cells were very sparse or at complete confluence (Figure 16).

To determine whether proteolysis was a contributing factor to decreased fragment expression, we incubated the Pcdhα6-CTF2 stable colony with one of two proteasome inhibitors and then visualized the cells by immunofluorescence. Treatment with either proteasome inhibitor did not increase fragment expression or localization (Figure 17). Therefore, either there is a selective advantage for stable cells to repress fragment expression in vitro, or the observed pattern of CTF2 expression is biologically relevant. We did not tend to see clusters of cells expressing CTF2, but we did often see adjacent cells highly expressing CTF2 in the nucleus. Because other groups were able to exogenously express CTF2 homogenously in a cell population, we believe it is most likely that our construct was repressed by cells because its expression hindered cell growth, rather than the pattern being a biologically significant observation.


Figure 14: Initially derived stable cell lines show non-homogenous expression and localization of CTF2 fragments

[Image panels, each showing merged DAPI and anti-V5 staining: Parent HEK293 Line; Pcdhα6-CTF2 Stable HEK293 Line; Pcdhγc3-CTF2 Stable HEK293 Line.]

Figure 15: Rederived stable cell lines show non-homogenous expression and localization of CTF2 fragments

[Image panels. Panel labels: Rederived Pcdhα6 CTF2 Stable Cell Line (two panels); Rederived Pcdhγc3 CTF2 Stable Cell Line. Staining labels: Merge; Anti-V5; DAPI.]

Figure 16: Confluency has no effect on CTF2 fragment expression or localization

[Image panels. Panel labels: 32,000 Cells Plated; 64,000 Cells Plated; 10,000 Cells Plated. Staining labels: Merge; Anti-V5; DAPI.]

Figure 17: Proteasome inhibition has no effect on CTF2 expression or localization

[Image panels. Panel labels: Mock Treated and MG132 Treated; Mock Treated and Lactacystin Treated. Staining labels: Merge; Anti-V5; DAPI.]

ChIP-Seq reveals no evidence of CTF2-DNA binding

Although our stable cell lines were non-ideal, we decided to proceed with our best expressing Pcdhα6-CTF2 colony for use in chromatin immunoprecipitation, library construction and direct sequencing (chIP-Seq). DNA was present after amplification for library construction, and fragments in the correct size range were excised. Libraries constructed from immunoprecipitated and total chromatin from Pcdhα6-CTF2 stable cell lines were sequenced. While more than six million reads were collected for each library, more than enough for adequate genomic coverage, no significant enrichment was observed in immunoprecipitated libraries. Only 38 peaks were called, with an estimated false discovery rate of 37% (Table 3). For known transcription factors, several thousand peaks are generally called with false discovery rates around 1%. Looking at the reads aligned for the control library and IP library, in every case where an enriched peak was called in the IP library, similar peaks were observed in the control library (Figure 18). We therefore believe that none of the peaks we see represent actual enrichment by the CTF2 fragment. We cannot, however, distinguish between a failure of CTF2 fragments to bind to DNA and a failure of CTF2 fragments to immunoprecipitate, as we have no known positive binding sites. We also cannot exclude the possibility that the CTF2 fragment binds to complexes which affect transcription, but that immunoprecipitation of the complex is not possible under the current experimental conditions, possibly because steric hindrance limits the accessibility of the V5 epitope or alters the conformation required for any DNA binding by CTF2.
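To put these numbers in perspective, the following minimal sketch works through the arithmetic. The peak count and false discovery rate are taken from the text; the per-peak read counts, the pseudocount and the function names are hypothetical and are not the peak caller used here.

```python
def expected_false_positives(n_peaks, fdr):
    """Number of called peaks expected to be false at a given false discovery rate."""
    return n_peaks * fdr

def fold_over_control(ip_reads, control_reads, ip_total, control_total, pseudocount=1.0):
    """Library-size-normalized ratio of IP to control reads in one peak window."""
    ip_rate = (ip_reads + pseudocount) / ip_total
    control_rate = (control_reads + pseudocount) / control_total
    return ip_rate / control_rate

# 38 peaks at an estimated 37% FDR: roughly 14 are expected to be false.
print(round(expected_false_positives(38, 0.37), 2))

# Hypothetical window with similar read counts in both libraries of six million reads.
print(fold_over_control(ip_reads=99, control_reads=99,
                        ip_total=6_000_000, control_total=6_000_000))
```

A window in which the control library piles up as many reads as the IP library gives a ratio near 1, which is the situation described for all 38 called peaks.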

To investigate the role of the CTF2 fragment in the nucleus further, we feel that we would have needed to move to a different system using different expression vectors and possibly different cell lines. Because this process would have been a lengthy undertaking, we decided against further experimentation at this time.

Table 3: Table of significant peaks by ChIPSeq

Chromosome Position: 100 base pairs surrounding peak | Nearest gene (within 10 kb) | Peak Height
chr20:26138063-26138163 | Upstream of BC036544 | 4.025
chr20:26137765-26137865 | Within BC036544 | 18.476
chr12:36830032-36830132 | None | 21.819
chr5:174474313-174474413 | None | 21.657
chr12:126216407-126216507 | None | 13.344
chrY:10646073-10646173 | None | 21.319
chr16:46096645-46096745 | Intron of PHKB | 21.240
chr2:132752276-132752376 | None | 21.159
chr9:69254090-69254190 | None | 21.152
chr19:23975481-23975581 | None | 2.813
chr19:23046001-23046101 | None | 21.041
chr4:48820675-48820775 | None | 20.911
chr5:177945257-177945357 | Intron of COL23A1 | 20.877
chr10:42125492-42125592 | Intron of AK131313 | 20.814
chr14:87307467-87307567 | None | 2.104
chr16:33872924-33873024 | None | 20.776
chr17:19296249-19296349 | Upstream of CR602473 | 6.927
chr16:33873532-33873632 | None | 20.747
chr18:16771267-16771367 | None | 20.690
chr20:29277480-29277580 | None | 20.688
chrY:10645356-10645456 | None | 3.152
chr19:21457789-21457889 | Upstream of AK097381 | 20.675
chr2:132755338-132755438 | None | 20.662
chrY:10645018-10645118 | None | 20.648
chr1:121185648-121185748 | None | 20.614
chr1:121055328-121055428 | None | 20.608
chr5:46396825-46396925 | None | 20.581
chr10:38854613-38854713 | None | 20.570
chr6:133635733-133635833 | Intron of EYA4 | 20.569
chr10:41715179-41715279 | None | 20.560
chr16:33803472-33803572 | None | 20.550
chrY:11921026-11921126 | None | 3.581
chr14:89411110-89411210 | Intron of C14orf143 | 20.544
chr19:32562104-32562204 | None | 20.527
chr16:33873306-33873406 | None | 20.527
chr7:57990545-57990645 | None | 20.509
chr2:132739632-132739732 | Upstream of NCRNA00164 | 2.952
chr16:33866962-33867062 | None | 7.772

Figure 18: Significant peaks called are also present in the control library

Panel labels: Control Library Forward Reads; Control Library Reverse Reads; Pcdhα6-CTF2 IP Forward Reads; Pcdhα6-CTF2 IP Reverse Reads.

Genome browser shot of one enriched peak area. Peaks represent the number of sequenced reads that map to this region of the genome. While there are more reads in the 3’ region of the peak called in the IP sample, there is clearly a large number of reads in the control sample as well. This pattern is observed for all 38 of the called peaks.


Chapter 4: Prediction and Validation of REST/NRSF Binding Sites

REST/NRSF is an important regulator of neuronal fate. While much of transcription factor biology has focused on the role of activators, REST/NRSF is a well-studied repressor of transcription, canonically turning off neuronal genes in nonneuronal tissues. However, it is also known to function in neurons, pancreatic islet cells and cardiac tissue141, 189, 190. We believed that discovering the network of REST/NRSF targets would reveal important information about the function of REST/NRSF and possibly shed light on the mechanisms of target choice used by “master negative regulators”. The most extensive bioinformatic study to predict RE1/NRSEs was performed by Bruce et al., who identified 1,892 putative human RE1/NRSEs by searching for a core 17-nucleotide sequence derived from 32 known RE1/NRSE sequences134. Target genes were called by requiring that a gene have the core RE1/NRSE within 10 kb of its 5’ region. Using this criterion, 355 putative target genes were predicted134. Assigning Ensembl genome annotations revealed many genes known to be expressed in the nervous system and important for neuronal function, but also many others with no known neuronal-specific function134. Gel shifts and chromatin immunoprecipitation of a few evolutionarily conserved RE1/NRSEs were used to validate REST/NRSF binding to the putative sites134.

We felt that several areas of this analysis could be improved upon and sought to create our own set of predicted target genes. First, we believed we could improve on the site definition used to identify targets by incorporating more information than just a core sequence. Second, we sought a site prediction method that would rank putative binding sites, so that a threshold could be determined experimentally and give us a better idea of which predictions were likely to be true positives and which were more likely to be false positives.

In collaboration with Ali Mortazavi, we developed a genomics modeling program to predict target genes for transcription factors with large binding sites. We felt the development of such a model, outside of its utility for REST/NRSF target prediction, was important because the target genes of most transcription factors are completely unknown.

Our collaborators developed the computational framework for the binding site prediction model, the output of which we subsequently tested and validated. REST/NRSF is an ideal transcription factor for the testing of such a computational model, not only because of its fascinating biology, but because it has a long, well-characterized binding site which is known to operate in a location and orientation independent manner, and a large number of RE1/NRSE sites are known and experimentally validated for activity124, 125, 134.

Development of a transcription factor binding site prediction tool: Cistematic

Cistematic is an open-source set of software tools that allows the generation and refinement of a set of binding site predictions for a target transcription factor. It uses a two-part method of binding site prediction and requires, as input, a set of functionally validated known binding sites. In the first part, these sites are used to build a seed position specific frequency matrix (PSFM), which is then used to search multiple genomes for other conserved binding sites. The algorithm used to search these other genomes relies, by default, on the assumption that binding sites are likely to be embedded within a longer stretch of conserved sequence. This default setting can be removed in order to find binding sites that are themselves conserved but do not lie within a larger stretch of conservation.

The second part of the prediction builds a refined PSFM based on the predicted binding sites generated with the seed PSFM. The specific requirements used to build the PSFM can be selected by the user; various conservation and geography criteria can be chosen. Once this PSFM is built, a set of predicted binding sites can be generated by searching the genome(s) of choice, with inclusion criteria again chosen by the user. The use of a PSFM model instead of the traditional consensus site model allows for the definition of a more refined binding site and allows potential binding sites to be ranked by PSFM score. Further details on the Cistematic software can be found in Mortazavi et al., and the most recent Cistematic software can be downloaded at http://cistematic.caltech.edu/.
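The idea of a PSFM and of a match score expressed as a percentage of the maximal score can be illustrated with a short sketch. This is not the Cistematic implementation: the function names are hypothetical, the additive frequency score is an assumption made for illustration, and the four-base toy sites are not real RE1/NRSE sequences.

```python
BASES = "ACGT"


def build_psfm(sites, pseudocount=0.0):
    """Column-wise base frequencies from equal-length, aligned binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    psfm = []
    for i in range(length):
        column = [site[i] for site in sites]
        total = len(column) + 4 * pseudocount
        psfm.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return psfm


def match_score(psfm, candidate):
    """Score a candidate site as a percentage of the best achievable score.

    Assumption: the score is the sum of per-position frequencies; the scoring
    actually used by Cistematic is described in Mortazavi et al.
    """
    if len(candidate) != len(psfm):
        raise ValueError("candidate length must match the PSFM length")
    best = sum(max(column.values()) for column in psfm)
    observed = sum(column[base] for column, base in zip(psfm, candidate))
    return 100.0 * observed / best


# Toy aligned sites, for illustration only.
toy_sites = ["TCAG", "TCAG", "TCCG", "TAAG"]
psfm = build_psfm(toy_sites)
for candidate in ("TCAG", "TAAG", "GGGG"):
    print(candidate, round(match_score(psfm, candidate), 1))
```

Unlike a consensus string, which would only report match or mismatch, the frequency matrix gives intermediate scores to sites that differ from the consensus at tolerated positions, which is what makes ranking possible.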

Prediction of RE1/NRSE sites

The Cistematic software was used to predict REST/NRSF binding sites in order to validate the utility of Cistematic and to expand the repertoire of predicted RE1/NRSEs. Several individual well-validated seed RE1/NRSE sites from the human, mouse and dog genomes, as well as the entire set of 33 validated human RE1/NRSE sites together, were used to generate seed PSFMs in order to determine how sensitive the Cistematic software was to differences in input sites. Once a seed PSFM was generated, similar putative binding sites were identified by searching genomes for matches with the following characteristics: an 87.5% match to the seed PSFM, 85% similarity over a 25-65 base pair window, and presence in at least two of the three genomes queried (human, mouse, and dog). All matches in repetitive regions were discarded. The conserved motifs identified were used to derive a refined PSFM. All of the refined PSFMs generated from the various input validated RE1/NRSEs were very similar to each other, which gave us confidence that the software was generating a model that was close to the biological RE1/NRSE.

We then set a threshold score for inclusion of putative binding sites. First, we plotted the PSFM match scores for several known RE1/NRSE sites, as well as for some sites that resemble RE1/NRSEs but are known not to be bound by REST/NRSF. We chose 84% as our initial match threshold, as this value excluded all false positives (sites known not to bind) while including as many true positives as possible (Figure 19). Using previously published reporter assay values, we were able to plot match score against repression strength and found a significant correlation between these two values (R² = 0.82) (Figure 19)124. These data also support the use of a PSFM match score of approximately 84%. More importantly, they suggest that both PSFM values and repression strength are correlated with REST/NRSF binding strength at the associated predicted site.
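The threshold rule described above, excluding every known non-bound site while retaining as many known bound sites as possible, can be sketched in a few lines. The function names and the example scores are hypothetical; they are chosen only so that the toy result lands on 84 and are not the values plotted in Figure 19.

```python
def lowest_clean_threshold(bound_scores, unbound_scores):
    """Lowest bound-site score that still lies above every known non-bound site."""
    ceiling = max(unbound_scores)
    passing = [score for score in bound_scores if score > ceiling]
    if not passing:
        raise ValueError("no bound site scores above all non-bound sites")
    return min(passing)


def sensitivity(bound_scores, threshold):
    """Fraction of known bound sites retained at the given threshold."""
    kept = sum(1 for score in bound_scores if score >= threshold)
    return kept / len(bound_scores)


# Hypothetical PSFM match scores (percent of maximal score).
bound = [97.0, 92.5, 88.0, 84.0, 81.0]
unbound = [83.0, 79.5, 76.0]

threshold = lowest_clean_threshold(bound, unbound)
retained = sensitivity(bound, threshold)
print(threshold, retained)
```

In this toy case one known bound site falls below the cutoff, which mirrors the point made in the text that a threshold excluding all known negatives need not retain every true positive.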

The human, mouse and dog genomes were then searched for every match to the refined PSFM generated by using the SCG10 RE1/NRSE as a seed motif. Predicted sites had to meet our 84% inclusion threshold. For each predicted site, all genes within a 10-kb radius were annotated to the site as possible targets. It has been demonstrated that a single RE1/NRSE can silence multiple genes; we therefore did not want to include only the gene closest to a prediction, as we would likely miss other targets127. We then screened these putative targets and included only those genes that also had an RE1/NRSE within 10 kb of an orthologous gene in the mouse and/or dog. This genome search identified 660 genes with a putative RE1/NRSE within 10 kb. A second group of targets was collected that comprised genes with conserved PSFM matches located further than 10 kb from the gene model; these were not included in the initial analysis of the putative targets, as we believed they were more likely to be false positives.
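The two filtering steps described above, annotating every gene within 10 kb of a predicted site and then keeping only genes whose orthologs also lie near a predicted site in mouse and/or dog, can be sketched as follows. The gene names, coordinates and function names are hypothetical, and the sketch assumes that all coordinates lie on a single chromosome.

```python
def genes_near_site(site_pos, genes, radius=10_000):
    """Names of genes whose span lies within `radius` bp of a predicted site.

    `genes` maps a gene name to its (start, end) coordinates.
    """
    hits = []
    for name, (start, end) in genes.items():
        if start - radius <= site_pos <= end + radius:
            hits.append(name)
    return sorted(hits)


def conserved_targets(human_hits, ortholog_hits_by_species):
    """Keep human genes whose ortholog is near a site in at least one other species."""
    supported = set().union(*ortholog_hits_by_species.values())
    return sorted(gene for gene in human_hits if gene in supported)


# Hypothetical gene models and one hypothetical predicted site.
genes = {
    "GENE_A": (100_000, 120_000),
    "GENE_B": (128_000, 140_000),
    "GENE_C": (200_000, 210_000),
}
human_hits = genes_near_site(125_000, genes)

# Orthologs that also have a predicted site within 10 kb (made-up example).
ortholog_hits = {"mouse": {"GENE_A"}, "dog": set()}
targets = conserved_targets(human_hits, ortholog_hits)
print(human_hits, targets)
```

Because every gene inside the radius is returned rather than only the closest one, a single site can be annotated to several candidate targets, as the text intends.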


Figure 19: Selection of a threshold and PSFM match score is negatively correlated with promoter activity of REST/NRSF repressed genes

From: Comparative genomics modeling of the NRSF/REST repressor network: from single conserved sites to genome-wide repertoire. Mortazavi A, Leeper Thompson EC, Garcia ST, Myers RM, Wold B.

Blue triangles: RE1/NRSE sites known to be bound by REST/NRSF. Red circles: RE1/NRSE-like sites known not to be bound by REST/NRSF. (A) A threshold of 84% includes the most known positive sites while excluding all known false positives. (B) The PSFM score for known positive RE1/NRSE sites plotted against their repression in a transient reporter assay124. A value of 100% indicates no repression.

Validation of predicted RE1/NRSE sites

In order to determine whether our 84% threshold value was biologically relevant, we experimentally validated 113 putative target RE1/NRSEs of various PSFM scores. We performed chromatin immunoprecipitation (ChIP) on a cell line known to express REST/NRSF endogenously at high levels. The binding of REST/NRSF to candidate sites was evaluated by quantitative PCR. We calculated fold enrichment of putative sites by dividing the number of genomic equivalents of the potential NRSE, calculated from a standard curve, by the mean number of genomic equivalents of five random negative nongenic, nonconserved regions. If the enrichment was more than three standard deviations above the mean of the negatives, we concluded that the putative target site was, in fact, occupied in vivo by REST/NRSF. Our analysis revealed that 84% is a very reasonable threshold value for REST/NRSF, although it was clear that no threshold value could be chosen that would eliminate all false positives and false negatives (Figure 20). Of the 71 candidate sites that scored higher than 84%, 70 were positive by chIP/qPCR. Of the 42 sites that scored lower than 84%, only 13 were positive.
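The fold enrichment calculation and the three standard deviation criterion can be written out as a short sketch. The function names and the genomic equivalent values are hypothetical, and the sketch assumes that the comparison against the negatives is made on genomic equivalents using the sample standard deviation; the exact form used in the analysis is not specified beyond the description above. The two fractions at the end use the counts reported in the text.

```python
from statistics import mean, stdev


def fold_enrichment(site_equivalents, negative_equivalents):
    """Genomic equivalents at the site divided by the mean of the negative regions."""
    return site_equivalents / mean(negative_equivalents)


def is_occupied(site_equivalents, negative_equivalents, n_sd=3.0):
    """True if the site exceeds the negative mean by more than n_sd standard deviations."""
    cutoff = mean(negative_equivalents) + n_sd * stdev(negative_equivalents)
    return site_equivalents > cutoff


# Hypothetical genomic equivalents for five negative regions and two candidate sites.
negatives = [8.0, 10.0, 12.0, 9.0, 11.0]
strong_site = 120.0
weak_site = 13.0

print(fold_enrichment(strong_site, negatives), is_occupied(strong_site, negatives))
print(fold_enrichment(weak_site, negatives), is_occupied(weak_site, negatives))

# Validation counts reported in the text.
fraction_positive_above = 70 / 71   # sites scoring higher than 84%
fraction_positive_below = 13 / 42   # sites scoring lower than 84%
print(round(fraction_positive_above, 3), round(fraction_positive_below, 3))
```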

These chIP/qPCR experiments also revealed that PSFM match score is positively correlated with fold enrichment. This again may indicate that both values are correlated with binding strength, and may be due in part to the extra information that the PSFM model captures over typical consensus site calls. It is important to note that while this general positive correlation is significant, there is still high variability in the absolute values obtained for chIP enrichment. Many factors may determine fold enrichment, including epitope availability due to the presence of additional cofactors, the percentage of sites bound by the transcription factor, and binding strength.

We had previously believed that the biological significance of fold enrichment was low considering the number of experimental factors contributing. It was therefore surprising for us to learn that some aspects of binding strength appear to be captured quantitatively in the fold enrichment value.

As REST/NRSF is known to repress microRNAs, we searched for microRNAs in the UCSC microRNA registry that were within a 25-kb radius of a predicted NRSE138, 191. We identified 21 microRNAs belonging to 16 distinct families, 20 of which have already been implicated in neuronal differentiation138, 191. We assayed binding at the predicted RE1/NRSEs of 11 of these microRNAs by chIP/qPCR; 10 showed significant enrichment indicative of REST/NRSF binding. Two of these chIP/qPCR-positive microRNAs were previously reported to be regulated by REST/NRSF and are important for the transition from neural progenitors to mature neurons138. Possible candidate targets for our 21 microRNAs include REST/NRSF itself and the corepressor Co-REST.

When the putative target sites identified in two previously published papers are scored with the PSFM generated by Cistematic, these target sets are found to include many putative sites that score relatively poorly, less than 80%, using our PSFM model124, 134. Based on our chIP/qPCR data, these sites are unlikely to be true positives, and we conclude that the Cistematic software performs better at predicting true RE1/NRSE sites than the traditional site models used previously.


Figure 20: PSFM match score is correlated with fold enrichment by chIP/qPCR

From: Comparative genomics modeling of the NRSF/REST repressor network: from single conserved sites to genome-wide repertoire. Mortazavi A, Leeper Thompson EC, Garcia ST, Myers RM, Wold B.

The PSFM match score (expressed as a percentage of the maximal score) is plotted against the fold enrichment calculated from chIP/qPCR experiments. The green vertical line represents the 84% threshold value used for membership. The red horizontal line represents the fold enrichment value that is three standard deviations above the mean of the negative genomic regions used in chIP/qPCR.

Characterization of predicted RE1/NRSE sites

When distal sites, those located more than 10 kb away, are combined with the more proximal sites, we find that our putative RE1/NRSEs are still enriched within 5 kb of gene start sites, with 40% of them located in these regions. However, more than 25% of putative sites are more than 10 kb from any gene model. Analysis with the GO module within Cistematic showed that genes with RE1/NRSEs within 10 kb are significantly enriched for the GO categories “synaptic transmission”, “neurogenesis”, and “transporter activity”. This is consistent with previously published literature showing that REST/NRSF functions canonically as a repressor of neuronal genes in nonneuronal tissue116, 117.

We then used the GNF gene atlas module in Cistematic to examine the expression patterns of putative REST/NRSF target genes in 79 human tissues192. Clustering of the 660 putative targets revealed that brain tissues group together in two clusters that display an expression pattern specific to those tissues (Figure 21). The weights of representative objects from each cluster, medoids, are plotted in Figure 21b and show that brain tissues have a different pattern of expression than other tissues. Pancreatic islet cells and cardiac myocytes also show positive weights, and REST/NRSF is known to be involved in the repression of both pancreatic islet and some cardiac genes141, 189, 190. Approximately 40% of genes with associated RE1/NRSEs show enrichment for brain expression, as would be expected for a repressor that turns off neuronal genes in non-neuronal tissues. It is important to note, however, that most neuronal genes do not have a putative RE1/NRSE nearby.

That putative REST/NRSF targets are generally located close to transcription start sites, show significant enrichment of GO terms for neuronal genes, and show downregulation of expression in tissues with high REST/NRSF expression indicates that our predicted RE1/NRSE sites are not only capable of being bound by REST/NRSF but are also biologically functional. These findings are consistent with the known role of REST/NRSF as a potent repressor of neuronal genes in nonneuronal tissues. In addition to validating the canonical role of REST/NRSF, our analysis revealed a larger network of microRNA regulation than has previously been demonstrated, showed that while target genes are enriched for neuronal terms, a large number of target genes are not neuron-specific, and established that the values assigned to target sites by our new site prediction model are well correlated with experimental values for binding.


Figure 21: Brain-specific expression of putative REST/NRSF target genes

From: Comparative genomics modeling of the NRSF/REST repressor network: from single conserved sites to genome-wide repertoire. Mortazavi A, Leeper Thompson EC, Garcia ST, Myers RM, Wold B.


Putative REST/NRSF target genes, those genes within 10 kb of a predicted RE1/NRSE, with expression patterns listed in the GNF gene atlas were clustered. (A) The second and fifth clusters show brain-specific expression. (B) Weights of the medoids for cluster 2, brain tissues highlighted in black.


Chapter 5: A Common Deletion in the Maltase Glucoamylase gene

Recently, copy number variation has emerged as an important determinant of human health. The full spectrum of phenotypic outcomes has been associated with copy number variation, ranging from no detectable health effect to lethality193. Numerous copy number variants (CNVs) have been identified that cause or are associated with disease outcomes, including Down syndrome, mental retardation, velocardiofacial syndrome, HIV susceptibility, autism, multiple autoimmune diseases and many others (reviewed in 193-197). Until recently it was not possible to screen widely for copy number variants smaller than 50 kb, and studies looking for CNVs generally do not use sample sizes large enough to detect rarer variants195, 197. Because these small microdeletions and insertions are known to contribute to disease, it is important to catalog them in order to screen for their association with diseases.

In part due to advances in technology, it has become possible to identify CNVs on the very arrays normally used for SNP genotyping, because the data are highly quantifiable and the relatively large changes in signal from an expansion or deletion are easily visualized. In a screen for patterns of single-nucleotide polymorphism variation in 1000 individuals from 51 populations of the Human Genome Diversity Panel, multiple CNVs were identified162, 197 (unpublished data). Approximately 1200-2000 copy number variants were identified in multiple individuals, and an additional 20,000 de novo CNVs were found in single individuals. However, not all CNVs are detectable. The detectable variants are biased towards rarer events, since polymorphisms that frequently violate Mendelian ratios, as would be expected if they were contained within a copy number variant, were excluded from the Beadchips.

Nevertheless, a large maltase deletion removing approximately 30 kb of the largest maltase intron was identified in approximately 10% of the European samples assayed with multiple homozygous individuals found (unpublished data). The deletion was not detectable in any of the African populations. This deletion was highly probable, as the intensity values of all of the seven polymorphisms within this intron, but not those polymorphisms 5’ or 3’ to the intron, dropped dramatically on the array (unpublished data). Copy number variation in genes involved in starch metabolism which have been shown to functionally increase hydrolysis are more common in populations with high starch diets. The number of copies of salivary amylase (AMY1) has recently been shown to be highly correlated with dietary starch intake161. In fact, our genotyping platform had identified this amylase variation and we had begun to develop PCR genotyping assays concurrent with our work on maltase to determine whether variation in one or both of these genes was correlated with starch intake when this study was published. Because maltase seemed to show a population distribution similar to that of AMY1 we wanted to determine whether the large maltase intron deletion had any effect on maltase expression.
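The way a deletion shows up as a coordinated drop in intensity across consecutive polymorphisms can be sketched as follows. This is an illustration under stated assumptions and not the method used to call CNVs from the Beadchip data: the function names, the log2 ratio cutoff, the minimum run length and the intensity values are all hypothetical.

```python
from math import log2


def log_ratios(sample_intensity, reference_intensity):
    """Per-SNP log2 ratio of sample intensity to a reference intensity."""
    return [log2(s / r) for s, r in zip(sample_intensity, reference_intensity)]


def longest_low_run(ratios, cutoff=-0.5, min_snps=3):
    """(start, end) indices of the longest run of consecutive SNPs below cutoff.

    Returns None if no run has at least min_snps members. Both parameters are
    assumed values chosen for this illustration.
    """
    best = None
    start = None
    # A trailing +infinity sentinel closes a run that reaches the last SNP.
    for i, value in enumerate(list(ratios) + [float("inf")]):
        if value < cutoff:
            if start is None:
                start = i
        elif start is not None:
            run_length = i - start
            if run_length >= min_snps and (best is None or run_length > best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


# Made-up intensities for 11 consecutive SNPs; seven interior SNPs show
# roughly half the reference signal, as expected for a heterozygous deletion.
sample = [1.0, 1.1, 0.5, 0.45, 0.55, 0.5, 0.48, 0.52, 0.5, 0.95, 1.0]
reference = [1.0] * len(sample)

deleted_run = longest_low_run(log_ratios(sample, reference))
print(deleted_run)
```

Requiring several adjacent SNPs to drop together, while flanking SNPs stay at the reference level, is what distinguishes a deletion from noise at a single probe.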

Determination of deletion endpoints

The human MGAM gene comprises 48 exons spanning 82 kb on 7q34 198. The deletion appeared to remove most of the large intron between exons 39 and 40. The 5’ deletion start site is bounded by SNPs rs7797360 and rs4276595, located approximately 1.8 kb apart. The 3’ deletion end site is bounded by SNPs rs4726489 and rs4259331, located approximately 1.3 kb apart. We tiled forward primers within the 5’ bounded region and 1 kb 5’ and 3’ to the boundary, and reverse primers within the 3’ bounded region and 1 kb 5’ and 3’ to the boundary. We expanded our boundary definitions by 1 kb in either direction after initial failures with PCR primers designed within the limited boundaries. We performed PCR with short extension times so that, when using the primers to detect the deletion, only deleted alleles would give PCR products. We also designed primers within the deletion to detect non-deleted alleles. We began by testing primer pairs on a pool of human genomic DNA known to contain both MGAM alleles, but found that primers that worked well on this sample did not necessarily amplify well on individual samples. We therefore moved to screening potential primer pairs on a set of six samples from Coriell’s International HapMap Project sample database.

The non-deleted allele was reproducibly detectable, but we had significant difficulty identifying primer pairs to detect the deleted allele. Despite expanding our potential endpoint boundaries by 1 kb in each direction, varying PCR conditions, polymerases, and testing 17 forward primers and 12 reverse primers in every combination, we could not reproducibly genotype the deleted allele. A detailed search of the MGAM intron and surrounding sequence revealed a large repeat present at each end of the suspected deletion. The sequence of the putative start boundary and the putative end boundary align with 92% identity, meaning that many of our designed primers were mispriming. The tandem repeats are approximately 6 kb long and align at 80% identity over their length extending into both exons 39 and 40 as well as their intron. This revealed to us the likely source of the deletion: recombination between the two repeats.


With this knowledge we were then able to design and optimize a set of primer pairs that reliably genotyped the deletion.
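The logic of the two-reaction genotyping scheme, one primer pair that amplifies only the deleted allele and one that amplifies only the non-deleted allele, reduces to a small lookup. The function name and the genotype labels are hypothetical and serve only to make the decision rule explicit.

```python
def call_genotype(deleted_band, nondeleted_band):
    """Call an MGAM genotype from the presence or absence of two PCR products.

    deleted_band: product from primers spanning the deletion, which amplify
        only the deleted allele when short extension times are used.
    nondeleted_band: product from primers inside the deleted region, which
        amplify only the non-deleted allele.
    """
    if deleted_band and nondeleted_band:
        return "heterozygous"
    if deleted_band:
        return "homozygous deleted"
    if nondeleted_band:
        return "homozygous non-deleted"
    return "no call"  # neither reaction amplified, so the sample should be repeated


for bands in [(True, True), (True, False), (False, True), (False, False)]:
    print(bands, call_genotype(*bands))
```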

In order to map the endpoints of the deletion, we amplified, purified and sequenced the band produced by the primer pairs for the deleted allele. Because the deletion appears to have resulted from recombination between the two repeats, we were unable to determine its exact endpoints, but were able to narrow them down to a 58 base pair window. The window is defined by two base pairs that are polymorphic between the two repeats. The 5’ polymorphism in the deleted allele matches the 5’ repeat 83 base pairs downstream of the 5’ intron/exon junction, and the 3’ polymorphism matches the 3’ repeat 258 base pairs upstream of the 3’ intron/exon junction (Figure 22). Thus, the deletion occurs entirely within the large intron and does not disrupt any exonic sequence, although it does eliminate nearly the entire intron.

We genotyped a subset of samples from the Human Genome Diversity Panel using our PCR-based genotyping assay. Although we were blinded to the genotypes predicted from Beadchip hybridization at the time of the assay, all genotypes called by PCR matched the Beadchip genotypes (Figure 23). We were therefore confident in our ability to reliably detect both MGAM alleles.


Figure 22: Deletion endpoint schematic

Schematic labels: Exon 39; 6 kb repeat; 6 kb repeat; Exon 40.

Top: non-deleted allele. Bottom: deleted allele. Polymorphisms which help define break points are marked with asterisks (*). The 58 base pair window in which the deletion actually occurred is shown as variegated pink and blue, as the exact endpoints cannot be determined due to the extensive identity between the two repeats.


Figure 23: PCR genotyping correctly predicts array genotyping

Genotypes predicted by array data are shown in labels for the PCR lanes. All PCR genotyping results match predicted array genotypes.

MGAM genotype does not predict MGAM expression levels

While we now knew that the deletion did not remove any exonic sequence, it remained possible that the deletion eliminated a regulatory element, or perhaps a splice enhancing site. Splice enhancing sites generally occur very close to the exon that they serve to regulate, but can be located several hundred base pairs away199. Therefore, we sought to determine whether MGAM genotype correlated with MGAM expression levels in vivo. MGAM is not expressed well in lymphoblastoid cell lines, but is highly expressed in the kidney, salivary gland and intestine. This makes correlation of genotype and expression levels difficult, as all major public repositories of human samples are maintained as lymphoblastoid lines. We collaborated with Heather Wheeler to obtain primary human kidney DNA and RNA samples14. This allowed us to genotype samples which could then be correlated with MGAM expression levels. Kidney genotyping was performed on 70 samples, of which 68 gave reproducible genotypes. Of these 68, 19 were found to be heterozygous for the deletion; we observed no homozygous deleted individuals. Of the 70 individuals represented, 50 self-reported their race as “Caucasian”, 7 as “Asian”, 5 as “African-American”, and 8 declined to state. Of our heterozygotes, 16 were Caucasian, one was Asian and two declined to state. The HGDP samples indicated that approximately 10% of Europeans carried at least one deleted allele, while more than 30% of our Caucasian samples were heterozygotes. We observed no homozygotes for the deletion in our kidney DNA samples, although we would have expected some given the apparently high allele frequency in this sample. Consistent with the array results, we observed no deleted alleles in African-American samples.
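The number of homozygous deleted individuals expected under Hardy-Weinberg proportions can be worked out from the genotype counts given above. This is a minimal sketch with a hypothetical function name; it pools all 68 genotyped samples, which is a simplifying assumption because the samples come from groups with different allele frequencies.

```python
def hardy_weinberg_expected(n_individuals, n_het, n_hom_minor):
    """Minor allele frequency and expected genotype counts under Hardy-Weinberg."""
    q = (n_het + 2 * n_hom_minor) / (2 * n_individuals)
    p = 1.0 - q
    return {
        "q": q,
        "hom_major": n_individuals * p * p,
        "het": n_individuals * 2 * p * q,
        "hom_minor": n_individuals * q * q,
    }


# Counts from the text: 68 reproducible genotypes, 19 heterozygotes,
# no homozygous deleted individuals.
expected = hardy_weinberg_expected(n_individuals=68, n_het=19, n_hom_minor=0)
print({key: round(value, 3) for key, value in expected.items()})
```

Under these pooled assumptions the deleted allele frequency is about 14%, and roughly one homozygous deleted individual would be expected among 68 samples.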

We first explored the relationship between genotype and expression by plotting the genotype of kidney samples against their MGAM levels previously assessed by microarray14, 200. The collected kidney samples were derived from either the cortex or medulla. Most of our samples had expression data for both tissues, although some had only one or the other. For our cortex analysis we had 47 homozygous non-deleted individuals and 19 heterozygous individuals. We found no significant expression differences between the two genotypes (P = 0.89) (Figure 24). For our medulla analysis we had 40 homozygous non-deleted individuals and 19 heterozygous individuals. We again found no significant expression differences (P = 0.17), although the variances of the medulla samples were significantly different (P = 2 × 10⁻⁴) (Figure 24).
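A comparison of expression level and of expression variance between two genotype groups can be sketched with a permutation test. The statistical tests actually used to obtain the P values above are not stated, so this is a swapped-in technique for illustration only; the function names and the expression values are hypothetical.

```python
import random
from math import log
from statistics import mean, variance


def mean_difference(a, b):
    """Difference in mean expression between two groups."""
    return mean(a) - mean(b)


def log_variance_ratio(a, b):
    """Log ratio of sample variances; zero when the two variances are equal."""
    return log(variance(a) / variance(b))


def permutation_p_value(group_a, group_b, statistic, n_permutations=2000, seed=0):
    """Two-sided permutation P value for any two-group statistic."""
    rng = random.Random(seed)
    observed = abs(statistic(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(statistic(pooled[:n_a], pooled[n_a:])) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up expression values for two genotype groups (not data from this study).
non_deleted = [5.1, 6.8, 4.2, 7.5, 5.9, 3.8, 6.4, 7.1]
heterozygous = [5.6, 5.8, 5.7, 5.9, 5.5, 5.8]

p_mean = permutation_p_value(non_deleted, heterozygous, mean_difference)
p_var = permutation_p_value(non_deleted, heterozygous, log_variance_ratio)
print(p_mean, p_var)
```

Separating the test of means from the test of variances matches the structure of the result reported above, where the groups did not differ in mean expression but did differ in variance.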

Since MGAM has many exons and we were unsure of the placement of the probes on the array, we used the matched RNA samples provided to perform quantitative PCR on exons 16, 40 and 48 in cortex samples only. The deletion occurs within the intron between exons 39 and 40. We aimed to determine whether the expression of exons 3’ to the deletion was affected by the loss of MGAM’s large intron. The qPCR expression data recapitulated the array expression data: there were no significant differences between expression levels (P = 0.29, 0.06 and 0.39 for exons 16, 40 and 48, respectively), but the variance of the heterozygote expression levels was significantly smaller (P = 1 × 10⁻³, 2 × 10⁻⁴ and 1 × 10⁻⁴ for exons 16, 40 and 48, respectively) (Figure 25).

Finally, we used primers designed in the first and last MGAM exons to amplify cDNA from samples of heterozygotes and homozygous non-deleted individuals to determine if there were any detectable size differences in MGAM RNAs. Only a single band of equal size was found in both heterozygotes and homozygous non-deleted individuals (data not shown). It remains possible that small differences in RNA species not detectable by the methods we have used result in differences in protein levels or protein activity in individuals carrying the deletion. It is also possible that the deletion has no effect in the heterozygous form because of compensation by the non-deleted allele, but as we had no homozygous individuals we cannot test this hypothesis. It seems most likely that MGAM expression and activity are unaffected by the large deletion as the deletion is in Hardy-Weinberg equilibrium and at relatively high allele frequencies.


Figure 24: Genotype and MGAM expression in the cortex and medulla

MGAM expression levels in the cortex and medulla are not significantly different between heterozygotes and homozygous non-deleted individuals. Expression levels determined by expression arrays.


Figure 25: Expression of MGAM exons in the cortex

MGAM exon expression does not significantly differ between heterozygotes and homozygous non-deleted individuals.


Chapter 6: Conclusions

Gene conversion and the evolution of protocadherin diversity

The evolutionary evidence in support of frequent gene conversion events acting on the protocadherin gene cluster is strong: orthologous relationships are maintained only in some regions of protocadherin paralogs; paralog diversity, both neutral and non-neutral, is severely restricted to certain paralog regions; and third position GC content is increased in those regions lacking diversity49. However strong this evidence, it cannot substitute for actual detection of gene conversion events if knowledge of the frequency and functional consequences of such events is desired. Our sequencing of paralogs in the Pcdhα cluster provides direct evidence of gene conversion. We observed a striking preference for non-synonymous changes and a significant dampening of the transition:transversion ratio, both predicted if gene conversion, rather than mutation, were the cause. Several putative conversion events were identified, but as they represent single nucleotide variation between paralogs, we cannot determine for each putative event whether it was due to mutation or conversion. However, the finding that “mutations” that could be readily explained by paralog strand invasion are common gives further evidence that gene conversion is an important contributor to protocadherin cluster evolution.
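The transition:transversion ratio referred to above can be computed from a list of single nucleotide substitutions. In this sketch the function names are hypothetical and the substitutions are made-up examples, not the variants observed in the Pcdhα cluster.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-base change as a transition or a transversion."""
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    pair = {ref, alt}
    # Transitions stay within a chemical class (purine to purine, or
    # pyrimidine to pyrimidine); transversions cross between classes.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ts_tv_ratio(substitutions):
    """Number of transitions divided by number of transversions."""
    types = [substitution_type(ref, alt) for ref, alt in substitutions]
    transversions = types.count("transversion")
    if transversions == 0:
        raise ValueError("ratio undefined without any transversions")
    return types.count("transition") / transversions


# Made-up substitutions (reference base, alternate base).
example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
ratio = ts_tv_ratio(example)
print(ratio)
```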

The most striking example of gene conversion is the obvious Pcdhα4 conversion by Pcdhα8. The conversion event transferring sequence from Pcdhα8 to Pcdhα4 changes the highly conserved RGD sequence present in all Pcdhα first ectodomains (Figure 4). The RGD motif is known to function in protein-protein interactions and as such has been shown to be required for many of the putative interactions of the protocadherins with other proteins76, 109. The RGD motif of the Pcdhα4 reference sequence, the haplotype 1 sequence RGG, is different from every other mouse, human and rat Pcdhα paralog RGD sequence, where the aspartate residue is completely conserved 201. The aspartate residue is thought to be the essential residue for the putative association of Pcdhα proteins with Reelin109. Haplotype 1 is the minor Pcdhα4 allele, and is present on 40% of human chromosomes. If the RGG sequence is unable to bind target molecules, like Reelin, then 40% of the population carries an at least partially non-functional Pcdhα4 allele, and many homozygotes were identified that would carry no completely functional Pcdhα4 alleles. It is possible, then, that the Pcdhα4-Pcdhα8 conversion event repairs the non-functional Pcdhα4 allele so that it is able to bind Reelin or other target interactors. It is unlikely that the haplotype 1 Pcdhα4 protein is unable to interact with other protocadherin paralogs, as the RGD motif has never been implicated in these interactions. Furthermore, the high frequency of the deficient Pcdhα4 allele indicates that the mutation must be effectively neutral4. This may reflect redundancy among protocadherin paralogs such that the loss of a single paralog has no significant effect on fitness.

A large deletion is known to be present in the Pcdhα cluster and may functionally interact with the converted allele. A 16.7 kb deletion that truncates Pcdh-α8 and completely removes Pcdh-α9 and -α10 is present at an 11% allele frequency in Europeans, and at high frequencies in all other populations sampled4. This deletion arose on a derivative of haplotype 2, one of the two major Pcdhα haplotypes, and is in Hardy-Weinberg equilibrium4. The deletion has been analyzed as a candidate for neurological disorders including autism, schizophrenia and bipolar disorder, but no association has been reported4, 103. Given its widespread distribution and Hardy-Weinberg equilibrium, it is likely to be neutral; any phenotypic effect that it might produce must have only a limited effect on reproductive fitness4.
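As an illustration of what "in Hardy-Weinberg equilibrium" means operationally, the sketch below runs a chi-square goodness-of-fit test on genotype counts. The counts are hypothetical, chosen only so that the deletion allele frequency equals the 11% quoted above; they are not the published data, and the function name is an assumption of this sketch.

```python
# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_1DF_05 = 3.841


def hwe_chi_square(n_ref_ref, n_ref_del, n_del_del):
    """Return (deletion allele frequency, chi-square statistic) for
    observed genotype counts tested against Hardy-Weinberg expectations."""
    n = n_ref_ref + n_ref_del + n_del_del
    if n == 0:
        raise ValueError("no genotypes supplied")
    q = (2 * n_del_del + n_ref_del) / (2 * n)  # deletion allele frequency
    p = 1.0 - q
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_ref, n_ref_del, n_del_del)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return q, chi2


# Hypothetical counts for 100 individuals (NOT data from the cited study).
q, chi2 = hwe_chi_square(79, 20, 1)
consistent_with_hwe = chi2 < CRITICAL_1DF_05
```

A statistic below the critical value means the genotype counts give no evidence of departure from equilibrium, which is the pattern the neutrality argument relies on. With expected counts as small as the homozygous-deletion class here, an exact test would normally be preferred over the chi-square approximation.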

The Pcdhα4-Pcdhα8 conversion event predates the large α8-α10 deletion event. The deletion event occurred on haplotype 2; therefore every person carrying the α8-α10 deletion also carries the Pcdhα4-Pcdhα8 conversion. This means that the deletion of Pcdhα8 occurred on a genetic background in which gene conversion had transferred a major functioning piece of the Pcdhα8 protein onto another paralog lacking that function201. It is possible, then, that the deletion was only tolerated because functional constraint on Pcdhα8 had been lowered as a result of the conversion event. The deletion is never seen on a haplotype that does not contain the converted Pcdhα4 allele. This may be because the deletion is present at a low frequency (estimated 5% worldwide frequency) and there is little recombination between Pcdhα4 and Pcdhα9 4. However, it is also possible that the combination of the defective Pcdhα4 allele with the loss of Pcdhα8-α10 is deleterious.

The protocadherin gene family, as a large array of paralogs, necessitates invocation of a rather complicated model of evolution. Different vertebrate species have vastly different complements of protocadherins, with entire sub-groups of protocadherins (Pcdh-α, -β, -γ, -δ, -ε, -µ, and -ν) being lost in some lineages46. Adding to this diversity among species, gene conversion and rapid diversification act simultaneously to homogenize paralogs and to increase cluster diversity. These seemingly opposite forces co-exist because functional constraint differs at different protocadherin residues201. Some residues are necessarily identical among all paralogs because they are required for the structure and function of the protocadherin molecule. Other residues, like the RGD motif in Pcdhα EC1, are conserved only between members of a paralog subgroup and may confer subgroup-specific properties. Residues not yet identified surely contribute to binding specificity and are hypothesized to be maintained in such a way as to increase diversity.

If we imagine that the protocadherins are really the “lock and key molecules” necessary for conferring neuronal diversity, then the specific associations that the protocadherins make with each other or with other molecules are not important, so long as the overall diversity of the interactions is maintained. Under this model, gene conversion events and non-synonymous mutations that do not disrupt a unique homophilic or heterophilic association would be functionally neutral and may accumulate over time, contributing to the deterioration of orthologous sequence similarity. Conversion events like the Pcdhα4-Pcdhα8 event, which appear to repair or add functionality to paralogs, would be at least neutral as well, provided they do not reduce overall diversity201. Such events could even provide a selective advantage and increase in allele frequency over time. Conversion events and non-synonymous mutations that create functionally redundant paralogs, by homogenizing regions that need to be diverse in order to contribute to protocadherin diversity, would be hypothesized to be deleterious and would be removed over time201.

We imagine that during the course of protocadherin cluster evolution tandem duplication creates functionally redundant paralogs that are free to accumulate mutations that increase cluster diversity. These diversity-increasing events could include both nonsynonymous mutations and gene conversion events that transfer functionality from one paralog to another. Deletion of any one of these paralogs would then not result in a decrease in overall diversity, as duplication and divergence would simultaneously be acting to increase diversity201. This makes identification of gene conversion events more difficult, as loss of the donor molecule would not be disadvantageous. Therefore, many more conversion events will be present than we are capable of identifying.

When I began this project it was unclear if the two Pcdhα cluster haplotypes differed only at the identified promoter SNPs and were maintained at such high frequency because the haplotypes had different expression profiles, or if there were coding sequence differences between the haplotypes where selection maintained both haplotypes because of functional differences between the paralogs. My work demonstrated that there are at least twelve coding differences between the two haplotypes, that eight of these changes are nonsynonymous, although conservative, substitutions, and that at least one gene conversion event with a putatively functional effect distinguishes between the two haplotypes. Other large scale sequencing efforts have also identified coding differences between haplotypes that are biased towards diversifying mutations101. We hypothesize, then, that the two haplotypes are maintained not because of expression differences caused by promoter polymorphism, but because of functional differences between diversified paralogs between the two haplotypes.

More expansive sequencing efforts are sure to reveal more obvious gene conversion events, and are likely to contribute additional data in support of widespread gene conversion in the protocadherin clusters. However, without information about the molecular function of these molecules, it will remain difficult to determine the functional effect of these gene conversion events.


Nuclear localization of protocadherin fragments

Multiple groups have confirmed the presence of the protocadherin CTF2 fragment in vitro and in vivo in multiple organisms78, 79, 80, 98, 180. Nevertheless, besides mouse knockout experiments which have demonstrated the necessity of the CTF2 fragment for protocadherin function, no further information has emerged regarding the putative function of the fragment. Our results indicate that exogenously expressed CTF2 fragment does localize to the nucleus, as previously described. We find this nuclear localization to be non-exclusive, with significant localization outside of the nucleus. Differences between the in vitro systems used to study the fragment appear to influence fragment behavior. Some groups report that even transiently expressed CTF2 fragment is highly unstable, with proteasome inhibition necessary to detect exogenously expressed fragment. Others report no such instability. We found that inhibition of the proteasome had no visible effect on fragment expression. While some groups were able to generate apparently homogeneous cell lines stably expressing CTF2 fragment, we were unable to achieve this. We were also unable to replicate the reported finding that CTF2 expression increases expression from other protocadherin promoters. Whether this is a true failure to replicate, or is due to the relatively low percentage of cells within our stable cell lines actually found to be expressing the fragment, is unknown.

We also cannot determine the true meaning of our finding that protocadherin CTF2 fragments fail to immunoprecipitate with DNA. This is likely partially due to our failure to produce homogeneous cell lines, but may also reflect a failure of our antibody to recognize the necessary tagged epitope present on our fragments. This failure to detect CTF2, if not purely due to experimental error, may be indicative of one of several biologically relevant outcomes. It is possible that the CTF2 fragment does not participate in binding to DNA whatsoever. Because the CTF2 fragment is small, it may localize to the nucleus without active localization via NLS or other sequences. Whether the function of CTF2 relies on its nuclear localization is unknown. If CTF2 does participate in DNA binding, the context in which that happens is also unclear. As the CTF2 fragment itself does not have any domains resembling known DNA binding domains, it seems most likely that any association between CTF2 and DNA would be mediated by at least one other co-factor. While yeast two-hybrid studies using the intracellular domain of the protocadherins have been performed, no findings that would shed light on the nuclear function of the fragment have been published.

A necessary experiment that, to our knowledge, has not yet been published is to test whether expression of the CTF2 fragment in complete knockout or CTF2 knockout mice, or in neurons derived from these animals, would at least partially rescue the apoptotic phenotype. While it is likely that the generation of the CTF2 fragment must be under tight spatial and temporal control, even a slight amelioration of symptoms would indicate that at least some of the intracellular function of the protocadherins is carried out by the CTF2 fragment.

REST/NRSF binding site prediction

Cistematic allows the prediction of the entire network of transcription factor binding sites starting from a single conserved binding site under a PSFM model of prediction. In contrast to consensus motif designs, Cistematic can rank putative matches against the model site. We were able to validate the Cistematic software by experimentally testing predicted binding sites of various ranks, allowing the experimental determination of an appropriate threshold score for site inclusion. Because Cistematic incorporates more information into its model site, the resulting score is more indicative of biological relevance, at least for REST/NRSF. Sites with high PSFM scores were more likely to be validated as bound by REST/NRSF, and their scores correlated with experimentally determined repression activity. Analysis of GO term enrichment and RNA expression data indicated that the canonical function of REST/NRSF, as a master repressor of neuronal fate in non-neuronal tissues, could have been guessed from these computational data alone.
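The ranking idea described above can be sketched in a few lines. This is a minimal illustration of frequency-matrix scoring in general, not Cistematic's actual implementation: the normalization to a 0-100 scale, the function names, and the six-base toy sequences (which are not real RE1/NRSE sites) are all assumptions made for the example.

```python
from collections import Counter

BASES = "ACGT"


def build_psfm(sites):
    """Build a position-specific frequency matrix from aligned sites
    of equal length. Each column maps base -> observed frequency."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    psfm = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        psfm.append({base: counts[base] / len(sites) for base in BASES})
    return psfm


def score_site(psfm, candidate):
    """Score a candidate as a percentage of the best attainable score
    (the 0-100 scaling is an assumption of this sketch)."""
    if len(candidate) != len(psfm):
        raise ValueError("candidate length must match the matrix")
    raw = sum(col.get(base, 0.0) for col, base in zip(psfm, candidate))
    best = sum(max(col.values()) for col in psfm)
    return 100.0 * raw / best


def rank_sites(psfm, candidates):
    """Return (score, sequence) pairs sorted from best to worst match."""
    return sorted(((score_site(psfm, c), c) for c in candidates), reverse=True)


# Hypothetical aligned "known" sites, for illustration only.
KNOWN_SITES = ["ACGTAC", "ACGTAC", "ACGTTC", "ACGAAC"]
psfm = build_psfm(KNOWN_SITES)
ranked = rank_sites(psfm, ["TTTTTT", "ACGTTC", "ACGTAC"])
```

Unlike a consensus string, which can only report match or mismatch, the matrix assigns the one-mismatch candidate an intermediate score, which is what makes thresholding and ranking possible.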

The power of the Cistematic software lies in its ability to incorporate all of the information provided by having a set of known positive binding sites. However, determination of the proper threshold value is critical for obtaining the strongest set of predictions. If the Cistematic software were to be used for transcription factors with less well-defined binding sites, especially if known positives have not been extensively studied, it would be difficult to predict what threshold value should be used. This is a drawback of the PSFM model, as threshold values must be determined empirically in order to minimize false positives and false negatives. Therefore, other transcription factors for which Cistematic would be useful for determining binding site networks would have to have long and specific binding sites, show evolutionary conservation, and have quantitative functional analyses available, either before or after site prediction, that would allow a relevant threshold for site inclusion to be set.
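One simple way to set such a threshold empirically is to scan candidate cutoffs over experimentally tested sites and keep the one that produces the fewest combined errors. The sketch below assumes this equal-weight error criterion, which is an illustrative choice and not necessarily the procedure used in this work; the scores and validation outcomes are hypothetical.

```python
def choose_threshold(scored_sites):
    """Pick the score cutoff minimizing false positives + false negatives.

    scored_sites: list of (score, is_bound) pairs, where is_bound is the
    experimental outcome. A site is predicted bound if score >= cutoff.
    Returns (cutoff, total_errors, false_positives, false_negatives).
    Ties are resolved in favor of the lowest cutoff.
    """
    best = None
    for cutoff in sorted({score for score, _ in scored_sites}):
        fp = sum(1 for s, bound in scored_sites if s >= cutoff and not bound)
        fn = sum(1 for s, bound in scored_sites if s < cutoff and bound)
        if best is None or fp + fn < best[1]:
            best = (cutoff, fp + fn, fp, fn)
    return best


# Hypothetical validation results: (PSFM score, bound in the assay?).
VALIDATION = [
    (98, True), (95, True), (91, True), (89, False), (88, True),
    (85, True), (80, False), (75, False), (70, False),
]
threshold, errors, false_pos, false_neg = choose_threshold(VALIDATION)
```

With these made-up values the scan settles on a cutoff of 85, accepting one false positive to avoid losing two true sites. Weighting false positives and false negatives differently would shift the cutoff, which is why the choice depends on having functional data for the factor in question.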


It is known that REST/NRSF has a wider function than just turning off neuronal genes in non-neuronal tissues. It acts as a repressor in neurons, the pancreas and the heart, and many of its target genes are not neuron-specific121, 122, 123, 141, 189, 190. Our results reflect these findings as well. The majority of genes with neuronal GO classifications are not bound by REST/NRSF, nor are most genes with brain expression. Thus most neuronal genes rely on other transcription factors to control their shut-down in non-neuronal tissues. This may be mediated by a regulatory cascade in which REST/NRSF represses targets whose loss in turn leads to the downregulation of other genes, but this mechanism has not yet been elucidated. Consistent with this hypothesis, multiple transcription factors have associated NRSE sites.

Based on the finding that REST/NRSF represses microRNAs, as well as the prediction that the targets of those microRNAs include REST/NRSF and some of its co-factors, we hypothesize the existence of a feedback mechanism that regulates REST/NRSF expression and the transition into the neuronal fate. We hypothesize that in stem cells and neural progenitors the high expression of REST/NRSF represses its neuronal target genes and microRNAs. This lack of microRNA expression allows the targets of these microRNAs, including REST/NRSF and CoREST, to remain high, which helps to maintain target repression. Upon differentiation REST/NRSF is known to be posttranslationally downregulated, allowing expression of neuronal target genes and microRNAs. These microRNAs then act to downregulate their own targets, including REST/NRSF and CoREST, maintaining the low levels of REST/NRSF initiated by posttranslational modification. This system creates a feed-forward loop that would allow for quick downregulation of REST/NRSF mRNA as soon as the protein is posttranslationally modified, in order to allow a quick progression of the cell into differentiation.

Further genome-wide studies have been performed on REST/NRSF since the publication of our paper. It was one of the first transcription factors to be used successfully in ChIPSeq12. This analysis revealed 1946 binding sites in the genome. Comparison with the Cistematic predictions demonstrated that strongly scoring sites, those with scores greater than 90, are nearly always detected as being occupied by REST/NRSF using ChIPSeq12. More than 75% of the sites detected by ChIPSeq matched the canonical RE1/NRSE. Many more appear to contain the two half sites of the canonical motif separated by a larger spacer sequence; these sites could not have been detected by Cistematic. It was again demonstrated that binding of REST/NRSF was associated with a decrease in transcription at that promoter, that GO terms for genes associated with REST/NRSF binding are enriched for neuronal functions, that REST/NRSF is bound at putative microRNAs, and that these microRNAs integrate into an autoregulatory feedback mechanism12. Novel findings of the study included a putative pancreatic network of gene repression12.
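The comparison between predicted sites and ChIPSeq occupancy amounts to an interval-overlap calculation. The following minimal sketch shows one way to compute the fraction of strongly scoring predictions that fall within a ChIPSeq peak; the coordinates, scores, data layout and function names are hypothetical and are not taken from the cited study.

```python
def overlaps(site, peak):
    """True if two (chrom, start, end) half-open intervals on the same
    chromosome share at least one base."""
    return site[0] == peak[0] and site[1] < peak[2] and peak[1] < site[2]


def fraction_occupied(predicted, peaks, min_score):
    """Fraction of predicted sites scoring above min_score that overlap
    at least one ChIPSeq peak. Returns 0.0 if no site passes the cutoff."""
    strong = [p for p in predicted if p["score"] > min_score]
    if not strong:
        return 0.0
    hits = sum(
        1 for p in strong if any(overlaps(p["interval"], pk) for pk in peaks)
    )
    return hits / len(strong)


# Hypothetical predicted sites and peaks, for illustration only.
PREDICTED = [
    {"interval": ("chr1", 100, 121), "score": 97.0},
    {"interval": ("chr1", 500, 521), "score": 93.5},
    {"interval": ("chr2", 40, 61), "score": 91.0},
    {"interval": ("chr2", 900, 921), "score": 72.0},
]
PEAKS = [("chr1", 50, 250), ("chr1", 480, 600), ("chr3", 0, 1000)]

occupied = fraction_occupied(PREDICTED, PEAKS, min_score=90)
```

The linear scan over peaks is adequate for a few thousand binding sites; a sorted-interval or tree-based lookup would be the usual choice at larger scale.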

The addition of information about the differences in REST/NRSF target sites between tissues and at different points in development will elucidate not only the specific role of this important neuronal regulator, but will also give us some of the first glimpses of whether transcription factor target choice is fine tuned throughout development and organs, or whether drastically different targets are used spatially and temporally.


Maltase deletion mapping

Maltase serves as an important regulator of starch metabolism by responding to starch levels and preventing dangerous sudden increases in blood glucose. As MGAM is substrate responsive, it would be predicted to be shut off more quickly in populations with a high starch diet. We found a large, common deletion in the MGAM gene that was restricted to populations with high starch diets. We hypothesized that if this deletion led to decreased MGAM activity, then it would have a negative impact on fitness in populations where starch was limiting. This might explain the limitation of the deletion to high starch diet peoples. We were able to map the MGAM deletion to the largest intron of the MGAM gene and demonstrated that the deletion was entirely contained within that intron. We show that two large repeats overlap the exon/intron boundaries and that the deletion likely occurred as a result of recombination between the two repeats. We found no differences in RNA expression levels or RNA length, and therefore conclude that the deletion is likely to have little to no effect on MGAM expression in heterozygous form.

We were unable to obtain matched kidney DNA and RNA samples from any homozygous individuals, although the initial description of the deletion by SNP array found multiple homozygotes. It is possible that compensation by the wildtype allele accounts for the lack of difference in expression levels between homozygous non-deleted and heterozygous individuals. It does appear that the variance in expression levels is much lower in heterozygotes than in homozygous non-deleted individuals, although the significance of this finding is unknown. It is possible that further studies with matched samples that include homozygotes would find an appreciable difference in RNA levels between non-deleted and deleted individuals. However, because the deleted allele is at high frequency and in Hardy-Weinberg equilibrium in our initial samples, it seems most likely that the deletion occurred in the European lineage and its restriction to these individuals is merely a reflection of its relatively late appearance on the evolutionary scale.


Works Cited

1. Li, J. et al. Lack of association between HoxA1 and HoxB1 gene variants and autism in 110 multiplex families. Am. J. Med. Genet 114, 24-30 (2002). 2. Risch, N. et al. A genomic screen of autism: evidence for a multilocus etiology. Am. J. Hum. Genet 65, 493-507 (1999). 3. Spiker, D. et al. Genetics of autism: characteristics of affected and unaffected children from 37 multiplex families. Am. J. Med. Genet 54, 27-35 (1994). 4. Noonan, J.P. et al. Extensive linkage disequilibrium, a common 16.7-kilobase deletion, and evidence of balancing selection in the human protocadherin alpha cluster. Am. J. Hum. Genet 72, 621-635 (2003). 5. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8, 186-194 (1998). 6. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8, 175-185 (1998). 7. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res 8, 195-202 (1998). 8. Nickerson, D.A., Tobe, V.O. & Taylor, S.L. PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res 25, 2745-2751 (1997). 9. Cooper, S.J., Trinklein, N.D., Nguyen, L. & Myers, R.M. Serum response factor binding sites differ in three human cell types. Genome Research 17, 136-144 (2007). 10. Trinklein, N.D., Chen, W.C., Kingston, R.E. & Myers, R.M. Transcriptional regulation and binding of heat shock factor 1 and heat shock factor 2 to 32 human heat shock genes during thermal stress and differentiation. Cell Stress Chaperones 9, 21-28 (2004). 11. Mortazavi, A., Thompson, E.C.L., Garcia, S.T., Myers, R.M. & Wold, B. Comparative genomics modeling of the NRSF/REST repressor network: From single conserved sites to genome-wide repertoire. Genome Research 16, 1208-1221 (2006). 12. Johnson, D.S., Mortazavi, A., Myers, R.M. & Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497-1502 (2007). 13. Valouev, A. et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods 5, 829-834 (2008). 14. Wheeler, H.E. et al. Genomic convergence reveals common variants associated with gene expression and aging in the human kidney. (2009). 15. Sperry, R.W. Chemoaffinity in the orderly growth of nerve fiber patterns and connections. Proc Natl Acad Sci U S A 50, 703-10 (1963). 16. Redies, C. Cadherins in the central nervous system. Prog Neurobiol 61, 611-48 (2000).


17. Shapiro, L. & Colman, D.R. The diversity of cadherins and implications for a synaptic adhesive code in the CNS. Neuron 23, 427-30 (1999). 18. Hamada, S. & Yagi, T. The cadherin-related neuronal receptor family: a novel diversified cadherin family at the synapse. Neurosci Res 41, 207-15 (2001). 19. Serafini, T. Finding a partner in a crowd: neuronal diversity and synaptogenesis. Cell 98, 133-136 (1999). 20. Takeichi, M. Cadherin cell adhesion receptors as a morphogenetic regulator. Science 251, 1451-5 (1991). 21. Vleminckx, K. & Kemler, R. Cadherins and tissue formation: integrating adhesion and signaling. Bioessays 21, 211-20 (1999). 22. Uemura, T. The cadherin superfamily at the synapse: more members, more missions. Cell 93, 1095-8 (1998). 23. Nollet, F., Kools, P. & van Roy, F. Phylogenetic analysis of the cadherin superfamily allows identification of six major subfamilies besides several solitary members. J Mol Biol 299, 551-72 (2000). 24. Redies, C. Cadherins and the formation of neural circuitry in the vertebrate CNS. Cell Tissue Res 290, 405-13 (1997). 25. Inoue, T. et al. Role of cadherins in maintaining the compartment boundary between the cortex and striatum during development. Development 128, 561-9 (2001). 26. Treubert-Zimmermann, U., Heyers, D. & Redies, C. Targeting axons to specific fiber tracts in vivo by altering cadherin expression. J Neurosci 22, 7617-26 (2002). 27. Fannon, A.M. & Colman, D.R. A model for central synaptic junctional complex formation based on the differential adhesive specificities of the cadherins. Neuron 17, 423-34 (1996). 28. Frank, M. & Kemler, R. Protocadherins. Curr Opin Cell Biol 14, 557-62 (2002). 29. Sano, K. et al. Protocadherins: a large family of cadherin-related molecules in central nervous system. EMBO J 12, 2249-2256 (1993). 30. Strehl, S., Glatt, K., Liu, Q.M., Glatt, H. & Lalande, M. Characterization of two novel protocadherins (PCDH8 and PCDH9) localized on human chromosome 13 and mouse chromosome 14. 
Genomics 53, 81-89 (1998). 31. Wu, Q. & Maniatis, T. A striking organization of a large family of human neural cadherin-like cell adhesion genes. Cell 97, 779-90 (1999). 32. Ludwig, D. et al. cDNA cloning, chromosomal mapping, and expression analysis of human VE-Cadherin-2. Mamm. Genome 11, 1030-1033 (2000). 33. Wolverton, T. & Lalande, M. Identification and characterization of three members of a novel subclass of protocadherins. Genomics 76, 66-72 (2001). 34. Blanco, P., Sargent, C.A., Boucher, C.A., Mitchell, M. & Affara, N.A. Conservation of PCDHX in mammals; expression of human X/Y genes predominantly in brain. Mamm. Genome 11, 906-914 (2000). 35. Yoshida, K. & Sugano, S. Identification of a novel protocadherin gene (PCDH11) on the human XY homology region in Xq21.3. Genomics 62, 540-543 (1999). 36. Kohmura, N. et al. Diversity revealed by a novel family of cadherins expressed in neurons at a synaptic complex. Neuron 20, 1137-51 (1998). 37. Wu, Q. & Maniatis, T. Large exons encoding multiple ectodomains are a characteristic feature of protocadherin genes. Proc Natl Acad Sci U S A 97, 3124-9 (2000).


38. Kohmura, N. et al. Diversity revealed by a novel family of cadherins expressed in neurons at a synaptic complex. Neuron 20, 1137-51 (1998). 39. Wang, X. et al. Gamma protocadherins are required for survival of spinal interneurons. Neuron 36, 843-54 (2002). 40. Zou, C., Huang, W., Ying, G. & Wu, Q. Sequence analysis and expression mapping of the rat clustered protocadherin gene repertoires. Neuroscience 144, 579-603 (2007). 41. Esumi, S. et al. Monoallelic yet combinatorial expression of variable exons of the protocadherin-alpha gene cluster in single neurons. Nat Genet 37, 171-6 (2005). 42. Kaneko, R. et al. Allelic gene regulation of Pcdh-alpha and Pcdh-gamma clusters involving both monoallelic and biallelic expression in single Purkinje cells. J Biol Chem 281, 30551-60 (2006). 43. Frank, M. et al. Differential expression of individual gamma-protocadherins during mouse brain development. Mol Cell Neurosci 29, 603-16 (2005). 44. Jontes, J.D. & Phillips, G.R. Selective stabilization and synaptic specificity: a new cell-biological model. Trends Neurosci 29, 186-91 (2006). 45. Wang, X., Su, H. & Bradley, A. Molecular mechanisms governing Pcdh-gamma gene expression: evidence for a multiple promoter and cis-alternative splicing model. Genes Dev 16, 1890-905 (2002). 46. Yu, W. et al. Elephant shark sequence reveals unique insights into the evolutionary history of vertebrate genes: A comparative analysis of the protocadherin cluster. Proc Natl Acad Sci U S A 105, 3819-24 (2008). 47. Noonan, J.P. et al. Coelacanth genome sequence reveals the evolutionary history of vertebrate genes. Genome Res 14, 2397-405 (2004). 48. Tada, M.N. et al. Genomic organization and transcripts of the zebrafish Protocadherin genes. Gene 340, 197-211 (2004). 49. Noonan, J.P., Grimwood, J., Schmutz, J., Dickson, M. & Myers, R.M. Gene conversion and the evolution of protocadherin gene cluster diversity. Genome Res 14, 354-66 (2004). 50. Wu, Q. 
Comparative genomics and diversifying selection of the clustered vertebrate protocadherin genes. Genetics 169, 2179-88 (2005). 51. Yu, W., Yew, K., Rajasegaran, V. & Venkatesh, B. Sequencing and comparative analysis of fugu protocadherin clusters reveal diversity of protocadherin genes among teleosts. BMC Evol Biol 7, 49 (2007). 52. Wu, Q. et al. Comparative DNA sequence analysis of mouse and human protocadherin gene clusters. Genome Res 11, 389-404 (2001). 53. Takei, Y. et al. Two novel CNRs from the CNR gene cluster have molecular features distinct from those of CNR1 to 8. Genomics 72, 321-330 (2001). 54. Hill, E., Broadbent, I.D., Chothia, C. & Pettitt, J. Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster. J Mol Biol 305, 1011-24 (2001). 55. Sasakura, Y. et al. A genomewide survey of developmentally relevant genes in Ciona intestinalis. X. Genes for cell junctions and extracellular matrix. Dev. Genes Evol 213, 303-313 (2003). 56. Noda, T. & Satoh, N. A comprehensive survey of cadherin superfamily gene expression patterns in Ciona intestinalis. Gene Expr. Patterns 8, 349-356 (2008).


57. Hirayama, T. & Yagi, T. The role and expression of the protocadherin-alpha clusters in the CNS. Curr Opin Neurobiol 16, 336-42 (2006). 58. Sugino, H. et al. Genomic organization of the family of CNR cadherin genes in mice and humans. Genomics 63, 75-87 (2000). 59. Yanase, H., Sugino, H. & Yagi, T. Genomic sequence and organization of the family of CNR/Pcdhalpha genes in rat. Genomics 83, 717-726 (2004). 60. Sugino, H. et al. Distinct genomic sequence of the CNR/Pcdhalpha genes in chicken. Biochem. Biophys. Res. Commun 316, 437-445 (2004). 61. Tasic, B. et al. Promoter choice determines splice site selection in protocadherin alpha and gamma pre-mRNA splicing. Mol Cell 10, 21-33 (2002). 62. Hirayama, T., Sugino, H. & Yagi, T. Somatic mutations of synaptic cadherin (CNR family) transcripts in the nervous system. Genes Cells 6, 151-164 (2001). 63. Triana-Baltzer, G.B. & Blank, M. Cytoplasmic domain of protocadherin-alpha enhances homophilic interactions and recognizes cytoskeletal elements. J. Neurobiol 66, 393-407 (2006). 64. Fukuda, E. et al. Down-regulation of protocadherin-alpha A isoforms in mice changes contextual fear conditioning and spatial working memory. Eur J Neurosci 28, 1362-76 (2008). 65. Blank, M., Triana-Baltzer, G.B., Richards, C.S. & Berg, D.K. Alpha-protocadherins are presynaptic and axonal in nicotinic pathways. Mol Cell Neurosci 26, 530-43 (2004). 66. Kaneko, R., Kawaguchi, M., Toyama, T., Taguchi, Y. & Yagi, T. Expression levels of Protocadherin-alpha transcripts are decreased by nonsense-mediated mRNA decay with frameshift mutations and by high DNA methylation in their promoter regions. Gene 430, 86-94 (2009). 67. Kawaguchi, M. et al. Relationship between DNA methylation states and transcription of individual isoforms encoded by the protocadherin-alpha gene cluster. J Biol Chem 283, 12064-75 (2008). 68. Sugino, H. et al. 
Negative and positive effects of an IAP-LTR on nearby Pcdaalpha gene expression in the central nervous system and neuroblastoma cell lines. Gene 337, 91-103 (2004). 69. Murata, Y., Hamada, S., Morishita, H., Mutoh, T. & Yagi, T. Interaction with protocadherin-gamma regulates the cell surface expression of protocadherin-alpha. J Biol Chem 279, 49508-16 (2004). 70. Ribich, S., Tasic, B. & Maniatis, T. Identification of long-range regulatory elements in the protocadherin-α gene cluster. Proceedings of the National Academy of Sciences 103, 19719-19724 (2006). 71. Shapiro, L. et al. Structural basis of cell-cell adhesion by cadherins. Nature 374, 327-337 (1995). 72. Boggon, T.J. et al. C-cadherin ectodomain structure and implications for cell adhesion mechanisms. Science 296, 1308-1313 (2002). 73. Tamura, K., Shan, W.S., Hendrickson, W.A., Colman, D.R. & Shapiro, L. Structure-function analysis of cell adhesion by neural (N-) cadherin. Neuron 20, 1153-1163 (1998). 74. Steinberg, M.S. & Takeichi, M. Experimental specification of cell sorting, tissue spreading, and specific spatial patterning by quantitative differences in cadherin


expression. Proc. Natl. Acad. Sci. U.S.A 91, 206-209 (1994). 75. Shan, W.S. et al. Functional cis-heterodimers of N- and R-cadherins. J. Cell Biol 148, 579-590 (2000). 76. Mutoh, T., Hamada, S., Senzaki, K., Murata, Y. & Yagi, T. Cadherin-related neuronal receptor 1 (CNR1) has cell adhesion activity with [beta]1 integrin mediated through the RGD site of CNR1. Experimental Cell Research 294, 494-508 (2004). 77. Morishita, H. et al. Structure of the cadherin-related neuronal receptor/protocadherin-alpha first extracellular cadherin domain reveals diversity across cadherin families. J Biol Chem 281, 33650-63 (2006). 78. Hambsch, B., Grinevich, V., Seeburg, P.H. & Schwarz, M.K. {gamma}- Protocadherins, presenilin-mediated release of C-terminal fragment promotes locus expression. J Biol Chem 280, 15888-97 (2005). 79. Emond, M.R. & Jontes, J.D. Inhibition of protocadherin-alpha function results in neuronal death in the developing zebrafish. Dev Biol 321, 175-87 (2008). 80. Haas, I.G., Frank, M., Véron, N. & Kemler, R. Presenilin-dependent processing and nuclear function of gamma-protocadherins. J Biol Chem 280, 9313-9 (2005). 81. Fernández-Monreal, M., Kang, S. & Phillips, G.R. Gamma-protocadherin homophilic interaction and intracellular trafficking is controlled by the cytoplasmic domain in neurons. Mol Cell Neurosci 40, 344-53 (2009). 82. Reiss, K. et al. Regulated ADAM10-dependent ectodomain shedding of gamma- protocadherin C3 modulates cell-cell adhesion. J Biol Chem 281, 21735-44 (2006). 83. Jontes, J.D., Emond, M.R. & Smith, S.J. In Vivo Trafficking and Targeting of N- Cadherin to Nascent Presynaptic Terminals. J. Neurosci. 24, 9027-9034 (2004). 84. Phillips, G.R. et al. Gamma-protocadherins are targeted to subsets of synapses and intracellular organelles in neurons. J Neurosci 23, 5096-104 (2003). 85. Bennett, M.R. The concept of long term potentiation of transmission at synapses. Prog. Neurobiol 60, 109-137 (2000). 86. Benson, D.L., Schnapp, L.M., Shapiro, L. 
& Huntley, G.W. Making memories stick: cell-adhesion molecules in synaptic plasticity. Trends Cell Biol 10, 473-82 (2000). 87. Yamagata, K. et al. Arcadlin is a neural activity-regulated cadherin involved in long term potentiation. J Biol Chem 274, 19473-1979 (1999). 88. Tang, L., Hung, C.P. & Schuman, E.M. A role for the cadherin family of cell adhesion molecules in hippocampal long-term potentiation. Neuron 20, 1165-1175 (1998). 89. Tanaka, H. et al. Molecular modification of N-cadherin in response to synaptic activity. Neuron 25, 93-107 (2000). 90. Bozdagi, O., Shan, W., Tanaka, H., Benson, D.L. & Huntley, G.W. Increasing numbers of synaptic puncta during late-phase LTP: N-cadherin is synthesized, recruited to synaptic sites, and required for potentiation. Neuron 28, 245-259 (2000). 91. Kallenbach, S. et al. Changes in subcellular distribution of protocadherin gamma proteins accompany maturation of spinal neurons. J. Neurosci. Res 72, 549-556 (2003). 92. Carroll, P. et al. Juxtaposition of CNR protocadherins and reelin expression in the developing spinal cord. Mol. Cell. Neurosci 17, 611-623 (2001).


93. Hasegawa, S. et al. The protocadherin-alpha family is involved in axonal coalescence of olfactory sensory neurons into glomeruli of the olfactory bulb in mouse. Mol Cell Neurosci 38, 66-79 (2008). 94. Kano, M. & Hashimoto, K. Synapse elimination in the central nervous system. Curr. Opin. Neurobiol (2009). doi:10.1016/j.conb.2009.05.002 95. Bekirov, I.H., Needleman, L.A., Zhang, W. & Benson, D.L. Identification and localization of multiple classic cadherins in developing rat limbic system. Neuroscience 115, 213-27 (2002). 96. Phillips, G.R. et al. The presynaptic particle web: ultrastructure, composition, dissolution, and reconstitution. Neuron 32, 63-77 (2001). 97. Morishita, H. et al. Myelination triggers local loss of axonal CNR/protocadherin alpha family protein expression. Eur J Neurosci 20, 2843-7 (2004). 98. Weiner, J.A., Wang, X., Tapia, J.C. & Sanes, J.R. Gamma protocadherins are required for synaptic development in the spinal cord. Proc Natl Acad Sci U S A 102, 8-14 (2005). 99. Prasad, T., Wang, X., Gray, P.A. & Weiner, J.A. A differential developmental pattern of spinal interneuron apoptosis during synaptogenesis: insights from genetic analyses of the protocadherin-gamma gene cluster. Development 135, 4153-64 (2008). 100. Lefebvre, J.L., Zhang, Y., Meister, M., Wang, X. & Sanes, J.R. gamma-Protocadherins regulate neuronal survival but are dispensable for circuit formation in retina. Development 135, 4141-51 (2008). 101. Miki, R. et al. Identification and characterization of coding single-nucleotide polymorphisms within human protocadherin-alpha and -beta gene clusters. Gene 349, 1-14 (2005). 102. Schwab, S.G. et al. Evidence suggestive of a locus on chromosome 5q31 contributing to susceptibility for schizophrenia in German and Israeli families by multipoint affected sib-pair linkage analysis. Mol. Psychiatry 2, 156-160 (1997). 103. Lachman, H.M. et al. Analysis of protocadherin alpha gene deletion variant in bipolar disorder and schizophrenia. Psychiatr Genet 18, 110-5 (2008). 104. Pedrosa, E. et al. Analysis of protocadherin alpha gene enhancer polymorphism in bipolar disorder and schizophrenia. Schizophr Res 102, 210-9 (2008). 105. Koide, T., Moriwaki, K., Ikeda, K., Niki, H. & Shiroishi, T. Multi-phenotype behavioral characterization of inbred strains derived from wild stocks of Mus musculus. Mamm. Genome 11, 664-670 (2000). 106. Grant, S.G. et al. Impaired long-term potentiation, spatial learning, and hippocampal development in fyn mutant mice. Science 258, 1903-10 (1992). 107. Yagi, T. et al. A role for Fyn tyrosine kinase in the suckling behaviour of neonatal mice. Nature 366, 742-5 (1993). 108. Isosaka, T. et al. Activation of Fyn tyrosine kinase in the mouse dorsal hippocampus is essential for contextual fear conditioning. Eur J Neurosci 28, 973-81 (2008). 109. Senzaki, K., Ogawa, M. & Yagi, T. Proteins of the CNR family are multiple receptors for Reelin. Cell 99, 635-647 (1999). 110. Howell, B.W., Gertler, F.B. & Cooper, J.A. Mouse disabled (mDab1): a Src binding protein implicated in neuronal development. EMBO J 16, 121-132 (1997). 111. Angst, B.D., Marcozzi, C. & Magee, A.I. The cadherin superfamily: diversity in

form and function. J. Cell. Sci 114, 629-641 (2001). 112. Jossin, Y. et al. The Central Fragment of Reelin, Generated by Proteolytic Processing In Vivo, Is Critical to Its Function during Cortical Plate Development. J. Neurosci. 24, 514-521 (2004). 113. Ehlers, M.D., Fung, E.T., O'Brien, R.J. & Huganir, R.L. Splice variant-specific interaction of the NMDA receptor subunit NR1 with neuronal intermediate filaments. J. Neurosci 18, 720-730 (1998). 114. Adams, J.C. Characterization of cell-matrix adhesion requirements for the formation of fascin microspikes. Mol. Biol. Cell 8, 2345-2363 (1997). 115. Taguchi, Y., Koide, T., Shiroishi, T. & Yagi, T. Molecular evolution of cadherin-related neuronal receptor/protocadherin(alpha) (CNR/Pcdh(alpha)) gene cluster in Mus musculus subspecies. Mol. Biol. Evol 22, 1433-1443 (2005). 116. Schoenherr, C.J. & Anderson, D.J. The neuron-restrictive silencer factor (NRSF): a coordinate repressor of multiple neuron-specific genes. Science 267, 1360-1363 (1995). 117. Chong, J.A. et al. REST: A mammalian silencer protein that restricts sodium channel gene expression to neurons. Cell 80, 949-957 (1995). 118. Nishimura, E., Sasaki, K., Maruyama, K., Tsukada, T. & Yamaguchi, K. Decrease in neuron-restrictive silencer factor (NRSF) mRNA levels during differentiation of cultured neuroblastoma cells. Neurosci. Lett 211, 101-104 (1996). 119. Ballas, N., Grunseich, C., Lu, D.D., Speh, J.C. & Mandel, G. REST and its corepressors mediate plasticity of neuronal gene chromatin throughout neurogenesis. Cell 121, 645-657 (2005). 120. Palm, K., Belluardo, N., Metsis, M. & Timmusk, T.O. Neuronal Expression of Zinc Finger Transcription Factor REST/NRSF/XBR Gene. J. Neurosci. 18, 1280-1296 (1998). 121. Scholl, T., Stevens, M.B., Mahanta, S. & Strominger, J.L. A zinc finger protein that represses transcription of the human MHC class II gene, DPA. J. Immunol 156, 1448-1457 (1996). 122. Bessis, A., Champtiaux, N., Chatelin, L. & Changeux, J.
The neuron-restrictive silencer element: A dual enhancer/silencer crucial for patterned expression of a nicotinic receptor gene in the brain. Proceedings of the National Academy of Sciences of the United States of America 94, 5906-5911 (1997). 123. Chen, Z.F., Paquette, A.J. & Anderson, D.J. NRSF/REST is required in vivo for repression of multiple neuronal target genes during embryogenesis. Nat. Genet 20, 136-142 (1998). 124. Schoenherr, C.J., Paquette, A.J. & Anderson, D.J. Identification of potential target genes for the neuron-restrictive silencer factor. Proceedings of the National Academy of Sciences of the United States of America 93, 9881-9886 (1996). 125. Thiel, G., Lietz, M. & Cramer, M. Biological Activity and Modular Structure of RE- 1-silencing Transcription Factor (REST), a Repressor of Neuronal Genes. J. Biol. Chem. 273, 26891-26899 (1998). 126. Tapia-Ramírez, J., Eggen, B.J., Peral-Rubio, M.J., Toledo-Aral, J.J. & Mandel, G. A single zinc finger motif in the silencing factor REST represses the neural-specific type II sodium channel promoter. Proc. Natl. Acad. Sci. U.S.A 94, 1177-1182 (1997).

127. Lunyak, V.V. et al. Corepressor-dependent silencing of chromosomal regions encoding neuronal genes. Science 298, 1747-1752 (2002). 128. Yeo, M. et al. Small CTD phosphatases function in silencing neuronal gene expression. Science 307, 596-600 (2005). 129. Ballas, N. et al. Regulation of neuronal traits by a novel transcriptional complex. Neuron 31, 353-365 (2001). 130. Huang, Y., Myers, S.J. & Dingledine, R. Transcriptional repression by REST: recruitment of Sin3A and histone deacetylase to neuronal genes. Nat. Neurosci 2, 867-872 (1999). 131. Andrés, M.E. et al. CoREST: a functional corepressor required for regulation of neural-specific gene expression. Proc. Natl. Acad. Sci. U.S.A 96, 9873-9878 (1999). 132. Roopra, A., Qazi, R., Schoenike, B., Daley, T.J. & Morrison, J.F. Localized domains of G9a-mediated histone methylation are required for silencing of neuronal genes. Mol. Cell 14, 727-738 (2004). 133. Shi, Y. et al. Coordinated histone modifications mediated by a CtBP co-repressor complex. Nature 422, 735-738 (2003). 134. Bruce, A.W. et al. Genome-wide analysis of repressor element 1 silencing transcription factor/neuron-restrictive silencing factor (REST/NRSF) target genes. Proc. Natl. Acad. Sci. U.S.A 101, 10458-10463 (2004). 135. Kuwabara, T., Hsieh, J., Nakashima, K., Taira, K. & Gage, F.H. A small modulatory dsRNA specifies the fate of adult neural stem cells. Cell 116, 779-793 (2004). 136. Lunyak, V.V. & Rosenfeld, M.G. No Rest for REST: REST/NRSF Regulation of Neurogenesis. Cell 121, 499-501 (2005). 137. Belyaev, N.D. et al. Distinct RE-1 silencing transcription factor-containing complexes interact with different target genes. J. Biol. Chem 279, 556-561 (2004). 138. Conaco, C., Otto, S., Han, J. & Mandel, G. Reciprocal actions of REST and a microRNA promote neuronal identity. Proc. Natl. Acad. Sci. U.S.A 103, 2422-2427 (2006). 139. Canzonetta, C. et al. 
DYRK1A-dosage imbalance perturbs NRSF/REST levels, deregulating pluripotency and embryonic stem cell fate in Down syndrome. Am. J. Hum. Genet 83, 388-400 (2008). 140. Tahiliani, M. et al. The histone H3K4 demethylase SMCX links REST target genes to X-linked mental retardation. Nature 447, 601-605 (2007). 141. Kuwahara, K. et al. NRSF regulates the fetal cardiac gene program and maintains normal cardiac structure and function. EMBO J 22, 6310-6321 (2003). 142. Zuccato, C. et al. Huntingtin interacts with REST/NRSF to modulate the transcription of NRSE-controlled neuronal genes. Nat. Genet 35, 76-83 (2003). 143. Lepagnol-Bestel, A. et al. DYRK1A interacts with the REST/NRSF-SWI/SNF chromatin remodelling complex to deregulate gene clusters involved in the neuronal phenotypic traits of Down syndrome. Hum. Mol. Genet. 18, 1405-1414 (2009). 144. Brené, S. et al. Regulation of GluR2 promoter activity by neurotrophic factors via a neuron-restrictive silencer element. European Journal of Neuroscience 12, 1525-1533 (2000). 145. Kallunki, P., Edelman, G.M. & Jones, F.S. Tissue-specific Expression of the L1 Cell Adhesion Molecule Is Modulated by the Neural Restrictive Silencer Element.

J. Cell Biol. 138, 1343-1354 (1997). 146. Mieda, M., Haga, T. & Saffen, D.W. Expression of the Rat m4 Muscarinic Acetylcholine Receptor Gene Is Regulated by the Neuron-restrictive Silencer Element/Repressor Element 1. J. Biol. Chem. 272, 5854-5860 (1997). 147. Myers, S.J. et al. Transcriptional Regulation of the GluR2 Gene: Neural-Specific Expression, Multiple Promoters, and Regulatory Elements. J. Neurosci. 18, 6723-6739 (1998). 148. Schoch, S., Cibelli, G. & Thiel, G. Neuron-specific Gene Expression of Synapsin I. J. Biol. Chem. 271, 3317-3323 (1996). 149. Timmusk, T., Palm, K., Lendahl, U. & Metsis, M. Brain-derived Neurotrophic Factor Expression in Vivo Is under the Control of Neuron-restrictive Silencer Element. J. Biol. Chem. 274, 1078-1084 (1999). 150. Wood, I.C., Roopra, A. & Buckley, N.J. Neural Specific Expression of the m4 Muscarinic Acetylcholine Receptor Gene Is Mediated by a RE1/NRSE-type Silencing Element. J. Biol. Chem. 271, 14221-14225 (1996). 151. Southgate, D.A. Digestion and metabolism of sugars. Am. J. Clin. Nutr 62, 203S-210S; discussion 211S (1995). 152. Quezada-Calvillo, R. et al. Luminal starch substrate "brake" on maltase-glucoamylase activity is located within the glucoamylase subunit. J. Nutr 138, 685-692 (2008). 153. Ao, Z. et al. Evidence of native starch degradation with human small intestinal maltase-glucoamylase (recombinant). FEBS Lett 581, 2381-2388 (2007). 154. Nichols, B.L. et al. Human Small Intestinal Maltase-glucoamylase cDNA Cloning. HOMOLOGY TO SUCRASE-ISOMALTASE. J. Biol. Chem. 273, 3076-3081 (1998). 155. Peters, A. et al. The selfish brain: competition for energy resources. Neuroscience & Biobehavioral Reviews 28, 143-180 (2004). 156. Chugani, H.T. A Critical Period of Brain Development: Studies of Cerebral Glucose Utilization with PET. Preventive Medicine 27, 184-188 (1998). 157. Englyst, K.N. & Englyst, H.N. Carbohydrate Bioavailability. British Journal of Nutrition 94, 1-11 (2005). 158.
Lebenthal, E., Khin-Maung-U, Zheng, B.Y., Lu, R.B. & Lerner, A. Small intestinal glucoamylase deficiency and starch malabsorption: a newly recognized alpha-glucosidase deficiency in children. J. Pediatr 124, 541-546 (1994). 159. Sander, P. et al. Novel mutations in the human sucrase-isomaltase gene (SI) that cause congenital carbohydrate malabsorption. Hum. Mutat 27, 119 (2006). 160. Nichols, B.L. et al. Congenital maltase-glucoamylase deficiency associated with lactase and sucrase deficiencies. J. Pediatr. Gastroenterol. Nutr 35, 573-579 (2002). 161. Perry, G.H. et al. Diet and the evolution of human amylase gene copy number variation. Nat. Genet 39, 1256-1260 (2007). 162. Li, J.Z. et al. Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science 319, 1100-1104 (2008). 163. Benovoy, D. & Drouin, G. Ectopic gene conversions in the human genome. Genomics 93, 27-32 (2009). 164. Hardison, R. & Miller, W. Use of long sequence alignments to study the evolution and regulation of mammalian globin gene clusters. Mol. Biol. Evol 10, 73-102

(1993). 165. Sharon, D. et al. Primate evolution of an olfactory receptor cluster: diversification by gene conversion and recent emergence of pseudogenes. Genomics 61, 24-36 (1999). 166. Galtier, N. Gene conversion drives GC content evolution in mammalian histones. Trends Genet 19, 65-68 (2003). 167. Galtier, N., Piganeau, G., Mouchiroud, D. & Duret, L. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159, 907-911 (2001). 168. Eyre-Walker, A. Evidence of Selection on Silent Site Base Composition in Mammals: Potential Implications for the Evolution of Isochores and Junk DNA. Genetics 152, 675-683 (1999). 169. Elliott, B., Richardson, C., Winderbaum, J., Nickoloff, J.A. & Jasin, M. Gene conversion tracts from double-strand break repair in mammalian cells. Mol. Cell. Biol 18, 93-101 (1998). 170. Nekrutenko, A., Makova, K.D. & Li, W. The KA/KS Ratio Test for Assessing the Protein-Coding Potential of Genomic Regions: An Empirical and Simulation Study. Genome Research 12, 198-202 (2002). 171. Stephens, J.C. et al. Haplotype Variation and Linkage Disequilibrium in 313 Human Genes. Science 293, 489-493 (2001). 172. Freudenberg-Hua, Y. et al. Single Nucleotide Variation Analysis in 65 Candidate Genes for CNS Disorders in a Representative Sample of the European Population. Genome Res. 13, 2271-2276 (2003). 173. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928-933 (2001). 174. Birdsell, J.A. Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol. Biol. Evol 19, 1181-1197 (2002). 175. Gumbiner, B.M. Cell adhesion: the molecular basis of tissue architecture and morphogenesis. Cell 84, 345-357 (1996). 176. Behrens, J. et al. Functional interaction of beta-catenin with the transcription factor LEF-1. Nature 382, 638-642 (1996). 177. Mosimann, C., Hausmann, G. & Basler, K.
Beta-catenin hits chromatin: regulation of Wnt target gene activation. Nat. Rev. Mol. Cell Biol 10, 276-286 (2009). 178. Wodarz, A. & Nusse, R. Mechanisms of Wnt signaling in development. Annu. Rev. Cell Dev. Biol 14, 59-88 (1998). 179. Yap, A.S. & Kovacs, E.M. Direct cadherin-activated cell signaling: a view from the plasma membrane. J. Cell Biol 160, 11-16 (2003). 180. Bonn, S., Seeburg, P.H. & Schwarz, M.K. Combinatorial expression of alpha- and gamma-protocadherins alters their presenilin-dependent processing. Mol Cell Biol 27, 4121-32 (2007). 181. Schroeter, E.H., Kisslinger, J.A. & Kopan, R. Notch-1 signalling requires ligand-induced proteolytic release of intracellular domain. Nature 393, 382-386 (1998). 182. De Strooper, B. et al. A presenilin-1-dependent [gamma]-secretase-like protease mediates release of Notch intracellular domain. Nature 398, 518-522 (1999). 183. Fortini, M.E. [gamma]-Secretase-mediated proteolysis in cell-surface-receptor

signalling. Nat Rev Mol Cell Biol 3, 673-684 (2002). 184. Marambaud, P. et al. A presenilin-1/γ-secretase cleavage releases the E-cadherin intracellular domain and regulates disassembly of adherens junctions. EMBO J. 21, 1948-1956 (2002). 185. Marambaud, P. et al. A CBP binding transcriptional repressor produced by the PS1/epsilon-cleavage of N-cadherin is inhibited by PS1 FAD mutations. Cell 114, 635-645 (2003). 186. Reiss, K. et al. ADAM10 cleavage of N-cadherin and regulation of cell–cell adhesion and β-catenin nuclear signalling. EMBO J 24, 742-752 (2005). 187. Maretzky, T. et al. ADAM10 mediates E-cadherin shedding and regulates epithelial cell-cell adhesion, migration, and β-catenin translocation. Proceedings of the National Academy of Sciences of the United States of America 102, 9182-9187 (2005). 188. Spasic, D. & Annaert, W. Building gamma-secretase: the bits and pieces. J. Cell. Sci 121, 413-420 (2008). 189. Atouf, F., Czernichow, P. & Scharfmann, R. Expression of Neuronal Traits in Pancreatic Beta Cells. IMPLICATION OF NEURON-RESTRICTIVE SILENCING FACTOR/REPRESSOR ELEMENT SILENCING TRANSCRIPTION FACTOR, A NEURON-RESTRICTIVE SILENCER. J. Biol. Chem. 272, 1929-1934 (1997). 190. Kemp, D.M., Lin, J.C. & Habener, J.F. Regulation of Pax4 paired homeodomain gene by neuron-restrictive silencer factor. J. Biol. Chem 278, 35057-35062 (2003). 191. Griffiths-Jones, S. The microRNA Registry. Nucleic Acids Res 32, D109-111 (2004). 192. GNF Gene Atlas. GNF Gene Atlas at 193. Buchanan, J.A. & Scherer, S.W. Contemplating effects of genomic structural variation. Genet. Med 10, 639-647 (2008). 194. Zhang, F., Gu, W., Hurles, M.E. & Lupski, J.R. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10, 451-481 (2009). 195. Sharp, A.J. Emerging themes and new challenges in defining the role of structural variation in human disease. Hum. Mutat 30, 135-144 (2009). 196. Ionita-Laza, I., Rogers, A.J., Lange, C., Raby, B.A. & Lee, C.
Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics 93, 22-26 (2009). 197. Itsara, A. et al. Population analysis of large copy number variants and hotspots of human genetic disease. Am. J. Hum. Genet 84, 148-161 (2009). 198. Nichols, B.L. et al. The maltase-glucoamylase gene: common ancestry to sucrase-isomaltase with complementary starch digestion activities. Proc. Natl. Acad. Sci. U.S.A 100, 1432-1437 (2003). 199. Huh, G.S. & Hynes, R.O. Elements regulating an alternatively spliced exon of the rat fibronectin gene. Mol. Cell. Biol 13, 5301-5314 (1993). 200. Rodwell, G.E.J. et al. A transcriptional profile of aging in the human kidney. PLoS Biol 2, e427 (2004). 201. Noonan, J.P. The Evolution of Protocadherin Gene Cluster Diversity. (2004).
