Novel genes in the class II region of the human major histocompatibility complex

Isabel Mary Hanson

A thesis submitted for the degree of Doctor of Philosophy in the University of London

Human Immunogenetics Laboratory Imperial Cancer Research Fund 44 Lincolns Inn Fields London

and

Department of Genetics and Biometry University College Gower Street London

February 1991

to Chris

ACKNOWLEDGEMENTS

I am very grateful to the friends and colleagues who have helped to make my three years at ICRF such an enjoyable time. First and foremost I would like to thank my supervisor, John Trowsdale, and my lab. mum, Pat Miller. Next, I gratefully acknowledge my fellow Human Immunogeneticists, past and present: Danny Altman, Ian 'Peachy Keen' Campbell, Vincent Cunliffe, Will Foulkes, Richard Glynne, Vikki Groves, Hitoshi Ikeda, Adrian '' Kelly, Lesley-Anne Kerr, Chris 'More tea, vicar?' Lock, Ruth 'Someone's stolen my bike' Lovering, Ian Mockridge, Steve Powis, Jiannis 'Disaster' Ragoussis, Philippe Sanseau, David 'Captain' Sansom and David Wilkinson. I am also indebted to Kathy Cheah, Paul Freemont, Pat Gorman, Ketan Patel, Sabine Myers, Sue 'Perfect Day' Rider, Melissa 'Ruby' Rubock, Lisa Stubbs, Susan Tonks and my external supervisor, Jonathan Wolfe. Beyond ICRF I would like to acknowledge my parents, Charles & Anne Hanson, and my parents-in-law-to-be, Ken & Penny Riley. I am particularly grateful to Patricia Cocks for providing me with a roof over my head in salubrious St. John's Wood. Last, but by no means least, I would like to thank Chris Riley for his support which never faltered even though I didn't name my new genes after him.

ABSTRACT

The aim of the work described in this thesis was to identify and characterise novel genes in the class II region of the human major histocompatibility complex (MHC).

A physical map of the region, spanning over 1Mbp, was constructed using pulsed field gel electrophoresis (PFGE) in conjunction with probes for the known class II genes. This map facilitated the direct localisation of the gene for the α2 chain of type XI fibrillar collagen, COL11A2, to a region just centromeric of the class II DP subregion. In addition, the PFGE map revealed four clusters of sites for restriction endonucleases which cut preferentially in CpG-rich regions often found at the 5′ ends of genes. Three of these clusters were cloned by cosmid walking and jumping. Genomic fragments from these regions were hybridised to cDNA libraries, which resulted in the identification of five novel genes designated RING1-5. RING1, RING2 and RING5 were 95kb, 90kb and 85kb proximal respectively to DPB2. RING3 was 35kb distal to DNA. RING4 was 25kb proximal to DOB.

Nucleotide sequencing revealed that RING1-5 were not related to each other or to the class II genes. The predicted product of RING1 contained a novel cysteine-histidine motif which was conserved in a variety of other proteins and which was reminiscent of domains found in zinc-dependent nucleic acid binding proteins. RING3 potentially encoded a protein with striking homology to the product of fsh, a Drosophila developmental gene. The putative product of RING4 was a member of the 'ABC' superfamily of ATP-dependent transporter proteins. RING5 was the human homologue of KE4, a gene in the mouse MHC region which is a candidate for a developmental lethal mutation.

These findings may be of importance in understanding MHC/disease associations and the role of the MHC in the immune response.

CONTENTS

Page

1. The major histocompatibility complex
1.1. Classical studies of the MHC 12
1.2. Function of class I and class II gene products 16
1.3. Molecular genetics of the MHC 26
1.4. Evolution of the MHC 36
1.5. Clinical relevance of the class II region 42
1.6. Evidence for other genes in the class II region 45
1.7. Approaches to finding novel genes 48

2. Materials and Methods
2.1. Bacterial cell culture 51
2.2. Screening of recombinant DNA libraries 52
2.3. Subcloning of DNA fragments 55
2.4. Transformation of bacterial cells with DNA by electroporation 56
2.5. Preparation of DNA from transformed bacterial cells 57
2.6. Small scale preparation of bacteriophage λ DNA 60
2.7. Preparation of eukaryotic DNA from cells in culture 61
2.8. Preparation of very high molecular weight DNA for pulsed field gel electrophoresis 62
2.9. Preparation of RNA 63
2.10. Restriction endonuclease digestion of DNA 64
2.11. Electrophoresis of DNA and RNA 66
2.12. Preparation of DNA and RNA blots 67
2.13. Preparation of DNA probes 69
2.14. Hybridisation of blots 70
2.15. DNA sequencing 71

3. Identification of potential CpG islands in the class II region by PFGE mapping
3.1. Strategy to identify potential sites of genes 75
3.2. Choice of materials for construction of the PFGE map 77
3.3. Construction of the PFGE map 79
3.4. Summary and discussion 87

4. Mapping of the COL11A2 gene to the class II region
4.1. Introduction 89
4.2. Mapping of COL11A2 using somatic cell hybrids 92
4.3. Mapping of COL11A2 by PFGE 92
4.4. Cosmid walking between the DP subregion and COL11A2 95
4.5. Summary and discussion 97

5. Identification of novel genes associated with clusters of rare-cutter sites
5.1. Introduction 99
5.2. Cloning of cluster 1 by cosmid walking 99
5.3. Cloning of cluster 2 by chromosome jumping 105
5.4. Identification of cluster 3 in previously isolated cosmid clones 112
5.5. Analysis of cloned regions for coding sequences 115
5.6. Expression patterns of RING1-5 121
5.7. Refinement of the physical map of the class II region 124
5.8. Summary and discussion 127

6. Characterisation of novel genes by nucleotide sequencing
6.1. Introduction 129
6.2. Nucleotide sequence of RING1 129
6.3. Partial nucleotide sequence of RING2 139
6.4. Nucleotide sequence of RING3 139
6.5. Nucleotide sequence of RING4 146
6.6. Partial nucleotide sequence of RING5 153
6.7. Summary 155

7. Comparative mapping of novel genes in the MHC region of mouse and man
7.1. Introduction 156
7.2. Determination of the positions of sequences homologous to the human genes RING1-5 and COL11A2 in the mouse MHC 158
7.3. Determination of the positions of sequences homologous to three mouse genes, KE3-5, in the human MHC 161
7.4. Summary and discussion 168

8. RFLP analysis of novel genes
8.1. Introduction 172
8.2. Approach to identifying RFLPs 172
8.3. RFLP analysis of COL11A2 173
8.4. RFLP analysis of RING1 174
8.5. RFLP analysis of RING2 175
8.6. RFLP analysis of RING3 176
8.7. RFLP analysis of RING4 177
8.8. RFLP analysis of B51 178
8.9. Summary and discussion 179

9. Concluding discussion
9.1. Advances in MHC mapping 181
9.2. Potential role of novel genes in class II-associated phenotypes 183
9.3. Additional applications of probes generated in this study 188
9.4. More new genes in the class II region? 189
9.5. Function of the MHC gene cluster 190

10. References 192

Appendices
A. Cosmid clones 221
B. cDNA clones 224
C. Gene symbols and database accession numbers 225

FIGURES

1.1. Position of the MHC on chromosome 6 15
1.2. Schematic illustration of antigen presentation 17
1.3. Schematic structures of class I and class II molecules 23
1.4. Atomic structures of class I and class II molecules 24
1.5. Molecular genetic maps of the MHC 28
1.6. Evolutionary tree for the class II genes 38
1.7. Organisation of murine and human MHC regions 40

3.1. Diagram of PFGE apparatus 78
3.2. PFGE mapping of the class II region 80
3.3. PFGE mapping of the class II region 81
3.4. PFGE mapping of the class II region 82
3.5. Physical map of the class II region showing clusters of rare-cutter sites 86

4.1. Generalised structure of a fibrillar collagen molecule 89
4.2. Mapping of COL11A2 by in situ hybridisation 91
4.3. Mapping of COL11A2 using somatic cell hybrids 93
4.4. PFGE mapping of COL11A2 94
4.5. Cosmid clones between COL11A2 and the DP subregion 96

5.1. Cosmid clones in the region centromeric of COL11A2 100
5.2. Detailed map of the new cosmid clones 101
5.3. Determination of the methylation status of the cloned rare-cutter sites in the genome 103
5.4. Determination of the methylation status of the cloned rare-cutter sites in the genome 104
5.5. Construction of a rare-cutter jumping library 106
5.6. Restriction maps of jumping clone λJ2 and cosmid HPB.ALL 71 109
5.7. Determination of the methylation status of the cloned rare-cutter sites in the genome 110
5.8. Overview of region cloned by jumping 111
5.9. Map of overlapping cosmids in the DQ subregion 113
5.10. Determination of the methylation status of the cloned rare-cutter sites in the genome 114
5.11. Zoo blot analysis of genomic fragment 33X1 116
5.12. Northern blot analysis of genomic fragment 33X1 118
5.13. Novel genes in clusters 1, 2 and 3 120
5.14. Northern blot analyses of RING1-5 122
5.15. Mapping of the DNA gene 125
5.16. Physical map of the class II region showing the positions of RING1-5 and COL11A2 126

6.1. Nucleotide and amino acid sequence of RING1 132
6.2. Alignment of proteins sharing the RING1 cysteine-histidine motif 133
6.3. Hypothetical structure of the RING1 cysteine-histidine motif 137
6.4. Nucleotide and amino acid sequence of RING2 138
6.5. Restriction map of RING3 cDNA clones 141
6.6. Nucleotide sequence of RING3 142
6.7. Alignment of the RING3 amino acid sequence with fsh 144
6.8. Nucleotide sequence of RING4 147
6.9. Hydropathicity plot of the RING4 protein product 148
6.10. Alignment of the RING4 amino acid sequence with members of the ABC superfamily 150
6.11. Schematic diagram of ABC structure 151
6.12. Alignment of the RING5 nucleotide and amino acid sequence with KE4 154

7.1. Molecular map of the proximal region of the mouse MHC 157
7.2. Determination of the positions of sequences homologous to the human genes RING1-5 and COL11A2 in the mouse MHC 159
7.3. Determination of the positions of sequences homologous to the mouse genes KE4 and KE5 in the human MHC 163
7.4. Restriction map of cosmid HPB.ALL 51 165
7.5. PFGE mapping of B51 166
7.6. Physical map of the proximal end of the human MHC region 167
7.7. Comparative map of the proximal regions of the mouse and human MHCs 169

A.1. Restriction map of the cosmid vector cos202 222
B.1. Restriction map of the cDNA vector CDM8 223

TABLES

1.1. Serologically defined HLA specificities 19
1.2. Associations between the MHC and diseases 43

3.1. Distribution of rare-cutter sites in human DNA 76
3.2. PFGE fragments detected by class II probes 83

4.1. PFGE fragments detected by COL11A2 and DPB1 95

5.1. PFGE fragments detected by COL11A2 and 33X1 102
5.2. Zoo blot results obtained with cosmid fragments 115
5.3. cDNA clones isolated with cosmid fragments 121
5.4. Expression patterns of RING1-5 123

6.1. Homologues of RING1 134

7.1. Genomic probes for the mouse genes KE2-5 162
7.2. PFGE fragments detected by B51 and 33X1 168

8.1. RFLP analysis of COL11A2 173
8.2. RFLP analysis of RING1 (CEM15) 174
8.3. RFLP analysis of RING1 (33X1) 175
8.4. RFLP analysis of RING2 175
8.5. RFLP analysis of RING3 176
8.6. RFLP analysis of RING4 178
8.7. RFLP analysis of B51 179

A.1. Orientation of cosmid clones 221
B.1. Orientation of cDNA clones 225

1. The major histocompatibility complex

The major histocompatibility complex (MHC) is one of the best characterised regions of the human genome. Situated on chromosome 6, region p21.3, it spans about 4Mbp of DNA and contains over 70 genes. Three features of the MHC in particular have stimulated the enormous amount of work on this gene cluster. First, it encodes molecules which play a central role in the regulation of the immune response. Second, it is associated with numerous diseases. Third, it represents 1/750th of the human genetic material, and as such is a paradigm for the detailed molecular genetic organisation of the genome.
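The 1/750 figure can be checked with simple arithmetic. The sketch below is illustrative only: the 4Mbp size of the MHC is taken from the text, whereas the haploid genome size of roughly 3000Mbp is an assumption that the text does not state, and the function name is hypothetical.

```python
# Rough check of the fraction of the genome occupied by the MHC.
MHC_SIZE_MBP = 4          # approximate size of the MHC, from the text
GENOME_SIZE_MBP = 3000    # ASSUMPTION: approximate haploid human genome size


def genome_fraction_denominator(region_mbp, genome_mbp):
    """Return n such that the region makes up about 1/n of the genome."""
    return genome_mbp / region_mbp


if __name__ == "__main__":
    n = genome_fraction_denominator(MHC_SIZE_MBP, GENOME_SIZE_MBP)
    print(f"The MHC is about 1/{n:.0f} of the genome")
```

Under this assumed genome size, the calculation gives a denominator of 750, in agreement with the fraction quoted above.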

1.1. Classical studies of the MHC

Many of the MHC loci were discovered by chance and were studied for several years in the absence of any knowledge of their true function. This section summarises experiments in both human and murine systems which led to the discovery of the MHC.

1.1.1. Class I molecules

Some of the first experiments which identified a phenotype controlled by the MHC were performed seventy five years ago when it was shown that the ability to successfully transplant tumours between mice was genetically determined, being dependent on the strains selected as donor and host (Little and Tyzzer, 1916; Klein, 1986). Much later it was shown that the outcome of tissue transplants segregated with a serologically defined blood antigen, H(histocompatibility)-2 (Gorer, 1937; Gorer et al., 1948). More refined genetic and serological analyses revealed that the situation was more complicated, with two segregant allelic series, H-2K and H-2D, playing a major role in determining the success of transplants (Amos et al., 1955). In humans, the presence of histocompatibility antigens was demonstrated when it was discovered that the sera of multiply transfused patients and multiparous women contained antibodies which were capable of agglutinating the leukocytes of some donors but not others (Dausset, 1958; Payne and Rolfs, 1958; van Rood et al., 1958). The complex patterns of reactivity of these sera with large panels of donors were statistically analysed to define the first genetically controlled human leukocyte antigens (HLA; van Rood and van Leeuwen, 1963; Payne et al., 1964). Initially, two distinct loci, HLA-A and HLA-B, were found to control the expression of HLA molecules. It was soon demonstrated by family studies and population studies that these loci were closely linked (Ceppelini et al., 1967). Later, a third closely linked locus, HLA-C, was also described (Svejgaard et al., 1973). Studies of the effects of matching at the HLA-A, -B and -C loci on the survival of human tissue transplants suggested that the HLA antigens were analogous to the histocompatibility loci (H-2D and H-2K) which had been previously defined in mice. These molecules were classified together as the classical transplantation antigens, or class I antigens. The finding that humans and mice (and other species) had a gene cluster which played an important role in the survival of tissue transplants gave rise to the concept of the major histocompatibility complex.

1.1.2. Class II molecules

Evidence for a novel MHC-encoded function in man was obtained when it was found that strong proliferative responses were obtained when lymphocytes from different individuals were cultured together in vitro (Bach and Hirschorn, 1964). The locus responsible for this mixed lymphocyte reaction (MLR) was closely linked to those encoding the classical transplantation antigens, but was distinct from them because the MLR proceeded even when the lymphocyte donors were matched for HLA-A, -B and -C (Yunis and Amos, 1971). The HLA-D region was thus defined. Later it became apparent from more detailed serological analysis that the HLA-D region contained two loci, HLA-DR and HLA-DQ (Tosi et al., 1978). Finally it was found that MLR occurred even between lymphocytes matched for DR and DQ and that there was a third D-region locus, HLA-DP (Shaw et al., 1980).

In mice, evidence for a new MHC-linked function came from studies of the ability of different inbred strains to mount an immune response to synthetic antigens. The 'immune response' (Ir) genes controlling this phenotype were shown to map to a region of the mouse MHC distinct from those encoding the H-2K and D molecules (McDevitt et al., 1972). Two Ir molecules, I-E and I-A, were defined. Later it was shown that the mouse Ir and the human D-region molecules were related, and they were termed class II antigens.

1.1.3. Class III molecules

The first non-class I/class II gene was mapped to the MHC when it was discovered that mouse serum contained a serologically defined factor which was encoded in the region between the H-2K and H-2D loci (Shreffler and Owen, 1963; Shreffler and David, 1972). Later it was demonstrated that this was the complement component C4 (Meo et al., 1975). Similarly in man it was found that two serologically defined serum proteins were encoded by the MHC, and later it was discovered that these proteins were the products of the two C4 loci (O'Neill et al., 1978). Complement components C2 and factor B (Bf) were also found to be encoded in the interval between HLA-D and HLA-B (Fu et al., 1974; Lamm et al., 1976). The genes encoding C4, C2 and Bf were inseparable from one another by recombination (Weitkamp and Lamm, 1982). The interval between the class II region and the class I region which encoded the complement components was designated the class III region.

1.1.4. Mapping of the classical MHC loci

The polymorphism of the MHC molecules facilitated the mapping of classical human MHC loci by recombination analysis, revealing the gene order DP-[DQ, DR]-[C4, C2, Bf]-B-C-A (Figure 1.1.; Weitkamp and Lamm, 1982).

The chromosomal location of the MHC was deduced from a variety of approaches. For example, HLA typing of the family of an individual with a cytologically detectable translocation breakpoint in region 21 on the short arm of chromosome 6 revealed that the breakpoint had occurred in the middle of the MHC region, thus positioning the MHC at 6p21 (Berger et al., 1979). More recently, HLA typing of γ-radiation-induced MHC-loss mutant cell lines coupled with high-resolution cytogenetic karyotyping facilitated the mapping of the MHC to the distal portion of band 6p21.3 (Spring et al., 1985). From recombination mapping of the MHC loci relative to other polymorphic markers on chromosome 6 the orientation of the complex on the short arm and its position relative to these markers was determined. These studies revealed that the DP locus was closest to the centromere (Figure 1.1.; Weitkamp and Lamm, 1982).

Figure 1.1. Position of the classical MHC loci relative to one another and to other polymorphic markers on human chromosome 6 (from Weitkamp and Lamm, 1982).

1.2. Function of class I and class II gene products

The class I molecules were initially defined as transplantation antigens (section 1.1.1.). The class II molecules were initially defined in man as being responsible for the mixed lymphocyte reaction and in mouse as the immune response genes (section 1.1.2.). However, it is now apparent that the class I and class II molecules are closely related, both structurally and functionally. By binding peptide fragments and presenting them on the cell surface, class I and class II molecules provide the context for recognition of peptide antigen by the T-cell receptor, and as such play a central role in the control of the immune response. In general, class I molecules present peptides derived from endogenous proteins to CD8+ cytotoxic T-lymphocytes while class II molecules present peptides derived from exogenous proteins to CD4+ helper T-lymphocytes (Figure 1.2.). The interaction between the MHC/antigen complex and the T-cell receptor results in activation of the T-cell. Stimulated helper T-lymphocytes secrete lymphokines which promote antibody production by B-cells and assist in the activation of cytotoxic T-cells. Stimulated cytotoxic T-cells lyse the cell presenting the foreign antigen (Klein, 1986).

Class I molecules are expressed by virtually all nucleated cells. Class II molecules are expressed constitutively by B-lymphocytes, macrophages and activated T-lymphocytes, and de novo expression can be induced by γ-interferon in a wide variety of cell types (Cresswell, 1987).

1.2.1. Polymorphism of class I and class II gene products

It was the polymorphism of the classical class I, class II and class III molecules which initially led to their discovery. In fact, the MHC is the most polymorphic gene system known.

Figure 1.2. Highly schematic illustration of antigen presentation by class I and class II molecules to cytotoxic and helper T-cells respectively. APC, antigen presenting cell; TCR, T-cell receptor. CD8 and CD4 are 'accessory' molecules on the surface of cytotoxic and helper T-cells respectively which are also involved in the interaction between the T-cell receptor and the MHC molecule.

The development of serological reagents to characterise the human and murine class I and class II antigens revealed that for each locus a remarkably large number of alleles were present in the population at a significant frequency (>1%). A list of the serologically defined human class I and class II specificities is shown in Table 1.1. (Bodmer et al., 1989b). More recently, the PCR-based cloning and sequencing of numerous class II genes has revealed even greater polymorphism at the DNA level (Bodmer et al., 1990). All the serologically detectable variation in class II DR molecules is attributable to polymorphism in the β-chain, because the α-chain is monomorphic. In contrast, both DQ chains are polymorphic. The DP α-chain is polymorphic, although markedly less so than the DP β-chain (Trowsdale et al., 1985).

Although some species, such as the Syrian hamster, seem to display limited MHC polymorphism, the highly polymorphic human and mouse systems are probably more typical (Klein, 1986). Because the polymorphism is especially marked in those loci which encode the functional antigen presenting molecules, and tends to be particularly concentrated within the exons encoding the domains which form the antigen binding groove, it is most likely that it has been established by natural selection rather than through random drift (Klein, 1986; Parham et al., 1988; Marsh and Bodmer, 1989; Klein and Takahara, 1990).

The major selection pressure on MHC class I and class II polymorphism has probably been exerted through pathogens (Klein, 1986). As described below, class I and class II molecules are important in determining which antigens an individual can respond to in two ways: through selective binding of specific peptide fragments, and through influencing the development of the T-cell repertoire. An individual whose MHC molecules fail to stimulate an immune response against a critical antigenic determinant on a pathogen, either because the antigen does not bind to the presenting molecules or because the individual's T-cells cannot recognise the MHC/antigen complex, has a selective disadvantage if infected by that pathogen. However, MHC polymorphism is of great advantage to the population as a whole because it ensures that other individuals will have combinations of class I and class II alleles which are effective in responding to the same pathogen (Benacerraf, 1981; Klein, 1986). Natural selection is probably responsible not only for maintaining the multiple alleles seen at each functional class I and class II locus but also for maintaining the multiple class I and class II loci. The potential of each haplotype to encode more than one class I or class II molecule increases the range of antigens to which an individual can respond (Doherty et al., 1976).

A: A1, A2, A3, A9, A10, A11, Aw19, A23(9), A24(9), A25(10), A26(10), A28, A29(w19), A30(w19), A31(w19), A32(w19), Aw33(w19), Aw34(10), Aw36, Aw43, Aw66(10), Aw68(28), Aw69(28), Aw74(w19)

B: B5, B7, B8, B12, B13, B14, B15, B16, B17, B18, B21, Bw22, B27, B35, B37, B38(16), B39(16), B40, Bw41, Bw42, B44(12), B45(12), Bw46, Bw47, Bw48, B49(21), Bw50(21), B51(5), Bw52(5), Bw53, Bw54(w22), Bw55(w22), Bw56(w22), Bw57(17), Bw58(17), Bw59, Bw60(40), Bw61(40), Bw62(15), Bw63(15), Bw64(14), Bw65(14), Bw67, Bw70, Bw71(w70), Bw72(w70), Bw73, Bw75(15), Bw76(15), Bw77(15), Bw4, Bw6

C: Cw1, Cw2, Cw3, Cw4, Cw5, Cw6, Cw7, Cw8, Cw9(w3), Cw10(w3), Cw11

D: Dw1, Dw2, Dw3, Dw4, Dw5, Dw6, Dw7, Dw8, Dw9, Dw10, Dw11(w7), Dw12, Dw13, Dw14, Dw15, Dw16, Dw17(w7), Dw18(w6), Dw19(w6), Dw20, Dw21, Dw22, Dw23, Dw24, Dw25, Dw26

DR: DR1, DR2, DR3, DR4, DR5, DRw6, DR7, DRw8, DR9, DRw10, DRw11(5), DRw12, DRw13(w6), DRw14(w6), DRw15(2), DRw16(2), DRw17(3), DRw18(3), DRw52, DRw53

DQ: DQw1, DQw2, DQw3, DQw4, DQw5(w1), DQw6(w1), DQw7(w3), DQw8(w3), DQw9(w3)

DP: DPw1, DPw2, DPw3, DPw4, DPw5, DPw6

Table 1.1. Complete listing of serologically defined HLA specificities from the Tenth International Histocompatibility Workshop (Bodmer et al., 1989b).

1.2.2. Functional significance of class I and class II polymorphism

A series of key experiments contributing to contemporary understanding of the function of class I and class II molecules was performed by Zinkernagel and Doherty (1975; Doherty et al., 1976). Different strains of mice were infected intracerebrally with lymphocytic choriomeningitis virus (LCMV). One week later the mice were killed and their T-lymphocytes tested for the ability to lyse LCMV-infected fibroblasts from a mouse with the H-2k MHC haplotype. The target fibroblasts were killed only by T-cells derived from H-2k strains. It was later shown that all the mouse strains produced cytotoxic T-lymphocytes in response to the viral infection but in each case these lymphocytes were only able to lyse target cells of the same (self) MHC haplotype. Specifically, the phenotype was determined by the class I H-2K gene. On the basis of this work it was proposed that the cytotoxic T-lymphocytes had dual specificity, simultaneously recognising the foreign viral antigen and the self class I molecule. Later it was demonstrated that guinea-pig T-helper lymphocytes were only activated by antigens presented by cells of the self class II haplotype (Sprent, 1978). These studies gave rise to the concept of MHC-restricted antigen presentation, in which cytotoxic T-cells and helper T-cells recognise foreign antigen only in association with self class I or class II molecules respectively.
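The logic of MHC restriction can be expressed as a simple rule: lysis requires both that the target cell carries the viral antigen and that it shares the MHC haplotype of the responding T-cell. The following is a minimal sketch of that rule, not a model of the original experiments. All class and function names are hypothetical, and the haplotype label H-2b is an illustrative assumption; only H-2k is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetCell:
    haplotype: str   # MHC haplotype of the target cell, e.g. "H-2k"
    infected: bool   # True if the cell presents viral antigen


@dataclass(frozen=True)
class CytotoxicTCell:
    self_haplotype: str   # MHC haplotype of the mouse that raised the T-cell
    virus_primed: bool    # True if the mouse had been infected with the virus


def lyses(t_cell, target):
    """Dual-specificity rule: foreign antigen AND self MHC are both required."""
    return (
        t_cell.virus_primed
        and target.infected
        and t_cell.self_haplotype == target.haplotype
    )


if __name__ == "__main__":
    target = TargetCell(haplotype="H-2k", infected=True)
    # "H-2b" is a hypothetical second strain used only for illustration.
    for strain in ("H-2k", "H-2b"):
        t_cell = CytotoxicTCell(self_haplotype=strain, virus_primed=True)
        print(strain, "T-cells lyse H-2k target:", lyses(t_cell, target))
```

In this toy rule an uninfected target of the matching haplotype is spared, as is an infected target of a different haplotype, which mirrors the dual specificity proposed from the experiments described above.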

At first it was not known whether the T-cells recognised the antigen and MHC molecule separately with different receptors or together with the same receptor, or what form the antigen was in. Over the subsequent years it became apparent that T-cells had a single receptor which recognised a single complex of MHC molecule with antigen in the form of a peptide of about 5-10 residues in length (Figure 1.2.; Owen and Crumpton, 1987; Davis and Bjorkman, 1988). Proteins from which MHC-presented antigens are derived are now known to be processed by partial proteolysis before presentation (Townsend and Bodmer, 1989).

Classical in vivo studies of the control of the immune response had already indicated that class II molecules could present short synthetic peptides (Benacerraf, 1981) and it was later shown that peptides could bind directly to purified class II molecules in vitro (Babbitt et al., 1985; Buus et al., 1986). Similarly it was shown that the antigen presented by class I molecules could be mimicked by short synthetic peptides (Townsend et al., 1986) and that class I molecules purified from cells were bound to peptides of 8-9 residues in length (Elliott et al., 1990).
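To give a sense of how many short peptides a single protein could in principle supply, the sketch below enumerates every contiguous fragment of 8 or 9 residues from a sequence. This is an illustration under stated assumptions: the function name and the example sequence are hypothetical, and a uniform sliding window is a simplification that does not represent how proteolytic processing actually selects fragments.

```python
def candidate_peptides(protein, lengths=(8, 9)):
    """Return every contiguous fragment of the requested lengths.

    A sliding window is used purely for illustration; it is NOT a model
    of antigen processing in the cell.
    """
    peptides = []
    for k in lengths:
        # A sequence of n residues contains n - k + 1 fragments of length k.
        for start in range(len(protein) - k + 1):
            peptides.append(protein[start:start + k])
    return peptides


if __name__ == "__main__":
    toy_protein = "MKTAYIAKQRQISFVKSHF"  # hypothetical sequence, not a real antigen
    fragments = candidate_peptides(toy_protein)
    print(f"{len(fragments)} candidate peptides from {len(toy_protein)} residues")
```

Even a short sequence yields many overlapping candidates, which illustrates why selective binding by the MHC molecule matters in determining which fragments are presented.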

Classical studies on the control of the immune response in mice had indicated that the MHC was responsible for determining whether or not an immune response could be mounted against a given peptide (Benacerraf, 1981). Later it was shown that the ability to mount an immune response correlated directly with the affinity of purified MHC molecules for that peptide (Babbitt et al., 1985; Buus et al., 1986). Thus, MHC molecules dictate, by their ability to selectively bind amino acid sequences from a protein antigen, whether a T-cell response can be generated against that protein antigen (Benacerraf, 1981).

A second mechanism by which class I and class II molecules play an important role in determining the specificity of an individual's immune response is through the selection of the T-cell repertoire during development (Schwartz, 1989; Marrack and Kappler, 1988). This process takes place in the thymus and involves both the positive and negative selection of maturing T-lymphocytes. The positive selection process is poorly understood but is thought to involve the recognition by the T-cell receptor of thymically-expressed self class I or class II molecules, and results in only those T-cells bearing receptors which recognise self MHC molecules reaching the pool of mature T-cells. Positive selection is the means by which all mature T-cells recognise self MHC molecules as restriction determinants for foreign antigens. The process of negative selection eliminates those T-cells bearing receptors which have high affinity for self MHC + self peptide, and which would thus be autoreactive in the periphery. There is good evidence for the role of class II molecules in the negative selection procedure. Using monoclonal antibodies which recognise T-cell receptors containing the Vβ17a epitope it was shown that T-cells carrying such receptors were selectively deleted in the thymuses of mice expressing the I-E antigen (Schwartz, 1989). Negative selection is a major mechanism through which tolerance to self proteins is established, and is thought to involve the recognition by maturing T-cells of thymically expressed MHC molecules presenting peptides from self proteins. However, the clonal deletion process does not remove all autoreactive T-cells, especially in the case of tissue-specific antigens not expressed in the thymus. Potentially autoreactive T-cell clones in the periphery may be controlled by additional mechanisms, such as suppressor T-cells, about which little is known (Schwartz, 1989).

1.2.3. Structure of class I and class II MHC molecules

Class I and class II molecules were initially characterised using standard biochemical techniques (Strominger, 1987). The functional molecules are cell surface glycoproteins. Both are heterodimers of α- and β-chains, as illustrated schematically in Figure 1.3. The mature class I or class II molecule has four extracellular domains, each of approximately ninety amino acids. For class I molecules, three of these domains are contained within the ~43kD α-chain, which is encoded within the MHC. The fourth domain is provided by β2-microglobulin (~12kD), which is encoded on human chromosome 15. In the case of class II molecules, two domains are contained within the ~34kD α-chain and two within the ~29kD β-chain, and both chains are encoded within the MHC.

More recently, the crystal structures of two class I molecules, HLA-A2 and HLA-Aw68, have been determined (Figure 1.4.; Bjorkman et al., 1987a; Garrett et al., 1989). The membrane-proximal domains α3 and β2-microglobulin interact to form the base of the molecule upon which the polymorphic α1 and α2 domains are supported. The α1 and α2 domains interact to form a cleft, the floor of which is composed of eight anti-parallel β-strands and the walls of which are formed by two α-helices. The crystal structure of a class II molecule has not yet been obtained, but a hypothetical structure has been modeled based on the amino acids conserved between class II and class I (Brown et al., 1988). In this model, the α1 and β1 domains interact to form a cleft similar to that observed in the class I structure (Figure 1.4.).

Figure 1.3. Schematic illustration of the structure of class I and class II molecules. MHC class I molecules are expressed by virtually all nucleated cells; MHC class II molecules show restricted expression, e.g. macrophages, B-cells and activated T-cells. TM, transmembrane region. Glycosylation sites are indicated by open circles. Intra-domain disulphide bridges are indicated by S-S.

Figure 1.4. (a) Three-dimensional crystal structure of the class I molecule HLA-A2 (i) from the side, showing the antigen binding groove created by the α1 and α2 domains, and (ii) looking down on the groove from above (from Bjorkman et al., 1987a). (b) Predicted structure of the antigen binding site of a class II molecule (from Brown et al., 1988).

The groove identified in the membrane-distal surface of the class I and class II molecules is thought to be the site of peptide binding. The dimensions of the cleft are appropriate for the accommodation of a peptide of between 8 and 25 amino acids in length, depending on the conformation of the peptide. In the two X-ray diffraction studies it was observed that unidentified material had co-crystallised in the cleft of the class I molecules, and theoretically this could be peptide (Bjorkman et al., 1987a; Garrett et al., 1989). In addition, many of the polymorphic residues and many of the residues predicted from functional studies to be important for peptide binding were found to line the groove (Bjorkman et al., 1987b; Parham et al., 1988; Marsh and Bodmer, 1989). The shape of the groove created by the polymorphic residues in HLA-A2 was quite distinct from that in HLA-Aw68, providing a structural basis for the observed allelic specificity in peptide binding (Garrett et al., 1989).

1.2.4. Peptide binding to class I and class II molecules

Class I molecules, which bind and present antigens derived from intracellular proteins, are thought to become complexed with peptide in the endoplasmic reticulum (ER) or a closely associated compartment (Townsend et al., 1989; Yewdell and Bennink, 1990). On the basis of current evidence, it seems likely that peptides are generated by proteolytic activity in the cytoplasm and then transported into the lumen of the ER where they bind to newly synthesised class I heavy chains (Townsend and Bodmer, 1989; Townsend et al., 1989). Interestingly, a gene encoding a function thought to be important for normal association of class I heavy chains with peptide is probably located within the MHC (section 1.6.2.). Formation of the heavy chain/peptide complex induces association with β2-microglobulin and subsequent export from the ER (Townsend et al., 1989). The class I/peptide complex moves rapidly from the ER to the Golgi apparatus, where modification of carbohydrate occurs, and then to the cell surface, where presentation of the bound antigen to T-lymphocytes occurs (Neefjes et al., 1990).

Class II molecules, which bind and present peptides derived from extracellular proteins, are thought to become complexed with peptide in an endosomal compartment (Yewdell and Bennink, 1990). Newly synthesised class II molecules become associated in the ER with a protein known as invariant chain, which plays a key role in differentiating the exogenous antigen presentation pathway taken by class II molecules from the endogenous antigen presentation pathway taken by class I molecules. In vitro studies with purified invariant chain and class II molecules have shown that binding of invariant chain to class II molecules inhibits the binding of peptides (Teyton et al., 1990). This work led to the suggestion that association of invariant chain with class II could prevent binding of cytoplasmically-derived peptides to class II in the ER. Pulse-chase and immuno-electron microscopy experiments have provided evidence that the class II/invariant chain complex proceeds rapidly from the ER to the Golgi apparatus but, instead of continuing directly to the plasma membrane, is diverted to a novel post-Golgi cytoplasmic compartment where it is retained for several hours before the appearance of mature class II molecules on the cell surface (Neefjes et al., 1990). Targeting of the invariant chain/class II complex to the post-Golgi vesicles is dependent on an amino acid sequence, identified by deletion mapping, in the N-terminus of the invariant chain molecule (Bakke and Dobberstein, 1990). It has been proposed that the compartment to which the complex is targeted may be an endosomal vesicle in which invariant chain becomes dissociated, facilitating binding of class II to peptides which have been generated by proteolysis from endosomally imported exogenous proteins (Neefjes et al., 1990; Teyton et al., 1990). This is consistent with the observation that antigen presentation by class II molecules, but not by class I molecules, is sensitive to inhibitors which disrupt the endosomal trafficking and processing pathway by which extracellular proteins are taken into the cell and degraded (Yewdell and Bennink, 1990).

1.3. Molecular genetics of the MHC

1.3.1. Application of molecular cloning and mapping techniques to the MHC region

With the advent of molecular cloning technology, it soon became apparent that the complexity of the MHC was much greater than had been previously indicated by the classical studies. The power of this technology is illustrated by comparing the map in Figure 1.1., which summarises the classical studies, to that in Figure 1.5.b., which summarises ten years of molecular studies on the MHC region. The MHC provides an excellent example of the way in which a variety of cloning and mapping methods can be used to characterise in detail the molecular genetic organisation of a region of the mammalian genome.

The first important advance made possible by cloning technology was the isolation of probes for the classically defined MHC genes. These facilitated the sequencing, and analysis of gene organisation, of individual loci. In addition, through hybridisation of these probes to total human genomic DNA at reduced stringency, it was revealed that there were many other related sequences in the human genome. For example, class I gene probes detected numerous genomic fragments, which are now known to correspond to seventeen class I genes, whereas only three class I loci had been detected by serological methods. Similarly, class II α- and β-chain probes revealed class II genes in addition to those encoding the serologically defined DP, DQ and DR antigens. These class I- and class II-related genes were isolated by exhaustive screening of genomic and cDNA libraries with probes for the classical MHC genes. All these cross-hybridising human MHC genes have now been isolated, but additional class I- and class II-related sequences may exist which have diverged too greatly to be detected by this approach.

Initially, somatic cell hybrids containing fragments of chromosome 6 and γ-irradiation-induced mutants with deletions in chromosome 6 were used to show that these related genes also mapped to the MHC region. In some cases the relative positions of genes which are in very close physical proximity, such as the four genes in the DP subregion, could be determined by cosmid cloning. However, the most important advance in the understanding of the organisation of the region was facilitated by the development of the powerful long range physical mapping technique of pulsed field gel electrophoresis (PFGE; Schwartz and Cantor, 1984; Carle and Olson, 1984). In conjunction with the numerous MHC probes obtained from gene cloning studies, PFGE allowed the accurate mapping of the genes and subregions which had not been previously linked on overlapping genomic clones.

When the experiments described in this thesis were started most of the known MHC genes had been isolated on single genomic clones or small contigs extending tens of kilobases which were unlinked to one another (Figure 1.5.a.). The PFGE maps revealed that there were large gaps between these cloned regions, which theoretically could contain many additional genes. Recently it has been the aim of several laboratories to clone and analyse the DNA in these intervals, using a 'reverse genetics' approach to detect genes in genomic clones without any knowledge of the functions that those genes might encode. The application of reverse genetics techniques has resulted in a remarkable increase in our understanding of the molecular organisation of the MHC over the last three years (compare Figures 1.5.a. and 1.5.b.).

Details of the molecular genetic organisation of the different MHC regions, as deduced from the application of the approaches described above, are described in the following sections.

1.3.2. The class I region

The class I genes are found in a region spanning approximately 2Mbp at the telomeric end of the MHC. The map order of the classical class I loci, HLA-A, -B and -C, as determined by recombination analysis has been confirmed by PFGE mapping. Thus, HLA-C is 130kb distal to HLA-B and just over 1000kb proximal to HLA-A (Carroll et al., 1987; Pontarotti et al., 1988; Ragoussis et al., 1989).

Hybridisation of class I gene probes to Southern blots of total human genomic DNA revealed the presence of numerous other class I-related sequences (Orr and DeMars, 1983). Fourteen non-A-B-C class I gene sequences, accounting for all the cross-hybridising bands seen on Southern blots, were isolated from genomic libraries. From sequencing studies it was shown that three of these were intact genes while the others were pseudogenes (Koller et al., 1989). The three intact genes are expressed and give protein products and have been designated HLA-E (Koller et al., 1988), HLA-F (Geraghty et al., 1990) and HLA-G (Kovats et al., 1990). However, the functions of these non-classical class I genes are as yet unknown. Using a panel of mutant cell lines with deletions in different regions of the MHC, it was possible to demonstrate that the non-A-B-C sequences mapped to the class I region. Most of the sequences mapped close to, or between, the HLA-A, -B and -C genes, but from recombination analysis it was shown that four (including HLA-F and -G) mapped 8cM telomeric of HLA-A, while a fifth mapped an additional 2cM telomeric (Koller et al., 1989).

Another intact non-A-B-C class I gene, cda12, has been described by Ragoussis et al. (1989), which maps within 50kb of HLA-A. The relationship of this gene to those described by Koller et al. (1989) is currently unclear. In a PFGE mapping study using a class I probe at reduced stringency it was shown that all the class I-related genes were encompassed in a region spanning a total of 2Mbp (Ragoussis et al., 1989; Figure 1.5.b.). In the human genome, the average frequency of recombination between genes is such that a genetic distance of 1cM is roughly equivalent to a physical distance of 1Mbp. This relationship holds well in the proximal portion of the class I region, where the genetic distance between HLA-B and HLA-A is roughly 1cM, and the physical distance is just over 1Mbp. However, the more distal class I genes map within 1Mbp of HLA-A and yet have a genetic separation of up to 10cM (Koller et al., 1989; Ragoussis et al., 1989). This provides evidence for an unusually high frequency of recombination between HLA-A and the more distal class I genes.
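The comparison of genetic and physical distance made above reduces to a simple ratio of cM to Mbp. The following is a minimal sketch in Python using only the rounded figures quoted in the text; the function name and the dictionary labels are illustrative and do not come from the cited studies.

```python
def recombination_rate(genetic_cm: float, physical_mbp: float) -> float:
    """Return the recombination rate of an interval in cM per Mbp."""
    if physical_mbp <= 0:
        raise ValueError("physical distance must be positive")
    return genetic_cm / physical_mbp


# Genome-wide rule of thumb stated in the text: roughly 1cM per 1Mbp.
GENOME_AVERAGE = 1.0

# Rounded values from the text (assumptions for illustration only).
# The distal interval uses the quoted bounds (up to 10cM, within 1Mbp),
# so the resulting figure is indicative rather than a measurement.
intervals = {
    "HLA-B to HLA-A": (1.0, 1.0),
    "HLA-A to distal class I genes": (10.0, 1.0),
}

rates = {
    name: recombination_rate(cm, mbp) for name, (cm, mbp) in intervals.items()
}

for name, rate in rates.items():
    fold = rate / GENOME_AVERAGE
    print(f"{name}: ~{rate:.0f} cM/Mbp ({fold:.0f}x the genome average)")
```

On these rounded figures the distal interval recombines roughly ten times more often per unit of physical distance than the HLA-B to HLA-A interval, which is the basis for the inference of unusually frequent recombination.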

1.3.3. The class II region

The class II genes are found in a region at the centromeric end of the MHC which from PFGE mapping spans approximately 1Mbp (Hardy et al., 1986; Ragoussis et al., 1989; Dunham et al., 1989). The genes which encode the α and β chains of the serologically defined class II antigens DP, DQ and DR are localised in discrete genetic subregions. In addition this region contains class II pseudogenes and genes which are apparently intact but for which no protein product has yet been identified.

The first class II cDNA clones isolated were for the DRA gene. Polysomes from a B-lymphoblastoid cell line were selected with a monoclonal antibody against the DRα chain, and the mRNA from the antibody-bound polysomes was used as a template to make cDNA (Korman et al., 1982a). The cloning of DRA facilitated the cloning of other class II α-chain genes by hybridisation of the DRA probe to genomic or cDNA libraries at low stringency (Spielman et al., 1984; Auffray et al., 1984). Similar approaches were taken to isolate DRB and DQB1 gene clones (Long et al., 1983). Again, these were used to screen libraries at reduced stringency to obtain clones for the other class II β-chain genes. To date the class II region is known to contain 6 α-chain genes and between 7 and 11 β-chain genes, the exact number depending on the haplotype.

The DP subregion

The DP subregion spans 70kb at the proximal end of the class II region and contains two α-chain genes (DPA1 and DPA2) and two β-chain genes (DPB1 and DPB2) which are organised as shown in Figures 1.5. and 4.5. (Trowsdale et al., 1984; Servenius et al., 1984; Okada et al., 1985a). The expressed DP molecule has been demonstrated by transfection studies to be the product of the DPA1 and DPB1 genes (Okada et al., 1985a). The DPA2 and DPB2 genes, in contrast, are non-transcribed pseudogenes, containing frame-shift mutations and defective splice junctions which would prevent the expression of a functional protein product (Gustafsson et al., 1987).

The DQ subregion

The DQ subregion contains five genes, as shown in Figure 1.5.b. The DQA1/DQB1 gene pair and the DQA2/DQB2 (formerly DXα/DXβ) gene pair were initially isolated on two unlinked sets of overlapping genomic clones (Okada et al., 1985b; Auffray et al., 1984; Jonsson et al., 1987). The two gene pairs are highly related at the nucleotide sequence level. 15kb distal to the DQA2 gene is a truncated pseudogene, DQB3 (formerly DVβ), which apparently does not contain the exons encoding the signal sequence or β2 domain (Ando et al., 1989). Sequencing of the remaining exons revealed that DQB3 is more closely related to DQB1 and DQB2 than to other class II β-chain genes. The entire DQ subregion has now been cloned on overlapping cosmids, revealing that the DQA1/DQB1 and DQA2/DQB2 gene pairs are 75kb apart (Blanck and Strominger, 1988). The DQA1 and DQB1 genes are fully functional and encode the serologically defined DQ antigen. Although the DQA2 and DQB2 genes are intact and do not appear to contain any deleterious mutations which would prevent their expression at the RNA or protein level, transcription of these loci has never been detected (Auffray et al., 1987).

The DR subregion

The DR subregion is the only class II subregion known to vary in gene number between individuals. All haplotypes carry a single α-chain gene, DRA, but the number of β-chain genes is variable. DR4 haplotypes, for example, carry four DRB genes, while DR1 and DR8 haplotypes are thought to have only one (Bohme et al., 1985; Andersson et al., 1987). The DR subregions of the DR2, DR3 and DR4 haplotypes have been analysed in detail by cosmid cloning although in none of these studies were all the DR genes linked on a single contig (Rollini et al., 1985; Spies et al., 1985; Andersson et al., 1987; Kawai et al., 1989). The three DRB genes, DRB1, DRB2 and DRB3, from the DR2 haplotype have been linked on overlapping cosmid clones (Kawai et al., 1989). The three genes were contained within a region spanning 80kb. They were not linked to the DRA gene. The DRB2 gene lacked an exon containing the 5' untranslated region, and did not give a cell-surface product when co-transfected with the DRA gene into mouse L-cells, and is therefore probably a pseudogene. The DRB1 and DRB3 genes, in contrast, were expressed with DRA in transfection studies to give functional class II molecules (Kawai et al., 1989). The DR subregions in the DR3 and DR4 haplotypes also contained two functional DRB genes, along with one or two pseudogenes (Rollini et al., 1985; Spies et al., 1985; Andersson et al., 1987).

The DN subregion

The DN subregion is currently defined by a single gene, DNA (formerly DZα). The DNA gene was isolated by screening a human cosmid library with a DRA gene probe at reduced stringency (Spielman et al., 1984). This gene has been shown by physical mapping to be between the DP subregion and the DOB gene (Inoko et al., 1989), but has not been accurately positioned within this interval or oriented on the chromosome. The nucleotide sequence of a DNA genomic clone revealed that the DNA gene has a similar organisation to that of other class II α-chain genes, and that it does not contain any mutations which would prevent transcription (Trowsdale and Kelly, 1985). Indeed, the DNA gene is transcribed in B-lymphocytes, giving a major transcript of 3.5kb and a minor transcript of 1.1kb (Kelly, 1988). The major transcript is thought to be the result of inefficient 3' RNA processing directed by the unusual polyadenylation signal (ACTAAA) found in the DNA gene (Trowsdale and Kelly, 1985). The minor transcript, however, is processed and polyadenylated normally and does not appear to contain any deleterious mutations that would prevent translation, yet no protein product has been described (Young and Trowsdale, 1990). It has been speculated that a DNA gene product might pair with a DOB gene product to form a functional class II molecule, but this is unlikely because transcription of the DNA and DOB genes is not co-ordinately regulated (Tonnelle et al., 1985).

The DO subregion

The DO subregion is currently defined by a single gene, DOB. The DOB gene was first identified by screening a human cDNA library with a mixture of DRB and DPB probes at reduced stringency (Tonnelle et al., 1985). In an independent study DOB was isolated from a human genomic library using a probe for the mouse class II gene Ob (formerly Aβ2) at low stringency (Servenius et al., 1987). The DOB gene is transcribed at low levels in B-lymphocytes but no protein product has been reported. Like the DNA gene, DOB is isolated in the class II region and a closely physically linked α-chain gene has not yet been described. As mentioned above, it is unlikely that any DOB product would pair with an α-chain from the DP, DQ, DR or DN subregions because expression of DOB is not co-ordinately regulated with that of the other class II loci. Specifically, DOB is transcribed only at low levels in B-lymphocytes, and transcription of DOB in fibroblasts is not inducible by γ-interferon (Tonnelle et al., 1985). Thus, it is possible that a DOA gene remains to be discovered, and that a DO molecule exists with a function distinct from the classical class II molecules. The DOB gene was mapped proximal to the DQ subregion (Hardy et al., 1986) and has recently been accurately positioned 45kb proximal to DQB2 by cosmid walking (Blanck and Strominger, 1988; Figures 1.5. and 5.9.).

1.3.4. The class III region

The class III region, which is bounded at the centromeric end by the class II region and at the telomeric end by the class I region, has been shown by PFGE mapping to span about 1.1Mbp (Figure 1.5.; Dunham et al., 1987; Carroll et al., 1987; Ragoussis et al., 1989; Dunham et al., 1990).

With the advent of molecular cloning techniques it became possible to clone the class III complement genes which had been identified by classical studies. The genes for C4, C2 and factor B were all cloned by designing synthetic oligonucleotides based on the known protein sequences and using these to screen cDNA libraries. cDNA probes were then used to screen cosmid libraries, and overlapping clones were obtained which contained these genes (Carroll et al., 1984). The cloned complement gene subregion contained C2, Bf and two C4 genes (C4A and C4B) within 120kb of DNA. The C2 and Bf genes were separated by less than 500bp, while C4A and C4B were about 10kb apart. The products of these genes are serum glycoproteins which are components of the classical (C4 and C2) and alternative (factor B) complement cascades. These cascades ultimately result in the formation of a membrane attack complex, a transmembrane channel which causes lysis and death of the cells of invading micro-organisms (Reid, 1988). The classical pathway is activated by antibody bound to antigen and is therefore a major effector of the humoral immune response. The alternative pathway can be activated directly by the cell surface of the pathogen.

Linkage analysis in pedigrees affected by congenital adrenal hyperplasia caused by steroid 21-hydroxylase deficiency had mapped the defective locus to the class III region. Prompted by this, it was soon demonstrated that the region encompassing the complement genes also contained two genes for steroid 21-hydroxylase (21A and 21B in Figure 1.5.), one just proximal of each of the C4 genes (Carroll et al., 1985; White et al., 1985). 21A is a pseudogene. Deleterious mutations in the functional steroid 21-hydroxylase gene were shown to be responsible for congenital adrenal hyperplasia (White et al., 1985). Deletions and duplications of one or other [21-hydroxylase/C4] gene unit are not uncommon in the population (Carroll and Alper, 1987).

The genes for the related lymphokines tumour necrosis factor (TNF) α and β (Tnfa and Tnfb) were localised to the MHC by analysis of MHC deletion mutants (Spies et al., 1986) and then to the class III region by PFGE mapping (Inoko and Trowsdale, 1987; Dunham et al., 1987; Carroll et al., 1987).

PFGE mapping revealed that the distance between DRA and the complement gene cluster was about 390kb, while the distance between the complement genes and HLA-B was about 580kb (Dunham et al., 1987; Carroll et al., 1987; Ragoussis et al., 1989; Dunham et al., 1990). These intervals were clearly large enough to accommodate numerous genes in addition to the Tnf genes. Intensive efforts to clone the entire class III region were initiated, and these have resulted in over 900kb of DNA being cloned in overlapping cosmids (Spies et al., 1989a, b; Sargent et al., 1989a, b; Kendall et al., 1990; Spies et al., 1990). The remaining interval between the proximal end of the cosmid contig and the DRA gene has now been covered by yeast artificial chromosome clones (Ragoussis et al., 1991b).

Analysis of the cloned regions for coding sequences, using reverse genetics techniques, led to the discovery of at least 26 novel genes in the class III region (Figure 1.5.b.; Levi-Strauss et al., 1988; Spies et al., 1989a, b; Sargent et al., 1989a, b; Morel et al., 1989; Milner and Campbell, 1990; Kendall et al., 1990; Spies et al., 1990). The total number is uncertain because there have been two very recent independent reports of novel genes between 21B and DRA and it is not yet clear whether both groups have identified the same genes (Kendall et al., 1990; Spies et al., 1990). In Figure 1.5.b. the data of Kendall et al. (1990) are shown because this report described seven novel genes whereas Spies et al. (1990) only described six. The density of genes in the class III region is remarkably high. In fact, one of the novel genes (OSG in Figure 1.5.b.) actually overlaps, on the opposite strand, with the 21B gene (Morel et al., 1989).

Three of the novel genes were shown by nucleotide sequencing to be members of the heat shock protein hsp70 multigene family (Sargent et al., 1989a; Milner and Campbell, 1990). Hsp70-1 and hsp70-2 are very closely related, encoding identical protein products. Hsp70-hom shares 90% identity with hsp70-1 but, unlike hsp70-1 and -2, expression of this gene is not heat-inducible.

The nucleotide sequences of a further four of these novel genes (RD, BAT2, BAT3 and OSG) have been published. The RD gene encodes a predicted protein product of 42kD, which contains a novel motif consisting of a reiterated arginine-aspartate dipeptide (Levi-Strauss et al., 1988). BAT2 and BAT3 (G2 and G3 in Figure 1.5.b.) encode proteins of 228 and 110kD respectively, both of which are unusually rich in proline (Banerji et al., 1990). The 'opposite strand gene', OSG, which overlaps with 21B is expressed, like 21B, in steroidogenic adrenal tissue (Morel et al., 1989). This intriguing finding is suggestive of a functional or regulatory relationship between 21B and OSG, but the partial sequence of OSG did not reveal what this might be. In addition to the sequence data from these human genes, a partial sequence has been determined for the mouse homologue of the B144 gene but this did not match any sequences in the nucleotide sequence databases and the function of this gene is not known (Tsuge et al., 1987). No sequence data have yet been published for the other novel genes.

1.4. Evolution of the MHC

1.4.1. Evolutionary relationships between class I and class II genes

Once the genes encoding class I and class II molecules had been cloned it became apparent from the conservation in gene organisation and identity at the nucleotide sequence and predicted amino acid sequence level that they were related. In particular, there was significant homology between the class II α2 domain, the class II β2 domain and the class I α3 domain. These domains were also related at the amino acid sequence level to the conserved 'antibody fold' domain found in members of the immunoglobulin supergene family (Korman et al., 1982b; Hood et al., 1986). The common ancestry of class I and class II molecules is also reflected in their related functions and their related structures (Figures 1.2. and 1.4.). It is not clear when in evolutionary time MHC molecules arose. However, class I and class II genes have now been described in amphibians, birds, reptiles and fish as well as mammals, which implies that class I and class II genes evolved in their present form before the radiation of vertebrates which took place about 400 million years ago (Kaufman et al., 1990; Hashimoto et al., 1990).

The ancestral class II and class I genes have clearly been duplicated many times to give the multiple loci seen in the class II and class I regions of the mammalian MHC. The organisation of the human class II region suggests that the subregions were generated from a primordial α/β gene pair through a series of duplication and divergence events (Klein, 1986; Bodmer et al., 1986).

Comparisons of the nucleotide and amino acid sequences of the class II α-chain genes and their protein products revealed that DPA1, DQA1, DRA and DNA are equally diverged from one another, suggesting that they arose by duplication at roughly the same time (Figure 1.6.; Auffray et al., 1984). A similar analysis of the β-chain genes reveals that DPB1, DQB1 and DRB are equally diverged from one another but that DOB is more distantly related. This suggests that the DOB gene may have arisen before the other β-chain genes in evolutionary time (Tonnelle et al., 1985). Sequence comparisons between genes within subregions reveal that additional duplications have probably occurred much more recently (Figure 1.6.). For example, the DQA1 and DQA2 genes are 99% identical in the exon encoding the α2 domain (Auffray et al., 1987). The duplication which took place in the DP subregion probably occurred earlier than that in the DQ subregion because the DPA2 and DPB2 genes are significantly diverged from DPA1 and DPB1. At the nucleotide level, the identity between DPA1 and DPA2 is 76%, while between DPB1 and DPB2 it is 86%. The increased divergence of the DPA2 gene compared to DPB2 suggests that DPA2 became a pseudogene before DPB2 (Gustafsson et al., 1987).
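Identity figures such as the 76% and 86% quoted above are calculated from aligned sequences. The following is a minimal sketch in Python, assuming two pre-aligned sequences of equal length in which '-' marks a gap. The two short sequences are invented for illustration and are not DP gene sequences, and the function name is hypothetical. Conventions differ on how gapped columns are treated; here any column with a gap in either sequence is skipped, which is an assumption and not necessarily the convention used in the cited studies.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over ungapped aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns containing a gap (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Invented toy alignment: 11 ungapped columns, 9 of them identical.
toy_a = "ATGGCCTTAGAC"
toy_b = "ATGGACTT-GAT"
identity = percent_identity(toy_a, toy_b)
print(f"identity: {identity:.1f}%")
```

Under a simple model in which differences accumulate with time, a lower identity between two duplicated genes points to an older duplication or to a longer period free of selective constraint, which is the reasoning applied to the DP and DQ gene pairs.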

1.4.2. Organisation of the MHC region in different species

As the sequences of the class II genes of mouse and man were deduced, it became apparent that the subregions were homologous. Thus, the DRA and Ea genes are more closely related to one another than to the other loci, as are the DRB and Eb genes (Kaufman et al., 1984; Denaro et al., 1985). In the same way the DQA and DQB genes are related to the mouse Aa and Ab genes respectively (Kaufman et al., 1984). The isolated mouse β-chain gene, Ob (formerly Aβ2), is homologous to the isolated human β-chain gene, DOB (Larhammar et al., 1985; Tonnelle et al., 1985; Servenius et al., 1987). Finally, the isolated mouse β-chain pseudogene Pb (formerly Aβ3) was found to be most related to the sequences of the two β-chain genes in the human DP subregion (Widera and Flavell, 1985). When the organisation of the genes in the human and mouse class II regions was deduced by cosmid cloning and physical mapping it was revealed that the relative positions of the homologous subregions are conserved between the two species (Steinmetz et al., 1986; Hardy et al., 1986). These sequence comparison and gene mapping data have led to the hypothesis that the overall organisation of the class II region was established before the radiation of rodents and primates about 135 million years ago (Bodmer et al., 1986; Klein, 1986). Differences between the two class II regions, such as the duplication of the human DQA1/DQB1 gene pair to give DQA2 and DQB2, and the deletion from the mouse class II region of sequences homologous to DPA1 or DPA2, have most likely occurred more recently. In contrast to the distinct conservation of subregions seen in the mouse and human class II regions, the class II genes of the chicken MHC are more closely related to one another than they are to DR, DQ or DP. Thus, the multiple chicken class II genes probably arose independently by duplication from a primordial class II gene pair after the separation of the mammalian and avian lineages about 300 million years ago (Kroemer et al., 1990).

Figure 1.6. Evolutionary tree for the class II genes. HLA and Ig refer to the primordial MHC and immunoglobulin genes. ABC refers to the primordial class I gene. Dα and β refer to the primordial class II α- and β-chain genes. Subsequent events involve pairwise combinations of α and β genes except where indicated. Approximate divergence times are indicated in millions of years (from Bodmer et al., 1986). DZ is now known as DN; DX is now known as DQ2.

As in the human MHC, the mouse class I and class II regions are separated by a class III region (Figure 1.7.). The class III regions of mouse and man seem to be conserved, although many of the most recently discovered genes in the human class III region have not yet been mapped in the mouse. Both class III regions contain genes for steroid 21-hydroxylase and the complement components C4, C2 and factor B; the organisation of these closely linked genes is similar in the two species (Chaplin, 1985). The human RD gene has a homologue in the analogous position in the mouse class III region (Levi-Strauss et al., 1988). The B144, Tnfa, Tnfb and BAT1 genes have recently been linked on overlapping cosmids in mouse, and their order is conserved relative to the human class III region (Wroblewski et al., 1990). A mouse hsp70 gene has been localised to the class III region by recombination analysis but its position relative to other class III loci is not yet known (Gaskins et al., 1990).

The major difference in organisation between the human and mouse MHCs is in the class I region. The human MHC contains seventeen class I genes, while the mouse MHC contains between 26 and 33 class I genes depending on the haplotype (Flavell et al., 1986). As mentioned previously, the human class I genes all lie telomeric of the class III region (Figure 1.7.). Most mouse class I genes are also found in a region telomeric of the class III region. However, a pair of class I genes, K and K2, are present at the centromeric end of the class II region, about 70kb proximal to Pb (Figure 1.7.; Widera and Flavell, 1985; Steinmetz et al., 1986). These genes are proposed to have arisen by duplication in the Qa region of the mouse class I gene cluster, and to have subsequently moved to their current location by an intrachromosomal double cross-over event (Bodmer, 1981; Weiss et al., 1984). This organisational change is believed to have occurred since the radiation of rodents and mammals because it has only been found in the MHCs of the closely related rat and mouse (Klein, 1986).

1.5. Clinical relevance of the class II region

The extreme polymorphism of the MHC genes and gene products has provided a powerful system of markers to test for association between the MHC and diseases. In population association studies, the frequency of a particular MHC allele is tested (using serological reagents, or, more recently, using RFLPs or sequence specific oligonucleotides) in a group of individuals with a particular disease and a group of healthy matched controls. The frequencies of each allele in the two groups are calculated and subjected to statistical tests to determine whether there is a significantly increased or decreased frequency of a given allele between the two groups. If a significant difference is found, that allele is considered to be associated with the disease.
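The comparison described above can be illustrated with a short calculation. The following is a minimal sketch and not the procedure used in the studies cited: it assumes that the data have been reduced to a 2x2 table of allele-positive and allele-negative individuals among patients and controls, and it uses the cross-product (odds) ratio together with an uncorrected Pearson chi-square test with one degree of freedom. The function name and the counts are hypothetical.

```python
import math


def association_test(patients_pos, patients_neg, controls_pos, controls_neg):
    """Return (odds ratio, chi-square, p-value) for a 2x2 allele-by-disease table.

    Assumes all four counts are non-zero; a zero cell would need a
    continuity correction, which this sketch does not apply.
    """
    a, b, c, d = patients_pos, patients_neg, controls_pos, controls_neg
    n = a + b + c + d
    # Cross-product ratio: odds of carrying the allele in patients vs controls.
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square for a 2x2 table (no Yates correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Hypothetical study: 100 patients (40 allele-positive) and
# 100 controls (32 allele-positive).
odds_ratio, chi2, p_value = association_test(40, 60, 32, 68)
print(f"odds ratio = {odds_ratio:.2f}, chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

With these invented group sizes an allele frequency of 40% against 32% gives an odds ratio of about 1.4 but does not reach significance, which illustrates why weak associations require large, well matched cohorts.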

This approach has revealed significant associations between the MHC and over 40 diseases (Table 1.2.; Tiwari and Terasaki, 1985). Many of the associations are strongest with the class II region; these include insulin-dependent diabetes mellitus (IDDM), narcolepsy and Hodgkin's lymphoma. The associations are never complete, reflecting the fact that these diseases usually have complex genetics and that the final development of the disease phenotype is probably dependent on a combination of environmental and genetic factors. Part of the genetic component is generally considered to be contributed by a 'disease susceptibility' gene in the class II region which predisposes an individual towards developing a certain disease (Bell and Todd, 1989).

                                                         Frequency (%)
Condition                             HLA allele         Patients  Controls  Relative risk

Hodgkin's disease                     A1                 40        32.0      1.4
Idiopathic hemochromatosis            A3                 76        28.2      8.2
                                      B14                16        3.8       4.2
Behçet's disease                      B5                 41        10.1      6.3
Congenital adrenal hyperplasia        B47                9         0.6       15.4
Ankylosing spondylitis                B27                90        9.4       87.4
Reiter's disease                      B27                79        9.4       37.0
Acute anterior uveitis                B27                52        9.4       10.4
Subacute thyroiditis                  B35                70        14.6      13.7
Psoriasis vulgaris                    Cw6                87        33.1      13.3
Dermatitis herpetiformis              DR3                85        26.3      15.4
Celiac disease                        DR3                79        26.3      10.8
                                      DR7                Also increased
IgA deficiency in blood donors        DR3                64        26.3      5.0
                                      DR7                Also increased
Sicca syndrome                        DR3                78        26.3      9.7
Idiopathic Addison's disease          DR3                69        26.3      6.3
Graves' disease                       DR3                56        26.3      3.7
Insulin-dependent diabetes mellitus   DR3 and/or DR4     91        57.3      7.9
                                      DR2                10        30.5      0.2
Myasthenia gravis                     DR3                50        28.2      2.5
Systemic lupus erythematosus          DR3                70        28.2      5.8
Idiopathic membranous nephropathy     DR3                75        20.0      12.0
Zw*-immunized mothers                 DR3                95        15        113
Narcolepsy                            DR2                100       22
                                      B7                 Also increased
Multiple sclerosis                    DR2                59        25.8      4.1
Optic neuritis                        DR2                46        25.8      2.4
C2 deficiency                         DR2 B18
Goodpasture's syndrome                DR2                88        32.0      15.9
Rheumatoid arthritis                  DR4                50        19.4      4.2
Pemphigus (in Jews)                   DR4                87        32.1      14.4
IgA nephropathy                       DR4                49        19.5      4.0
Hydralazine-induced SLE               DR4                73        32.7      5.6
Postpartum thyroiditis                DR4                72        32.2      5.3
Hashimoto's thyroiditis               DR5                19        6.9       3.2
Pernicious anemia                     DR5                25        5.8       5.4
Juvenile rheumatoid arthritis         DRw8               23        7.5       3.6
Primary glomerulonephritis            C4B*2.9            25        1.5       22.0

Table 1.2. Associations between the MHC and diseases. The relative risk shows how many times more frequently the disease occurs amongst individuals with a particular MHC allele compared to those without that allele (from Klein, 1986).

Interpretations of the results of disease association studies must take into account the phenomenon of linkage disequilibrium. Infrequent recombination between loci results in linkage disequilibrium between alleles at those loci in the population; that is, the frequency at which alleles at two loci are found together in the population is greater than would be expected from the individual frequency of the alleles. It is well documented that there is very strong linkage disequilibrium between genes in the MHC region (Klein, 1986). In the class II region, alleles at the DR and DQ loci are particularly strongly associated. DP alleles in general are not strongly associated with the DQ/DR region, providing evidence for a hot spot of recombination in the proximal class II region, although in some haplotypes significant disequilibrium between DP and DR/DQ is seen (Rosenberg et al., 1989). The practical significance of linkage disequilibrium is that if a disease association is found with a particular class II allele, any gene (including known class II genes and, theoretically, novel genes) in linkage disequilibrium with that allele is a candidate for the disease susceptibility gene (Bell and Todd, 1989).
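Linkage disequilibrium as defined above, the excess of an observed haplotype frequency over the product of the individual allele frequencies, can be expressed numerically. The sketch below is illustrative only: it assumes that haplotype and allele frequencies are already known, uses the standard coefficient D and its normalised form D', and the function name and example frequencies are hypothetical rather than taken from any study cited here.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, D') for alleles A and B at two loci.

    p_ab: observed frequency of the haplotype carrying both A and B
    p_a, p_b: population frequencies of allele A and allele B
    """
    # D is the excess of the observed haplotype frequency over the
    # frequency expected if the alleles were associated at random.
    d = p_ab - p_a * p_b
    # D' scales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime


# Hypothetical frequencies: allele A at 20%, allele B at 25%,
# and the A-B haplotype observed at 15% (5% expected under random association).
d, d_prime = linkage_disequilibrium(0.15, 0.20, 0.25)
print(f"D = {d:.3f}, D' = {d_prime:.2f}")
```

A D' close to 1 would indicate that the two alleles are almost always inherited together, so an association detected with one marker cannot by itself distinguish between the linked genes.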

Many of the diseases which are associated with the class II region have an autoimmune pathology and consequently most efforts to explain these associations have focussed on the genes encoding the classical class II antigens. Since the products of these genes clearly play an important role in the control of the immune response (section 1.2.2.), the classical class II genes are intuitively excellent candidates for autoimmune disease susceptibility genes. The sequences of numerous class II alleles with which disease associations have been detected, or which are in linkage disequilibrium with the marker alleles, have been determined in an attempt to identify shared residues or epitopes in the encoded class II molecules which could explain why particular haplotypes predispose to disease (Bell and Todd, 1989). As a result of these studies, particular class II antigens have been strongly implicated in the mechanism of development of IDDM and rheumatoid arthritis (Todd et al., 1988; Nepom, 1990). Shared amino acids in these allelic products are proposed to give them common structural features which fail to delete potentially autoreactive T-cells in the thymus or which facilitate presentation of critical autoantigens in the periphery (Bell and Todd, 1989).

There is, however, still no proof that the class II genes implicated by these studies explain the association of any disease with the class II region. Thus it is possible that the true autoimmune disease susceptibility genes are as yet undiscovered class II genes or genes encoding accessory functions in antigen presentation (section 1.6.). Furthermore, not all class II-associated diseases have an autoimmune pathology. Almost 100% of individuals suffering from the sleep disorder narcolepsy have DR2(DRw15) yet show no evidence of an autoimmune response (Aldrich, 1990). Other examples of non-autoimmune diseases in association with the class II region include Hodgkin's lymphoma, chronic lymphocytic leukaemia and acute non-lymphocytic leukaemia, all of which are associated with DP alleles (Bodmer et al., 1989a; Pawelec et al., 1989). In these cases the association with the class II region may be explained by the presence of novel genes with non-immunological functions. A precedent for this latter situation in the human MHC was the association of congenital adrenal hyperplasia with HLA-Bw47. The molecular basis of this disease was actually a deletion in the functional steroid 21-hydroxylase gene which was in linkage disequilibrium with HLA-Bw47 (White et al., 1985). Thus, a complete understanding of class II-disease associations will depend on characterising all of the genes in this region and determining their function.

1.6. Evidence for other genes in the class II region

1.6.1. Novel class II genes

Evidence for the existence of a novel human class II antigen has been reported by Carra and Accolla (1987), who isolated a monoclonal antibody which immunoprecipitated a class II molecule from B-lymphoblastoid cell line lysates after the lysates had been cleared of DR, DQ and DP molecules. The immunoprecipitated molecules contained α- and β-chains as judged by SDS-polyacrylamide gel electrophoresis, but 2-D peptide mapping studies revealed that these were distinct from the DR, DQ and DP α- and β-chains present in the parental cell line. As yet, the genes encoding this novel antigen have not been identified.

It may be that the novel class II molecule described in this study is a product of the DNA locus with that of a hitherto undiscovered DNB locus, or a product of the DOB locus with that of a hitherto undiscovered DOA locus. As discussed in section 1.3.3., both the DNA and DOB genes are transcribed and the nucleotide sequences of the corresponding cDNA clones did not reveal any deleterious mutations which would prevent translation (Tonnelle et al., 1985; Young and Trowsdale, 1990). DNA homologues have been described in whale and rabbit, and the rabbit gene is expressed, but the potential of these genes to encode a protein product has not yet been determined (Kulaga et al., 1987; Trowsdale et al., 1989). The mouse homologue of DOB, Ob, is also transcribed and potentially functional, and intuitively it seems likely that DOB, which clearly arose before the major mammalian radiation, would only be maintained in both mouse and man in a potentially functional form if it did indeed encode a protein product. It is therefore possible that the human class II region contains functional DNB or DOA genes. Alternatively, the class II molecule described by Carra and Accolla (1987) may be the product of a completely novel class II α/β gene pair. The approaches taken previously to isolate novel α- and β-chain genes involved the screening of genomic or cDNA libraries with class II gene probes at reduced stringency. These methods would not detect diverged class II genes. It should be mentioned that any novel class II gene would not necessarily be encoded within the class II region, although class II genes unlinked to the MHC region have not yet been described in any species.

1.6.2. Class I-modifying locus

Evidence for a non-class II gene potentially mapping in the class II region has come from studies of a mutant human B-lymphoblastoid cell line LBL 721.174 which has a defect in the presentation of antigen to T-cells by class I molecules (DeMars et al., 1985; Cerundolo et al., 1990). Class I molecules present short peptide fragments derived from the degradation of intracellular proteins, such as viral proteins, to cytotoxic T-lymphocytes (CTL), which then become activated to kill the infected cell. The class I molecules in LBL 721.174 are functionally normal because they could present exogenously added peptide fragments to CTL such that killing ensued. However, virally infected LBL 721.174 cells were not killed and the class I molecules were retained in the endoplasmic reticulum (which is where antigen binding to class I molecules is thought to occur) instead of progressing to the cell surface. This phenotype has been interpreted as showing that LBL 721.174 has a defect in the transport of intracellularly derived peptides from their site of generation in the cytosol to their site of class I binding in the ER (Cerundolo et al., 1990). Understanding the molecular basis of the defect in this mutant will provide important clues as to the way in which class I molecules become complexed with antigen, about which little is known. LBL 721.174 has a deletion spanning from the DPB2 gene to the complement gene cluster, and the gene responsible for the antigen presentation defect is therefore likely to map within this interval (Cerundolo et al., 1990; DeMars et al., 1985). A strikingly similar phenotype has also been described in the rat, and in this case the gene responsible for the defect in class I antigen presentation could be mapped to the class II region of the rat MHC (Livingstone et al., 1989). Like the mouse class II region (section 1.4.2.), the rat class II region is homologous to the human class II region, and therefore the observation of Livingstone et al. (1989) provides additional evidence for the presence of a class I-modifying locus in the human class II region.

1.6.3. LMP antigens

Evidence for novel genes in the mouse MHC was obtained when an antiserum made between mice differing only in the MHC region was found to precipitate a large (~580kDa) multisubunit protein complex from a mouse macrophage cell line (Monaco and McDevitt, 1982). The complex was composed of a large number of noncovalently linked low molecular weight polypeptide (LMP) subunits which were biochemically, serologically and genetically distinct from the class I, class II and class III gene products. The subunits ranged in molecular weight from 12-35kD. Two of the subunits displayed electrophoretic polymorphism and both the polymorphisms mapped by recombination analysis within the mouse class II region. Lmp-7 was localised between Pb and Ab, while Lmp-2 was mapped slightly more accurately, between Pb and Ob (Monaco and McDevitt, 1986; Steinmetz et al., 1986). Neither gene has yet been cloned. The genes for the other fourteen LMP subunits could not be mapped in this study because their products were not polymorphic. A biochemically similar complex has also been described in human cells (Monaco and McDevitt, 1984). Therefore, given that the human and mouse class II regions are homologous, it is possible that the human class II region also encodes LMP subunits. The function of the LMP complex is unknown, although the fact that it is expressed in macrophages and lymphocytes and is inducible by γ-interferon, like class II molecules, has led to speculation that it may provide some accessory function in antigen presentation (Monaco and McDevitt, 1986). The unusual properties of the LMP complex are also found in the eukaryotic multicatalytic proteinase (Rivett, 1989). This broad specificity non-lysosomal endopeptidase complex of about 600kD is composed of at least thirteen distinct subunits with molecular weights of between 20 and 35kD. An intriguing possibility is that the LMP complex is the same as, or closely related to, the high molecular weight proteinase, and generates peptides from intracellular proteins for presentation by MHC molecules (Parham, 1990).

1.7. Approaches to finding novel genes

There is substantial circumstantial evidence to support the hypothesis, on which the experiments described in this thesis are based, that there are previously undiscovered genes in the class II region of the human MHC. To summarise briefly the preceding sections, evidence that there may be additional class II genes in the class II region comes from (i) immunoprecipitation of non-DP-DQ-DR class II molecules from B-lymphoblastoid cell lines and (ii) the finding of two potentially functional class II genes, each without a 'partner'. Evidence for non-class II genes in the class II region comes from (i) studies of class I antigen presentation mutants in human and rat which implicate a class II-encoded locus and (ii) the mapping of genes encoding two subunits of a novel multisubunit complex to the mouse class II region. In addition, disease association studies can be interpreted in terms of previously undiscovered genes (both class II and non-class II) in the class II region.

In order to analyse a region of the genome of interest for coding sequences in the absence of detailed knowledge about the function of any gene which might be present, a 'reverse genetics' strategy can be applied (Orkin, 1986). Reverse genetics methods have provided spectacular successes in the cloning of the genes causing diseases which could not be analysed using more traditional approaches, such as Duchenne muscular dystrophy and cystic fibrosis (Monaco and Kunkel, 1987; Rommens et al., 1989). Over the last three years, this approach has also proved highly successful in the discovery of novel genes in the class III region of the human MHC, as described in section 1.3.4. Some of the methods which can be used have already been alluded to, but the principles will now be described in more detail.

Once the genomic region of interest has been identified it is mapped and cloned. The most powerful technique for long-range physical mapping is currently pulsed field gel electrophoresis (Barlow and Lehrach, 1987), although irradiation-induced hybrids and naturally occurring chromosomal translocations and deletions have also proved extremely useful for determining the relative order of probes in a given region (Hastie et al., 1988; Cox et al., 1990). The region of interest is then cloned, using the previously mapped probes as starting points for the isolation of genomic clones. Genomic libraries in bacteriophage λ and cosmid vectors have previously proven successful, although the relatively small insert sizes in these systems can make the cloning of large regions of DNA extremely laborious. More recently, however, yeast artificial chromosome (YAC) vectors have been developed in which hundreds of kilobases of genomic DNA can be cloned and propagated (Schlessinger, 1990). Jumping libraries and linking libraries provide additional sources of cloned genomic material (Poustka and Lehrach, 1986). The use of rare-cutter jumping libraries has the additional advantage that the cloned regions may contain CpG islands which are frequently found at the 5' ends of genes (see below and Chapter 3).

The methods used to search for genes in a cloned region are based on the known properties of transcribed sequences. For example, many genes are associated with CpG islands, short (1-2kb) regions of DNA which are unusually rich in unmethylated CpG dinucleotides (Bird, 1987). The function of these regions is not yet known, but they provide convenient diagnostic markers for expressed sequences, and can be detected because they contain sites for certain restriction endonucleases which otherwise cleave DNA very infrequently (Brown and Bird, 1986; Lindsay and Bird, 1987). Another diagnostic feature of coding sequences is that they are likely to be conserved in the genomes of other organisms. Cross-hybridisation of fragments from genomic clones to genomic DNA from other species has proved a successful approach to identifying genes (Monaco et al., 1986). Furthermore, transcribed regions can be detected by hybridising fragments from genomic clones to northern blots or to cDNA libraries (Monaco et al., 1986). Very recently, 'exon-trapping' techniques have been developed to test for the presence of splice junctions in a region of cloned DNA (Duyk et al., 1990).
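Where sequence is available, the properties of CpG islands described above can also be examined computationally. The sketch below is an illustration only and is not a method used in this thesis, where islands were detected with restriction enzymes: it computes the G+C content and the observed/expected CpG ratio of a sequence and locates recognition sites for the two rare-cutting enzymes used for the jumping libraries, NotI (GCGGCCGC) and BssHII (GCGCGC). The function names and the test sequence are hypothetical, and sequence alone says nothing about methylation.

```python
# Recognition sequences of two rare-cutting enzymes whose sites are rich in CpG.
RARE_CUTTERS = {"NotI": "GCGGCCGC", "BssHII": "GCGCGC"}


def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were distributed independently: c * g / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def find_sites(seq, site):
    """Return the 0-based start positions of every (possibly overlapping) site."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


# Hypothetical fragment containing one NotI and one BssHII site.
fragment = "ATGCGGCCGCTTGCGCGCAA"
gc_fraction, obs_exp = cpg_stats(fragment)
print(f"G+C = {gc_fraction:.2f}, CpG observed/expected = {obs_exp:.2f}")
for enzyme, site in RARE_CUTTERS.items():
    print(enzyme, find_sites(fragment, site))
```

Bulk vertebrate DNA is depleted in CpG, so a window with high G+C content and an observed/expected ratio approaching 1 would be a candidate island; any threshold applied to these values is an assumption to be chosen by the user.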

These approaches are theoretically applicable to any part of the human genome, including the class II region. In fact, as described in the preceding sections, many steps have already been taken towards characterising the class II region at the molecular level. In particular, physical maps of the region are available which can be used as a basis for further work, and numerous genes have been cloned which are suitable for use as probes in the construction of more detailed maps and the isolation of new genomic clones. The following chapters of this thesis describe the application of molecular mapping and cloning techniques in the class II region to determine the likely positions of genes, to obtain genomic clones covering these regions, to identify genes within these clones, and to characterise the novel genes thus found.

2. Materials and methods

2.1. Bacterial cell culture

Liquid cultures were grown by inoculating L-broth containing the appropriate antibiotic with a single bacterial colony and shaking vigorously overnight at 37°C. To grow bacterial colonies, cells were streaked onto the surface of L-agar plates containing the appropriate antibiotic and incubated with the plates inverted at 37°C overnight.

L-broth
Bacto-tryptone 10g/litre
Bacto-yeast extract 5g/litre
NaCl 10g/litre
Sterilised by autoclaving.

L-agar plates
1.5% (w/v) Bacto-agar in L-broth. Sterilised by autoclaving. Solid media were melted by microwaving and cooled to 50°C before adding antibiotics.

Antibiotics
Ampicillin: Anhydrous ampicillin was dissolved in double distilled water (DDW) at 50mg/ml, sterilised by filtration and used at a final concentration of 50µg/ml.
Tetracycline: Tetracycline hydrochloride was dissolved in DDW at 12.5mg/ml, sterilised by filtration and used at a final concentration of 12.5µg/ml.
Kanamycin: Kanamycin sulphate was dissolved in DDW at 10mg/ml, sterilised by filtration and used at a final concentration of 10µg/ml.

Antibiotics were stored at -20°C.
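Each antibiotic stock above is 1000 times more concentrated than its working concentration, so the volume of stock needed follows directly from the volume of medium. The short calculation below is a convenience sketch; the function name and the culture volumes are hypothetical examples.

```python
def stock_volume_ul(final_ug_per_ml, stock_mg_per_ml, medium_ml):
    """Volume of stock (in µl) giving the required final concentration.

    A stock of X mg/ml contains X µg per µl, so the mass required
    (final concentration x medium volume, in µg) divided by X gives µl.
    """
    return final_ug_per_ml * medium_ml / stock_mg_per_ml


# Ampicillin at 50µg/ml in a hypothetical 400ml culture, from a 50mg/ml stock.
print(stock_volume_ul(50, 50, 400), "µl ampicillin stock")
# Tetracycline at 12.5µg/ml in a hypothetical 5ml culture, from a 12.5mg/ml stock.
print(stock_volume_ul(12.5, 12.5, 5), "µl tetracycline stock")
```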

2.2. Screening of recombinant DNA libraries

2.2.1. Cosmid libraries

The cosmid library used in this study, a gift from Dimitri Kioussis (National Institute of Medical Research, London), was made using genomic DNA from the human T-cell line HPB.ALL in the cosmid vector cos202 (ampicillin resistant). 10⁶ recombinants were plated out and screened as follows. The titre of the cosmid library was first determined by preparing a 10⁻² dilution of the frozen library stock (10µl stock in 990µl L-broth) and plating out dilutions of this on L-agar plates containing ampicillin. These plates were incubated at 37°C overnight and the 10⁻² dilution was stored at 4°C. The following day the number of colonies obtained for each dilution was counted and used to calculate the volume of the 10⁻² dilution required to yield 250 000 colonies (the optimal number for a 20x20cm filter). Four 20x20cm Hybond-N membranes (Amersham) marked with an asymmetric pattern of dots were lowered onto the surface of four 245x245mm L-agar plates (Nunc). The calculated volume of the 10⁻² dilution was then spotted on to the surface of each filter and spread evenly using a flamed glass spreader. These plates, the master plates, were incubated overnight at 37°C. The next day eight additional 245x245mm plates, the replica plates, were poured and overlaid with nylon membranes. To prepare duplicate replicas of the first master, the filter from the first master plate was lifted off the agar with Millipore forceps and placed colony side up on three thicknesses of Whatman 3MM paper. The wetted filter from the first replica plate was then overlaid on this filter, covered with further layers of Whatman and pressed on to the master. The master pattern of spots was marked on to the replica filter before the two were pulled apart and the replica was returned to its plate. The same master filter was then overlaid with a second replica filter and the process repeated. The first master filter was then returned to its plate. When duplicate replicas had been prepared from the four master filters, all twelve plates were incubated at 37°C until the colonies had regrown. The master plates were then sealed with Parafilm and stored at 4°C.

The filters from the replica plates were removed from the agar surface and processed by placing them successively (colony side up) on pads of Whatman soaked in denaturing solution (7min) and neutralising solution (2x 3min). The filters were then rinsed by immersion in 2x standard saline citrate (SSC) and air dried before the DNA was fixed by baking for 2hr at 80°C or UV irradiation at 0.4 J/cm². The membranes were incubated for 1-2hr at 42°C in 500ml TENS buffer and the bacterial debris removed with a rubber policeman. After a final rinse in 2x SSC the filters were ready for hybridisation.

Following hybridisation of the replica filters with the probe of interest, a positive hybridisation signal present in duplicate was identified on the autoradiographs and used to pinpoint on the appropriate master plate the position of the region containing the positive colony. Bacterial cells were removed from this region with a toothpick or flamed loop and transferred to 1ml L-broth. After vigorous mixing to disperse the cells, dilutions of this stock were made and plated out on 140mm L-agar plates containing ampicillin and incubated at 37°C overnight. A plate bearing well separated colonies was selected for the preparation of duplicate colony lifts. A nylon membrane of appropriate size was overlaid on the agar surface for 1min. During this time the position of the filter was marked by piercing with a needle and syringe containing ink. The filter was then removed and placed successively (colony side up) on pads of Whatman soaked in denaturing solution (7min) and neutralising solution (2x 3min). The filter was briefly rinsed in 2x SSC and the DNA fixed onto the surface by baking or UV irradiation. Meanwhile, a second membrane was overlaid on the surface of the plate and marked in the same position with the needle and ink. This replica was processed in the same way as the first. The secondary plate was incubated at 37°C until the colonies had regrown, then stored at 4°C. Following hybridisation of the secondary filters with the probe of interest the autoradiographs were used to identify the positions of individual positive colonies on the plates. These were picked and the cosmid DNA purified as described in section 2.5.1.

Denaturing solution
1.5M NaCl
0.5M NaOH

Neutralising solution
1.5M NaCl
0.5M Tris.Cl pH7.2
1mM EDTA

20x SSC
3M NaCl
0.3M sodium citrate

TENS buffer
50mM Tris.Cl pH8.0
1mM EDTA
1M NaCl
0.1% SDS
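The titre calculation in section 2.2.1., in which colony counts from test platings are used to find the volume of the 10⁻² dilution that yields 250 000 colonies per filter, can be written out as follows. This is a minimal sketch under stated assumptions: the colony count, the further dilution factor and the plated volume are hypothetical values, and the function names are introduced for illustration only.

```python
def cfu_per_ml(colonies_counted, volume_plated_ml, fold_dilution):
    """Colony-forming units per ml of the 10^-2 library dilution.

    fold_dilution is the further dilution made from the 10^-2 dilution
    before plating (for example 1000 for a 1 in 1000 dilution).
    """
    return colonies_counted * fold_dilution / volume_plated_ml


def volume_for_target_ml(target_colonies, titre_cfu_per_ml):
    """Volume (ml) of the 10^-2 dilution expected to give the target colony number."""
    return target_colonies / titre_cfu_per_ml


# Hypothetical test plating: 150 colonies from 0.1ml of a further 1 in 1000 dilution.
titre = cfu_per_ml(150, 0.1, 1000)
volume = volume_for_target_ml(250_000, titre)
print(f"titre = {titre:.3g} cfu/ml; spread {volume * 1000:.0f} µl per 20x20cm filter")
```

Averaging the titre over several countable dilution plates before computing the volume would reduce the effect of sampling error on the final colony density.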

2.2.2. cDNA libraries

The cDNA libraries used in this study were constructed according to the method of Seed (1987) in the plasmid vector CDM8 or derivatives thereof. For each library, 2.5x10⁵-1x10⁶ clones, propagated in E. coli MC1061/p3, were plated out and screened in the same way as the cosmid library, using ampicillin + tetracycline selection. Once a single positive colony had been identified, plasmid DNA was purified as described in section 2.5.1. The CEM (T-cell line) cDNA library was a gift from Jenny Dunne (Lymphocyte Molecular Biology Laboratory, ICRF). The JY (B-cell line) and γ-interferon induced macrophage (U937) cDNA libraries were a gift from Dr. David Simmonds, Oxford.

2.2.3. Jumping libraries

The rare-cutter jumping libraries used in this study were a gift from Annemarie Poustka (German Cancer Research Centre, Heidelberg, FRG). Both libraries were constructed as described in Poustka and Lehrach (1988; see also Figure 5.5.). The NotI jumping library was constructed from NotI-cut human genomic DNA which was circularised around the marker plasmid and then re-cut with BamHI. The BssHII jumping library was constructed from BssHII-cut human genomic DNA which was circularised around the marker plasmid and then re-cut with BamHI+HindIII. In both cases the marker plasmid was pMLS-Mlu-Not, which carries the supF gene, and the vector was a modified form of the bacteriophage λ strain NM1151 which contains amber mutations that are suppressed in the presence of supF. Jumping clones were propagated as temperature sensitive lysogens of phage λ recombinants in E. coli MC1061/p3 (ampicillin + tetracycline selection). The library was handled and screened in the same way as the cosmid library except that incubations were done at 30°C to maintain lysogenic growth. However, once an individual positive bacterial colony was detected by secondary screening the phage were induced to undergo lytic growth so that preparations of λ DNA could be made. A positive colony was inoculated into 5ml L-broth containing tetracycline and ampicillin and shaken at 30°C overnight. 1ml was diluted into 50ml L-broth containing antibiotics and shaken at 30°C for 90min. The lytic growth cycle was then induced by shaking the culture at 42°C for 20min. Phage DNA was prepared as described in section 2.6., starting with the step to pellet the lysed bacterial cells.

2.3. Subcloning of DNA fragments

DNA fragments were subcloned into the plasmid vector Bluescript (ampicillin resistant), which has a useful polylinker cloning site and carries the lacZ gene which facilitates colour selection of insert-carrying recombinants. Vector for subcloning was cleaved to completion (as judged by testing a sample on an agarose gel) with the appropriate restriction endonuclease(s) and then purified by phenol extraction and ethanol precipitation. Insert DNA was prepared by excising the desired fragment from a 1x TAE agarose gel and purifying the DNA using Geneclean (Bio 101). For the ligation reaction, vector and insert DNA were mixed in the ratio [1 vector terminus: 2 insert termini] and incubated overnight at 16°C with 1µl 10mM ATP, 1µl 10x ligation buffer, 1µl T4 DNA ligase (Biolabs) and DDW in a total volume of 10µl. The reaction was then ethanol precipitated and resuspended in 10µl DDW for transformation into the Bluescript host E. coli XL-1 Blue (Stratagene) by electroporation (section 2.4.2.).

10x Ligation buffer
500mM Tris.Cl pH8.0
100mM MgCl2
200mM dithiothreitol (DTT)
500µg/ml BSA

10mg/ml BSA
Bovine serum albumin was dissolved at 10mg/ml in 10mM Tris.Cl pH7.5, 1mM EDTA
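The ratio of one vector terminus to two insert termini used in the ligation corresponds to a 1:2 molar ratio of vector to insert molecules, since each linear fragment has two ends. The mass of insert needed for a given mass of vector can be estimated as below. This is an illustrative sketch: the function name, the fragment sizes and the vector mass are hypothetical, and equal average mass per base pair is assumed for both molecules.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, insert_to_vector=2.0):
    """Mass of insert (ng) giving the chosen insert:vector molar ratio.

    Moles are proportional to mass divided by length, so the insert mass
    scales with the insert/vector length ratio and the molar ratio.
    """
    return vector_ng * (insert_bp / vector_bp) * insert_to_vector


# Hypothetical example: 50ng of a 3000bp vector and a 1500bp insert.
print(insert_mass_ng(50, 3000, 1500), "ng insert")
```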

2.4. Transformation of bacterial cells with DNA by electroporation

2.4.1. Preparation of E. coli cells for electroporation

5ml L-broth were inoculated with a single colony of the desired E. coli strain and shaken at 37°C overnight. 2.5ml of this culture were diluted into 500ml L-broth in a 2 litre flask and shaken at 37°C until OD600 was 0.5-0.6. The culture was then chilled in an ice-water bath for 15min, transferred to pre-chilled 500ml centrifuge bottles, and spun for 20min, 4000rpm, 2°C, in a Beckman J6B. The supernatant was poured off and the pellet resuspended in 5ml ice cold water. A further 500ml ice cold water were added before centrifuging at 10000rpm, 2°C, 20min. The supernatant was poured off quickly to avoid loss of the loose pellet. The cells were then resuspended and centrifuged as before. The supernatant was removed immediately and the pellet resuspended by swirling in the residual liquid. 40ml ice-cold 10% glycerol were added and the cells were pelleted at 10000rpm, 10min, 2°C. Finally the cells were resuspended in an equal volume of 10% glycerol, aliquotted into Eppendorfs (50µl each), frozen on dry ice and stored at -70°C (Ausubel et al., 1987).

2.4.2. Introduction of DNA into cells

DNA in ligation buffer was ethanol precipitated and resuspended in water before transformation. The salts in ligation mixes lowered the efficiency of transformation because the resistance of the sample was reduced to sub-optimal levels.

DNA (typically 1-100ng) was added to 50µl electroporation-competent cells and the mixture was pipetted into the bottom of a pre-chilled electroporation cuvette. This was placed in the sample chamber of the electroporation apparatus, which was set to 2.5kV, 25µF with the pulse controller adjusted to 200Ω. The pulse was applied and the cuvette removed. 1ml SOC medium was immediately added and the mixture pipetted into a fresh Eppendorf and incubated at 37°C for 30-60min.

Aliquots of the transformation mix were plated out on L-agar plates containing the appropriate antibiotic. When Bluescript plasmids were used, the plates were spread with 100µl (for an 82mm plate) IPTG and 40µl X-gal 30min before use. The efficiency of transformation by electroporation was so high for supercoiled plasmids (typically 10⁸-10⁹ transformants/µg) that extensive dilutions of transformation mixes were made in SOC in order to obtain single colonies (Ausubel et al., 1987).

SOC Medium
0.5% Bacto yeast extract
2% Bacto-tryptone
2.5mM KCl
10mM NaCl
10mM MgCl2
10mM MgSO4
20mM glucose
Sterilised by autoclaving.

IPTG 25mg/ml isopropyl thiogalactoside in DDW, stored at -20°C.

X-Gal 25mg/ml bromo-chloro-indolyl-β-D-galactoside in dimethyl formamide, stored at -20°C.
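The transformation efficiencies quoted in section 2.4.2. (transformants per µg of DNA) are obtained by relating the colony count on a plate to the amount of DNA represented by the plated aliquot. A minimal sketch of that arithmetic is given below; the function name and all of the example numbers are hypothetical.

```python
def transformation_efficiency(colonies, dna_ng, total_volume_ul,
                              plated_volume_ul, fold_dilution=1.0):
    """Transformants per µg of DNA.

    dna_ng: DNA added to the cells
    total_volume_ul: volume of the transformation mix after adding SOC
    plated_volume_ul: volume spread on the plate
    fold_dilution: dilution of the mix in SOC before plating
    """
    # Amount of DNA (µg) represented by the cells actually plated.
    dna_plated_ug = (dna_ng / 1000.0) * (plated_volume_ul / total_volume_ul) / fold_dilution
    return colonies / dna_plated_ug


# Hypothetical example: 1ng supercoiled plasmid, about 1000µl final volume,
# 100µl of a 1 in 1000 dilution plated, 200 colonies counted.
efficiency = transformation_efficiency(200, 1, 1000, 100, fold_dilution=1000)
print(f"{efficiency:.1e} transformants/µg")
```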

2.5. Preparation of DNA from transformed bacterial cells

2.5.1. Small scale preparation of plasmid or cosmid DNA by alkaline lysis

This technique was used for the rapid small-scale purification of plasmid or cosmid DNA from bacterial cells (Sambrook et al., 1989).

5ml L-broth were inoculated with a single bacterial colony and shaken overnight at 37°C. 1.5ml of each culture were transferred to an Eppendorf tube and microfuged for 1min. The supernatant was removed by aspiration with a drawn-out Pasteur pipette. The cell pellet was resuspended by vortexing in 100µl GTE and the tubes were left at room temperature for 5min. The cells were lysed by adding 200µl freshly prepared 0.2M NaOH/1% SDS which was mixed by inverting the tube. After standing on ice for 5min the bacterial chromosomal DNA was precipitated by adding 150µl KAc pH4.8, vortexing, and standing on ice for a further 5min. The chromosomal DNA was then pelleted by microfuging for 10min at 4°C, and the supernatant was transferred to a fresh Eppendorf tube. An equal volume of phenol/chloroform/iso-amyl alcohol (PCIA) was added and mixed thoroughly by vortexing, and the tubes were microfuged for 5min. The aqueous phase containing the plasmid DNA was transferred to a fresh Eppendorf tube. Two volumes of absolute ethanol were added and mixed before leaving at -20°C for 2hr. Following precipitation the DNA was pelleted by microfuging for 15min at 4°C. The supernatant was discarded before addition of 500µl 70% ethanol. The tubes were microfuged again and the ethanol removed by aspiration. The pellets were dried for 5min by vacuum desiccation and then resuspended in 20µl TE pH8.0 containing 100ng/ml DNase-free RNase. 2-5µl were analysed by restriction endonuclease digestion. This technique typically yielded 2µg cosmid DNA or 5µg plasmid DNA.

GTE 50mM glucose, 10mM EDTA, 25mM Tris.Cl pH8.0

KAc pH4.8 60ml 5M KAc, 11.5ml glacial acetic acid, 28.5ml DDW

PCIA Phenol (melted at 65°C) was mixed with chloroform and iso-amyl alcohol in the ratio 25:24:1 and buffered by equilibrating once with 1 volume 50mM Tris base, twice with 1 volume 50mM Tris.Cl pH8.0 and once with 1 volume TE pH8.0.

TE pH8.0 10mM Tris.Cl pH8.0, 1mM EDTA pH8.0

DNase-free RNase Pancreatic RNase A was dissolved in 15mM NaCl, 10mM Tris.Cl pH7.5 at 10mg/ml, heated to 100°C for 15min and cooled slowly to room temperature before storing at -20°C.
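As a check on the KAc pH4.8 recipe above, the final potassium and acetate concentrations can be worked out from the component volumes. This is a minimal sketch, assuming that glacial acetic acid is about 17.4M (a figure not given in the text) and that the volumes are additive.

```python
GLACIAL_ACETIC_ACID_M = 17.4  # ASSUMPTION: approximate molarity of glacial acetic acid


def kac_solution(kac_ml=60.0, kac_molar=5.0, acid_ml=11.5, water_ml=28.5):
    """Return (total volume in ml, potassium molarity, acetate molarity)."""
    total_ml = kac_ml + acid_ml + water_ml
    potassium_mmol = kac_ml * kac_molar
    # Acetate comes from both the potassium acetate and the acetic acid.
    acetate_mmol = potassium_mmol + acid_ml * GLACIAL_ACETIC_ACID_M
    return total_ml, potassium_mmol / total_ml, acetate_mmol / total_ml


total_ml, potassium_m, acetate_m = kac_solution()
print(f"{total_ml:.0f}ml: {potassium_m:.1f}M potassium, {acetate_m:.1f}M acetate")
```

On these assumptions the recipe gives 100ml of a solution that is roughly 3M with respect to potassium and 5M with respect to acetate.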

2.5.2. Large scale preparation of plasmid or cosmid DNA by alkaline lysis

This technique was used to prepare milligram-scale quantities of high purity plasmid or cosmid DNA (Sambrook et al., 1989).

5ml L-broth were inoculated with a single bacterial colony and grown during the day (or overnight) at 37°C with vigorous shaking. This culture was then diluted into 400ml L-broth in a 2 litre flask (or split 2x 200ml in 1 litre flasks) and shaken at 37°C overnight. Bacterial cells were pelleted in a 500ml centrifuge bottle by centrifuging at 6000rpm for 10min and then resuspended thoroughly in 20ml GTE containing 4mg/ml lysozyme. This suspension was left at room temperature for 5min before adding 40ml 0.2M NaOH/0.1% SDS, mixing gently, and incubating for a further 5min on ice. 20ml 5M KAc pH4.8 were then added and mixed by vortexing. This mixture was incubated for 15min on ice and centrifuged for 15min at 8000rpm. The supernatant was filtered through a nylon tea-strainer into a 200ml glass bottle, and the nucleic acid precipitated by adding 0.6 volumes of propan-2-ol and incubating at -20°C for 2hr. The precipitate was pelleted by spinning at 2000rpm for 20min at 4°C and the pellet was resuspended in 5ml RNasing buffer. The solution was transferred to a 50ml Falcon tube before adding 10µl 10mg/ml RNase and incubating at 37°C for 15min. 200µl 20mg/ml proteinase K and 125µl 20% SDS were then added and the incubation continued at 37°C for a further 30min. An equal volume of PCIA was added and mixed thoroughly by vortexing. The phases were separated by centrifuging at 2000rpm for 15min, and the aqueous phase was transferred to a 30ml Corex tube. 0.1 volumes 3M NaAc pH5.2 and 2 volumes absolute ethanol were added and the mixture was stored at -20°C for 2hr. The DNA was pelleted by centrifugation at 10000rpm for 20min at 4°C and resuspended in 1ml A-50 buffer. The plasmid or cosmid DNA was then separated from contaminating bacterial DNA and RNA by fractionation through a 30ml A-50 biogel (Bio-Rad) column. Fractions contributing to the first OD260 peak were pooled in a 30ml Corex tube and ethanol precipitated. The DNA was pelleted by spinning at 10000rpm for 20min at 4°C, the supernatant was poured off and the pellet was washed in 5ml 70% ethanol before centrifuging again. The pellet was dried under vacuum and resuspended in 500µl TE pH8.0.

RNasing buffer 0.1M NaCl, 5mM EDTA, 0.1M Tris.Cl pH8.0

Proteinase K Fungal proteinase K (BDH) was dissolved at 20mg/ml in 20mM Tris.Cl pH8.0, self-digested at 37°C for 1hr, and frozen at -20°C.

Biogel A-50 buffer 0.5M NaCl, 25mM Tris.Cl pH8.0, 1mM EDTA

2.6. Small scale preparation of bacteriophage λ DNA

100µl phage were mixed with 100µl 10mM CaCl2, 10mM MgSO4 and 100µl of a saturated culture of host bacterial cells and incubated for 20min at 37°C. This mixture was diluted in 50ml L-broth containing 10mM MgSO4, 0.5% casamino acids and 0.2% maltose and shaken overnight in a 200ml conical flask at 37°C. 20ml aliquots were transferred to Falcon tubes and centrifuged at 4000rpm for 15min to pellet the debris of lysed bacterial cells. 15ml of supernatant were transferred to a fresh tube and incubated with 10µl 10mg/ml DNase for 15min at 37°C to degrade bacterial DNA. 5ml of AS buffer were then added and the mixture heated to 70°C for 15min to denature the phage capsids. The tube was cooled under the cold tap and 1.7ml KAc pH4.8 were added before centrifugation at 4000rpm for 15min. The supernatant was filtered through a tea strainer into a fresh Falcon tube, and 15ml propan-2-ol were added before centrifugation at 4000rpm for 15min. The supernatant was poured off and the pellet resuspended in 200µl 0.3M NaAc. The phage DNA was transferred to a microfuge tube, treated with RNase for 15min at 37°C, and extracted with PCIA. The aqueous phase was transferred to a fresh microfuge tube and precipitated with 2 volumes absolute ethanol at -20°C. The DNA was pelleted by microfugation and washed with 70% ethanol. Following a second spin the supernatant was removed by aspiration, the pellet was dried for 1-2min under vacuum and the DNA was resuspended in 50-100µl DDW. 10µl were analysed by restriction enzyme digestion.

AS buffer 0.3M Tris.Cl pH8.0, 0.15M EDTA, 1.5% SDS

2.7. Preparation of eukaryotic DNA from cells in culture

Many of the human genomic DNA samples used to prepare Southern blots in this study were a gift from Susan Tonks (Tissue Antigen Laboratory, ICRF). Genomic DNA samples from other vertebrate species for the preparation of 'zoo' blots were obtained from Dr. Nigel Spurr (Clare Hall Laboratories, ICRF). Additional genomic DNA samples were prepared using the following protocol (Susan Tonks, personal communication).

Cells were grown in RPMI medium supplemented with 10% foetal calf serum (Gibco) at 37°C in 5% CO2. For DNA preparation, 10⁸-10⁹ cells were pelleted at 3000rpm for 5min and resuspended in 10ml PBSA. This procedure was repeated. 90ml sucrose-triton were then added and the mixture centrifuged at 10000rpm for 10min. The supernatant was discarded and the pellet resuspended in 20ml 75mM NaCl, 24mM EDTA pH8.0. 40µl 10mg/ml RNase, 500µl 20% SDS and 200µl 20mg/ml proteinase K were added and the mixture was incubated at 37°C for 3hr. An equal volume of buffered phenol was added and mixed thoroughly by shaking. The phases were separated by spinning at 2000rpm for 5min and the aqueous phase was transferred to a fresh tube. The process was repeated using PCIA until the aqueous phase was clear. The aqueous phase was then extracted with chloroform/isoamyl alcohol (24:1) and transferred to a beaker. 0.1 volumes 0.5M NaCl and 2 volumes of absolute ethanol were then added and the precipitated DNA strands were spooled onto a Pasteur pipette and transferred to an Eppendorf. 1ml TE was then added and the DNA left to dissolve at 4°C. The OD was measured at 260 and 280nm to estimate the ratio of DNA to protein. If OD260/OD280 <1.75, the sample was incubated again with proteinase K and re-extracted. A typical yield from 10⁸ cells was 200µg DNA.
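The purity check and yield estimate described above can be sketched as follows. The conversion factor of 50µg/ml of double-stranded DNA per OD260 unit (1cm path length) is the usual convention but is an assumption here, since the text does not state how concentration was derived; the example readings and the 20-fold dilution are hypothetical.

```python
UG_PER_ML_PER_OD260 = 50.0  # ASSUMPTION: conventional factor for double-stranded DNA
MIN_RATIO = 1.75            # threshold quoted in the protocol


def assess_dna(od260, od280, dilution=1.0, volume_ml=1.0):
    """Summarise purity and yield from OD readings of a diluted sample."""
    ratio = od260 / od280
    conc_ug_per_ml = od260 * UG_PER_ML_PER_OD260 * dilution
    return {
        "ratio": ratio,
        "conc_ug_per_ml": conc_ug_per_ml,
        "yield_ug": conc_ug_per_ml * volume_ml,
        # Below the threshold the sample goes back to proteinase K and re-extraction.
        "re_extract": ratio < MIN_RATIO,
    }


# Hypothetical readings on a 1:20 dilution of DNA dissolved in 1ml TE.
result = assess_dna(od260=0.20, od280=0.11, dilution=20, volume_ml=1.0)
print(result)
```

With these hypothetical readings the ratio is about 1.8 and the estimated yield is about 200µg, in line with the typical yield quoted above.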

PBSA 171mM NaCl, 3.4mM KCl, 10mM Na2HPO4, 1.8mM KH2PO4

Sucrose-Triton 0.33M sucrose, 10mM Tris.Cl pH7.5, 5mM MgCl2, 1% Triton-X 100

2.8. Preparation of very high molecular weight DNA for pulsed field gel electrophoresis

DNA for pulsed field gel electrophoresis (PFGE) was prepared in the form of cells encased in blocks of low melting point agarose (LMPA) to avoid shearing (Dr. Denise Barlow, personal communication). Each LMPA block contained 3x10⁶ cells, enough for three digests.

Approximately 10⁸ cells were pelleted at 1000rpm for 5min, the supernatant was removed by aspiration and the cells resuspended in 10ml PBSA. The cells were counted using a haemocytometer, then pelleted as before and resuspended in a further 10ml PBSA. Finally, the cells were pelleted again and resuspended in 50µl PBSA per 3x10⁶ cells. An equal volume of 1% LMPA in PBSA, melted and equilibrated at 42°C, was added to the cells and mixed. 100µl aliquots were pipetted into the slots of pre-chilled block formers (LKB) which had previously been soaked in 10% hydrogen peroxide for 30min to remove all traces of nucleases. The block formers were left on ice for 30min until the agarose had set. The blocks were then transferred to a Falcon tube containing 2.5ml 0.5M EDTA, 1% sodium laurylsarcosine and 1mg/ml proteinase K, and incubated at 50°C for 48hr. A second aliquot of proteinase K was added about half way through the incubation period. The tube was then topped up with TE pH8.0 and inverted several times to wash the blocks. The blocks were separated from the TE by passing through a tea strainer, then returned to the Falcon tube. This process was repeated three times with fresh TE. The TE was then replaced with TE containing 0.04mg/ml PMSF (freshly prepared by dissolving 40mg/ml in propan-2-ol) and incubated at 50°C for 30min to inactivate any residual protease activity. This step was repeated and the blocks were then stored in 0.5M EDTA pH8.0 at 4°C until required.
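The volumes involved in casting blocks follow directly from the haemocytometer count. The sketch below is a hypothetical helper, not part of the original protocol; it assumes that the cell pellet itself contributes negligible volume.

```python
CELLS_PER_BLOCK = 3e6     # cells per 100ul block, from the protocol
PBSA_UL_PER_BLOCK = 50.0  # PBSA used to resuspend 3x10^6 cells
BLOCK_UL = 100.0          # aliquot pipetted into each block former slot
DIGESTS_PER_BLOCK = 3     # each block is later cut into thirds
LMPA_STOCK_PCT = 1.0      # LMPA stock concentration in PBSA


def plan_blocks(cell_count):
    """Volumes and block numbers for a given haemocytometer count."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    pbsa_ul = cell_count / CELLS_PER_BLOCK * PBSA_UL_PER_BLOCK
    lmpa_ul = pbsa_ul  # an equal volume of molten 1% LMPA is added
    total_ul = pbsa_ul + lmpa_ul
    n_blocks = int(total_ul // BLOCK_UL)  # only whole blocks are cast
    return {
        "pbsa_ul": pbsa_ul,
        "lmpa_ul": lmpa_ul,
        "blocks": n_blocks,
        "digests": n_blocks * DIGESTS_PER_BLOCK,
        "final_agarose_pct": LMPA_STOCK_PCT * lmpa_ul / total_ul,
    }


plan = plan_blocks(1e8)
print(plan)
```

For 10⁸ cells this gives about 1.67ml each of PBSA and LMPA, 33 whole blocks at a final agarose concentration of 0.5%, and material for 99 digests.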

2.9. Preparation of RNA

Preparation and analysis of RNA was performed with the assistance of Dr. Ruth Lovering.

2.9.1. Isolation of total RNA from cells in culture

Cell lines were cultured in the Cell Production Unit, ICRF. RNA was isolated using RNase-free reagents, plasticware and glassware (Sambrook et al., 1989). Cells were pelleted by spinning at 2000rpm for 5min. The pellet volume was estimated and 5 volumes of guanidinium thiocyanate homogenisation buffer were added. Homogenisation was effected by drawing the lysate ten times into a syringe fitted with a 19-gauge needle. The homogenate was then layered onto a cushion of 5.7M CsCl, 0.1M EDTA in a polyallomer ultracentrifuge tube. The volume of the homogenate determined the size of the rotor used and the amount of CsCl required, as shown below.

Rotor   Homogenate volume (ml)   CsCl (ml)   Spin time (hours)   Speed (rpm)
SW60    1.2                      3.1         12                  40 000
SW41    3.5                      9.7         24                  32 000
SW28    12.0                     26.5        24                  25 000

After centrifugation, the liquid in the tube except the last 500µl was carefully removed with a pipette and the curved base of the tube, containing the RNA pellet, was cut off with a scalpel. The remaining supernatant in the base of the tube was carefully pipetted off and the pellet was resuspended in 500µl 10mM Tris.Cl pH7.5, 5mM EDTA, 1% SDS. This was extracted twice with PCIA and the RNA was precipitated from the aqueous phase by adding 0.1 volumes 3M NaAc pH5.2 and 2 volumes absolute ethanol, and storing at -20°C overnight. The RNA was pelleted by microfuging at 4°C for 5min and the pellet was washed in 70% ethanol. After re-spinning the supernatant was removed and the pellet dried under vacuum. The RNA was resuspended in 200µl DDW.

Homogenisation buffer 5M guanidinium thiocyanate, 5mM sodium citrate, 1% Sarkosyl, 0.7% β-mercaptoethanol

To test for γ-interferon inducible gene expression in colon carcinoma cell lines, cells were incubated with 300 units of γ-interferon/ml for 36-48hr before RNA extraction (experiment performed by Adrian Kelly, this laboratory).

2.9.2. Isolation of poly(A)+ RNA

Polyadenylated RNA was purified from total RNA by oligo-dT cellulose chromatography using Fast-track reagents (Invitrogen). Each total RNA sample was made 0.5M with NaCl and incubated with pre-equilibrated oligo-dT cellulose for 30min at room temperature. The mixture was then transferred to a spin-column and washed three times with high-salt binding buffer to remove non-polyadenylated RNA. Poly(A)+ RNA was then eluted from the column in a low-salt buffer and mixed with 0.15 volumes 2M NaAc and 2 volumes absolute ethanol. The mixture was frozen on dry ice to precipitate the RNA and then microfuged for 15min. The supernatant was removed and the pellet resuspended in low-salt buffer before storing the sample at -80°C.

2.10. Restriction endonuclease digestion of DNA

For routine mapping of plasmid, cosmid or bacteriophage DNA with restriction endonucleases, 100-500ng were typically digested in a volume of 20µl with a 2-5 fold excess of enzyme for 1-2hr. When specific fragments were required for subcloning or probe preparation, the digest was scaled up accordingly.

For preparation of genomic Southern blots, 10µg genomic DNA were typically digested overnight in a volume of 50-100µl with a 5-fold excess of enzyme. A tenth of the digest was resolved on a minigel to monitor the extent of digestion. Incomplete digests were usually diluted into an equal volume of 1x restriction buffer and supplemented with more enzyme before continuing the incubation.

For preparation of PFGE blots, blocks (section 2.8) were washed free of EDTA storage buffer by rocking in 50ml TE at room temperature (2x 30min) followed by 30min in DDW. Blocks were then cut with a scalpel into thirds, each third containing enough DNA for one digest (10µg). These were pre-equilibrated in 1ml of the appropriate buffer before digestion with a 5-fold excess of enzyme in a total volume of 100µl.

All digests were carried out in the presence of 100mg/ml BSA. The volume of enzyme added was not allowed to exceed 10% of the total volume. Genomic digests were supplemented with 5mM spermidine if the salt concentration in the buffer exceeded 50mM. Digests were incubated at the optimum temperature as recommended by the manufacturer. Overnight digests involving enzymes from thermophilic bacteria (e.g. TaqI, BssHII) were covered with a layer of paraffin oil to prevent evaporation during incubation.
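The digest rules above (a fold excess of enzyme, with the enzyme never exceeding 10% of the reaction volume) can be checked with a short calculation. This sketch assumes the usual unit definition, in which one unit digests 1µg DNA in 1hr, so that an n-fold excess means n units per µg; the enzyme stock concentration in the example and the function name are hypothetical.

```python
MAX_ENZYME_FRACTION = 0.10  # enzyme volume limit stated in the protocol


def plan_digest(dna_ug, total_ul, enzyme_u_per_ul, fold_excess=5.0):
    """Return enzyme and 10x buffer volumes for a restriction digest."""
    # ASSUMPTION: 1 unit digests 1ug DNA, so n-fold excess = n units per ug.
    units = dna_ug * fold_excess
    enzyme_ul = units / enzyme_u_per_ul
    if enzyme_ul > MAX_ENZYME_FRACTION * total_ul:
        raise ValueError("enzyme exceeds 10% of the reaction volume; scale up the digest")
    return {
        "units": units,
        "enzyme_ul": enzyme_ul,
        "buffer_10x_ul": total_ul / 10.0,
    }


# Genomic digest: 10ug DNA in 100ul with a hypothetical 10U/ul enzyme stock.
digest = plan_digest(dna_ug=10, total_ul=100, enzyme_u_per_ul=10)
print(digest)
```

In this example 50 units are needed, which is 5µl of enzyme, within the 10µl limit for a 100µl reaction.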

Four basic 10x buffers were prepared which were supplemented with additional salts if required, according to the manufacturers' recommendations.

10x L Buffer 100mM Tris.Cl pH7.5, 100mM MgCl2, 10mM DTT

10x M Buffer 500mM NaCl, 100mM Tris.Cl pH7.5, 100mM MgCl2, 10mM DTT

10x H Buffer 1M NaCl, 100mM Tris.Cl pH7.5, 100mM MgCl2, 10mM DTT

10x VH Buffer 1.5M NaCl, 100mM Tris.Cl pH7.5, 100mM MgCl2, 10mM DTT
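The working NaCl concentration of each buffer, and hence which genomic digests would call for spermidine, follows from the 10x recipes. This is a minimal sketch that treats "salt concentration" as the NaCl concentration at 1x, which is an interpretation rather than something the text states; any supplementary salts are ignored.

```python
# NaCl concentration (mM) in each 10x stock, from the recipes above.
NACL_10X_MM = {"L": 0.0, "M": 500.0, "H": 1000.0, "VH": 1500.0}
SPERMIDINE_THRESHOLD_MM = 50.0


def working_nacl_mm(buffer_name):
    """NaCl concentration after diluting the 10x stock to 1x."""
    return NACL_10X_MM[buffer_name] / 10.0


def needs_spermidine(buffer_name, genomic=True):
    """Spermidine is added to genomic digests when salt exceeds 50mM."""
    return genomic and working_nacl_mm(buffer_name) > SPERMIDINE_THRESHOLD_MM


with_spermidine = [name for name in NACL_10X_MM if needs_spermidine(name)]
print(with_spermidine)
```

Read this way, genomic digests in H and VH buffers would be supplemented with spermidine, while M buffer (exactly 50mM NaCl at 1x) would not.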

2.11. Electrophoresis of DNA and RNA

2.11.1. Conventional agarose gel electrophoresis

All DNA samples were mixed with 0.2 volumes of the appropriate loading buffer before electrophoresis. Routine restriction endonuclease digests of plasmid, cosmid and bacteriophage DNA were resolved on 0.8% agarose gels in 0.5x TBE buffer containing 0.5µg/ml ethidium bromide using Bio-Rad Mini-Sub minigel apparatus (gel length 10cm). Digests of genomic DNA were resolved using Bio-Rad DNA Sub-Cell apparatus (gel length 20cm) in 1x TAE buffer, typically using 0.8% agarose gels which were ideal for most purposes. The agarose concentration was increased up to 1.2% for the resolution of smaller fragments. DNA size markers were HindIII-cut bacteriophage λ, BstEII-cut λ or HaeIII-cut φX174, depending on the size range of interest.

10x TBE 108g/l Tris base, 55g/l boric acid, 20mM EDTA

10x TAE 48.4g/l Tris base, 11.4ml/l glacial acetic acid, 20mM EDTA

5x Loading buffer 50% glycerol, 60mM EDTA, 0.25% bromophenol blue, 5x TBE or TAE

2.11.2. Pulsed field gradient gel electrophoresis

PFGE was performed using Pulsaphor apparatus (LKB). 250ml 0.8% agarose in 0.25x TBE were poured into the 20x20cm casting frame supplied. The 16-well LKB comb was used to form the slots into which the digested blocks were inserted. The blocks were sealed into the slots with molten 1% LMPA. Size markers of yeast (S. cerevisiae strain YP148, a gift from Denise Barlow, Genome Analysis Laboratory, ICRF) and concatemers of bacteriophage λ DNA were used. The gel tank contained 2.5 litres of 0.25x TBE buffer which was recirculated at 10°C. The electrodes were positioned in the standard double inhomogeneous field configuration as described in the LKB manual. The gel was run at 330V with a constant pulse time of 45-60sec for 22-26hr, which resolved molecules of up to 900kb.

2.11.3. Electrophoresis of RNA

RNA was resolved in formaldehyde agarose gels. 3g agarose, 20ml 10x MOPS buffer and 150ml DDW were mixed and boiled in a microwave. The gel was cooled to 50°C and 24ml 38% formaldehyde were added before pouring. RNA samples (10-15µg total RNA or 1.5-3µg poly(A)+ RNA) were desiccated under vacuum and resuspended in 5µl deionised formamide, 1µl 10x MOPS buffer, 1.6µl 38% formaldehyde and 1.4µl DDW. The samples were heated to 65°C for 5min and cooled on ice. 1µl loading dye (50% glycerol, 0.1% bromophenol blue) was added and the samples were loaded on the gel. Electrophoresis was carried out at 30-40mA overnight.
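The final composition of the formaldehyde gel and of the denatured RNA sample can be derived from the volumes given above. The sketch below assumes that volumes are additive and neglects both the volume of the agarose and any evaporation on boiling; the function names are illustrative.

```python
FORMALDEHYDE_STOCK_PCT = 38.0  # stock concentration given in the protocol


def gel_composition(agarose_g=3.0, mops_10x_ml=20.0, water_ml=150.0, formaldehyde_ml=24.0):
    """Final concentrations in the formaldehyde agarose gel."""
    total_ml = mops_10x_ml + water_ml + formaldehyde_ml
    return {
        "total_ml": total_ml,
        "agarose_pct": 100.0 * agarose_g / total_ml,
        "mops_x": 10.0 * mops_10x_ml / total_ml,
        "formaldehyde_pct": FORMALDEHYDE_STOCK_PCT * formaldehyde_ml / total_ml,
    }


def sample_composition(formamide_ul=5.0, mops_10x_ul=1.0, formaldehyde_ul=1.6, water_ul=1.4):
    """Final concentrations in the RNA sample before loading dye is added."""
    total_ul = formamide_ul + mops_10x_ul + formaldehyde_ul + water_ul
    return {
        "total_ul": total_ul,
        "formamide_pct": 100.0 * formamide_ul / total_ul,
        "mops_x": 10.0 * mops_10x_ul / total_ul,
        "formaldehyde_pct": FORMALDEHYDE_STOCK_PCT * formaldehyde_ul / total_ul,
    }


gel = gel_composition()
sample = sample_composition()
print(gel)
print(sample)
```

On these assumptions the gel is about 1.5% agarose in roughly 1x MOPS with just under 5% formaldehyde, and each 9µl sample contains about 56% formamide.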

2.12. Preparation of DNA and RNA blots

Nucleic acids were routinely transferred to nylon membranes (Hybond- N, Amersham) using blotting apparatus and protocols described in Sambrook et al. (1989).

2.12.1. Preparation of DNA blots

Following electrophoresis, gels which had not been run in the presence of ethidium bromide (e.g. pulsed field gels) were stained by immersing in running buffer containing 0.5µg/ml ethidium bromide for 1hr. The gel was destained by immersing in running buffer or DDW without ethidium bromide for 30-60min. A photograph was then taken before the gel was soaked in 0.15M HCl for 2x 10min to partially depurinate the DNA. (This treatment resulted in more efficient transfer of high molecular weight DNA molecules to the filter, because the depurinated sites are cleaved during the denaturation step, thus fragmenting the long molecules.) Following a rinse in DDW, the gel was soaked in denaturing solution for 2x 20min and then in neutralising solution for 2x 20min. The gel was then assembled in a capillary blotting apparatus containing 20x SSC and transfer of nucleic acid to the membrane was allowed to proceed overnight (genomic Southerns) or for 48hr (PFGE gels). When blots of minigels containing cosmid, plasmid or phage DNA were being prepared, the acid treatment step was omitted and each denaturation or neutralisation step was reduced to 10min. Up to four blots were made from a single minigel by allowing transfer to proceed for 15min each time before changing the filter.

After transfer, filters were rinsed by immersing in 2x SSC and air dried. The DNA was then fixed to the membrane by baking for 2hr at 80°C or by UV cross linking at 0.4J/cm².

Denaturing solution 0.5M NaOH, 1.5M NaCl

Neutralising solution 0.5M Tris.Cl pH7.5, 1.5M NaCl, 1mM EDTA

2.12.2. Preparation of RNA blots

Following electrophoresis, northern gels were stained for 30min in 1µg/ml ethidium bromide and destained in water for 1hr before photography. Gels were then rinsed in 20x SSC and blotted onto Hybond-N overnight in 20x SSC. After RNA transfer, filters were baked for 2hr at 80°C in a vacuum oven and UV cross linked at 0.4J/cm².

RNA loading was judged by hybridising northern blots with the probe pEDl, a cDNA subclone of the esterase-D gene ESD, which is ubiquitously expressed in a cell-cycle independent fashion (Squire et al., 1986).

2.13. Preparation of DNA probes

Probe DNA fragments were purified using Geneclean (Bio 101) from normal agarose gels run in 1x TAE buffer, or excised directly from low melting point agarose (LMPA) gels run in 0.5x TBE buffer.

Geneclean-purified DNA was labelled to high specific activity with [α-32P]dCTP by random hexamer priming (Feinberg and Vogelstein, 1983). For Genecleaned fragments, DDW was added to 10-50ng probe DNA to give a final volume of 33µl and boiled for 3min. The mixture was briefly cooled on ice before the addition of 10µl OLB, 2µl 10mg/ml BSA, 2.5µl [α-32P]dCTP (10µCi/µl) and 2.5µl Klenow fragment (8U/µl, BRL).
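The component volumes listed above add up to a 50µl labelling reaction. The following sketch simply tallies them and derives the amounts of label, enzyme and BSA per reaction; it uses only the figures given in the text, and the dictionary keys are illustrative names.

```python
# Component volumes (ul) for one random-priming reaction, from the protocol.
REACTION_UL = {
    "probe DNA in DDW": 33.0,
    "OLB": 10.0,
    "BSA 10mg/ml": 2.0,
    "[alpha-32P]dCTP 10uCi/ul": 2.5,
    "Klenow 8U/ul": 2.5,
}

total_ul = sum(REACTION_UL.values())
label_uci = REACTION_UL["[alpha-32P]dCTP 10uCi/ul"] * 10.0  # total radioactivity added
klenow_units = REACTION_UL["Klenow 8U/ul"] * 8.0            # total enzyme added
bsa_mg_per_ml = REACTION_UL["BSA 10mg/ml"] * 10.0 / total_ul
olb_fraction = REACTION_UL["OLB"] / total_ul

print(f"{total_ul:.0f}ul reaction: {label_uci:.0f}uCi label, {klenow_units:.0f}U Klenow, "
      f"{bsa_mg_per_ml:.1f}mg/ml BSA, OLB at {olb_fraction:.0%} of the volume")
```

OLB therefore makes up one fifth of the reaction, that is, it is used as a 5x buffer.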

For probes isolated on LMPA gels, the required fragment was cut out with a scalpel over a UV transilluminator and the resulting gel slice boiled with 3 volumes DDW for 7min. After cooling to 37°C, aliquots containing 50ng DNA were transferred to fresh Eppendorfs and the volume adjusted to 33µl with DDW. One aliquot was then used for the labelling reaction as described above (Feinberg and Vogelstein, 1984).

The reaction was typically incubated for 6hr at room temperature. After this time, probe DNA was separated from unincorporated nucleotides by passing the reaction mix through a TES-equilibrated G-50 Sephadex column prepared in a Pasteur pipette plugged with glass wool. The first radioactive peak was collected and assayed. Labelled probes were boiled for 5min before addition to hybridisation bags (section 2.14.).

Probes containing repetitive DNA sequences were competed before use by adding 100µg sonicated human carrier DNA per 50ng labelled probe in a final concentration of 5x SSC, boiling for 5min, and incubating at 65°C for 100min before addition to the hybridisation bag.

OLB 100µl Solution A, 250µl Solution B, 150µl Solution C

Solution A 1.25M Tris.Cl pH7.5, 0.125M MgCl2, 18µl β-mercaptoethanol, 5µl each 0.1M dATP, dTTP, dGTP (Pharmacia). (Solutions of nucleotides were prepared in 3mM Tris.Cl pH7.5, 0.2mM EDTA.)

Solution B 2M HEPES adjusted to pH7.4 with NaOH

Solution C Hexadeoxyribonucleotides (Pharmacia) suspended at 90 OD260 units/ml in 10mM Tris.Cl, 1mM EDTA.

TES 10mM Tris.Cl pH8.0, 10mM EDTA, 0.5% SDS

2.14. Hybridisation of blots

2.14.1. Hybridisation of DNA blots

Genomic blots were prehybridised at 65°C for 5-8hr in 6x SSC, 10% dextran sulphate, 5x Denhardt's solution, 0.5% SDS and 100µg/ml heat denatured sonicated salmon sperm DNA. Hybridisation was carried out at 65°C overnight in fresh buffer with the addition of 10⁶ cpm probe per ml (Sambrook et al., 1989).

Cosmid blot filters and recombinant DNA library filters were hybridised in the same way except that the probe concentration was reduced to 5x10⁵ cpm/ml.
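Given the activity of the labelled probe recovered from the column, the volume needed to reach these concentrations is a simple proportion. In the sketch below the hybridisation volume (10ml) and probe activity (2x10⁵ cpm/µl) are hypothetical example values, as is the function name.

```python
# Target probe concentrations (cpm per ml of hybridisation buffer), from the text.
TARGET_CPM_PER_ML = {"genomic": 1e6, "cosmid_or_library": 5e5}


def probe_volume_ul(blot_type, hybridisation_ml, probe_cpm_per_ul):
    """Volume of labelled probe to add to the hybridisation bag."""
    total_cpm = TARGET_CPM_PER_ML[blot_type] * hybridisation_ml
    return total_cpm / probe_cpm_per_ul


genomic_ul = probe_volume_ul("genomic", hybridisation_ml=10, probe_cpm_per_ul=2e5)
cosmid_ul = probe_volume_ul("cosmid_or_library", hybridisation_ml=10, probe_cpm_per_ul=2e5)
print(genomic_ul, cosmid_ul)
```

With these example values a genomic blot would take 50µl of probe and a cosmid or library filter 25µl.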

In hybridisations using a human genomic or cDNA probe on human DNA, filters were washed for 20min at 65°C in 2x SSC, 0.5% SDS followed by 20min at 65°C in 0.1x SSC, 0.5% SDS. These washing conditions were used unless otherwise stated. When a cross-species hybridisation was being performed, lower washing stringencies were used as described in the appropriate places in the text. A typical final wash, determined empirically, was 6x SSC at 65°C. Filters were exposed to Kodak XAR-5 autoradiography film at -80°C between intensifying screens.

100x Denhardt's solution 2% (w/v) BSA, 2% (w/v) Ficoll, 2% (w/v) polyvinylpyrrolidone

2.14.2. Hybridisation of RNA blots

Northern blots were prehybridised at 42°C for 1hr in a buffer containing 50% deionised formamide, 6x SSPE, 5x Denhardt's solution and 100µg/ml sheared salmon sperm DNA. Hybridisation was carried out at 42°C overnight in fresh buffer with the addition of 10⁶ cpm probe per ml.

When a human cDNA probe was hybridised to a human northern blot, filters were washed at 65°C for 20min in 2x SSPE, 0.5% SDS followed by 20min at 65°C in 0.1x SSPE, 0.5% SDS. When a human genomic fragment containing an unknown amount of exon was used as the probe, the washing stringency was reduced. A typical final wash, determined empirically, was 2x SSPE at 50°C.

20x SSPE 3.6M NaCl, 0.2M sodium phosphate pH7.7, 0.02M EDTA

2.15. DNA sequencing

DNA was sequenced by the chain-termination method (Sanger et al., 1977) using Sequenase protocols and reagents (USB). The Sequenase enzyme used was a modified bacteriophage T7 DNA polymerase. Sequencing was performed on double stranded DNA obtained from maxipreps or minipreps. DNA prepared by alkaline lysis minipreps was found to be clean enough for sequencing, but the RNA had to be removed first in order to accurately determine the concentration of DNA present. Oligonucleotide primers other than those for M13-based vectors were made in the Oligonucleotide Synthesis Laboratory, ICRF.

2.15.1. Sequencing reaction

2µg template DNA, 2µl 20mM EDTA and 2µl 2M NaOH were mixed in a microfuge tube with DDW to bring the final volume to 20µl. The mixture was incubated for 5min at room temperature before neutralising with 3µl 3M NaAc pH5.2 and 7µl DDW. 75µl absolute ethanol were then added and the DNA precipitated at -70°C for 15min. The DNA was pelleted by microfuging at 4°C for 10min, washed in 100µl 70% ethanol, respun, and the supernatant removed by aspiration with a drawn-out Pasteur pipette. The pellet was briefly dried under vacuum and resuspended in 7µl DDW and 2µl 5x reaction buffer. 0.5pmol primer were added to the resuspended denatured template and the two were annealed by incubating the mixture at 65°C for 2min and then cooling slowly to <35°C. The primer was extended by adding to the template/primer mix (10µl): 1µl 0.1M DTT, 2µl 1x labelling mix, 0.5µl [α-35S]dATP and 2µl Sequenase enzyme diluted 1:7 in TE pH8.0. The reaction was incubated at room temperature for 5min. 3.5µl were then transferred to each of four prewarmed tubes, each containing 2.5µl of one of the termination mixes. The reactions were incubated at 37°C for 5min, during which time dideoxynucleotides were incorporated into the extended DNA molecules. 4µl stop solution were then added to each tube. The samples were denatured at 75°C for 2min before resolving 2.5µl of each on a sequencing gel. The remainder of each sample was stored at -20°C.

To resolve compressions sometimes observed on the gel due to secondary structures in regions rich in dG and dC, the sequencing reaction was performed with the substitution of dITP for dGTP (and ddITP for ddGTP). dITP forms weaker secondary structures than dGTP and these are more readily denatured during electrophoresis, resulting in improved resolution in the compressed areas.

5x Reaction buffer 200mM Tris.Cl pH7.5, 100mM MgCl2, 250mM NaCl

5x Labelling mix 7.5µM each of dGTP, dCTP and dTTP. Diluted to 1x with DDW.

ddG Termination mix 80µM each of dGTP, dATP, dCTP and dTTP, 8µM ddGTP, 50mM NaCl

ddA Termination mix 80µM each of dGTP, dATP, dCTP and dTTP, 8µM ddATP, 50mM NaCl

ddC Termination mix 80µM each of dGTP, dATP, dCTP and dTTP, 8µM ddCTP, 50mM NaCl

ddT Termination mix 80µM each of dGTP, dATP, dCTP and dTTP, 8µM ddTTP, 50mM NaCl

Stop solution 95% formamide, 20mM EDTA, 0.05% bromophenol blue, 0.05% xylene cyanol

2.15.2. Denaturing polyacrylamide gel electrophoresis

Sequencing gel electrophoresis was performed using Koch-Light apparatus and gel plates. Before each run, the gel plates (40x20cm) were washed in warm soapy water, rinsed and dried. They were cleaned by wiping first with ethanol, then acetone, and finally DDW. The back plate was siliconised with Repelcote. 0.4mm spacers were positioned between the plates and the sandwich fastened together with bulldog clips. 50ml polyacrylamide gel mix were prepared and injected between the plates with a syringe. The comb was positioned and the gel was allowed to set. The gel was then assembled in the tank and warmed by electrophoresing in 1x STBE for at least 20min at 4kW. Excess urea was syringed from the loading area and 2.5µl of each sample were then loaded. A run time of about 100min resolved sequences near the primer. Longer runs of up to 4hr allowed sequences over 200 nucleotides from the primer to be read.

Following the run, the back plate was carefully lifted off and the gel (on the front plate) was fixed in a tray containing 10% glacial acetic acid, 10% methanol for 15min. The gel was then drained, transferred to a sheet of Whatman 3MM paper, covered in Saran wrap and dried at 80°C for 30min on a Bio-Rad slab drier. Autoradiography was performed overnight using Kodak XAR-5 film at room temperature. Sequences were compiled using Intelligenetics GEL and SEQ software. Comparisons of nucleotide and predicted amino acid sequences to the databases were performed using the Intelligenetics IFIND program.

Gel mix 31.5ml 50% urea, 6.7ml 40% acrylamide stock, 5ml 10x STBE, DDW to 50ml, 400µl 100mg/ml ammonium persulphate, 60µl TEMED

40% acrylamide stock 38% (w/v) acrylamide, 2% N,N'-bismethylene-acrylamide. Deionised, filtered and stored at 4°C.

10x STBE Tris base 108g/litre, boric acid 55g/litre, EDTA 9.3g/litre

3. Identification of potential CpG islands in the class II region by PFGE mapping

3.1. Strategy to identify potential sites of genes

The approach taken to identify the position of novel genes in the class II region was based on the observation that the 5' ends of genes are frequently associated with short stretches of CpG-rich DNA, known as CpG islands (Bird, 1987).

Vertebrate DNA is relatively depleted of C+G, and the dinucleotide CpG is particularly rare, being present at only 20% of the expected frequency. The majority of CpG dinucleotides are stably methylated at the 5 position on the cytosine ring (Bird, 1986). It has been observed, however, that about 1% of the genomic DNA in a wide variety of vertebrate species is non-methylated, as judged by cleavage of genomic DNA with the methylation sensitive restriction endonuclease HpaII (which cuts at CCGG but not CmCGG) (Cooper et al., 1983). Cloning and sequencing of a number of the HpaII fragments revealed that they were rich in C+G and that CpG was present at the expected frequency (Bird et al., 1985). Randomly selected clones were found to detect transcripts when used as probes on northern blots, and characterisation of one clone in detail revealed that the CpG rich region was located at the 5' end of two opposite strand divergent transcripts (Lavia et al., 1987). It had also been independently shown that non-methylated sequences characteristic of CpG islands were associated with the 5' ends of a number of vertebrate genes, such as the chicken α2(I) collagen gene (McKeon et al., 1982). A subsequent search of the gene databases revealed that all constitutively expressed 'housekeeping' genes and many tissue specific genes had CpG-rich islands at the 5' end (Bird, 1986; Gardiner-Garden and Frommer, 1987).

The unusual nucleotide composition of CpG islands led to the prediction that islands would be detectable in genomic DNA as containing clusters of sites for the infrequently cutting restriction endonucleases used for constructing PFGE maps (Brown and Bird, 1986; Lindsay and Bird, 1987). These 'rare-cutter' enzymes have 6 or 8bp recognition sequences containing one or more CpG, which must be unmethylated for cleavage to take place. CpG islands are therefore likely to contain sites for these enzymes. Indeed, if it is assumed that CpG islands contain 65% G+C, are not depleted in CpG and are 1kb long, whereas bulk DNA is 40% G+C with CpG occurring at 25% of the expected level, then the theoretical number of sites for each enzyme in the human genome (3x10⁹bp) and the proportion of these sites in islands can be calculated (Table 3.1).

Enzyme [class]   Target sequence   Total sites in genome   % sites in islands   Sites per island (expected)   Sites per island (observed)
NotI [a]         GCGGCCGC          4100                    89                   0.12                          0.3
BssHII [a]       GCGCGC            47000                   74                   1.2                           1.2
EagI [a]         CGGCCG            47000                   74                   1.2                           1.1
MluI [b]         ACGCGT            37000                   21                   0.3                           0.03
NruI [b]         TCGCGA            37000                   21                   0.3                           0.1

Table 3.1. Distribution of selected rare-cutter sites in human DNA. For each enzyme (all of which contain two CpGs, underlined) the total number of sites, the percentage of sites in islands and the expected number of sites per island were calculated by Lindsay and Bird (1987). The observed number of sites for each enzyme was obtained from the nucleotide sequences of the CpG islands at a random selection of human genes in the databases (Bird, 1990). Class designation ([a] or [b]) is from Bird (1990).
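The kind of calculation behind the expected values in Table 3.1 can be sketched from the base-composition assumptions stated in the text. The following Python sketch is not the published calculation of Lindsay and Bird (1987): it treats bases as independent apart from a CpG correction factor, and it assumes a total of 30,000 islands in the genome, a figure that is not given in the text and is used here purely for illustration.

```python
GENOME_BP = 3e9
ISLAND_BP = 1000
N_ISLANDS = 30000  # ASSUMPTION: island number is not stated in the text

ISLAND = {"gc": 0.65, "cpg_ratio": 1.0}  # islands: 65% G+C, CpG not depleted
BULK = {"gc": 0.40, "cpg_ratio": 0.25}   # bulk DNA: 40% G+C, CpG at 25% of expected

SITES = {"NotI": "GCGGCCGC", "BssHII": "GCGCGC", "EagI": "CGGCCG",
         "MluI": "ACGCGT", "NruI": "TCGCGA"}


def site_frequency(site, gc, cpg_ratio):
    """Probability of the site starting at a given position."""
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    freq = 1.0
    for base in site:
        freq *= base_p[base]
    # Scale by the CpG depletion factor once per CpG in the site.
    return freq * cpg_ratio ** site.count("CG")


def sites_per_island(site):
    return site_frequency(site, **ISLAND) * ISLAND_BP


def genome_totals(site):
    """Return (total sites in genome, percentage of sites lying in islands)."""
    in_islands = sites_per_island(site) * N_ISLANDS
    bulk_bp = GENOME_BP - N_ISLANDS * ISLAND_BP
    in_bulk = site_frequency(site, **BULK) * bulk_bp
    total = in_islands + in_bulk
    return total, 100.0 * in_islands / total


for enzyme, site in SITES.items():
    total, pct = genome_totals(site)
    print(f"{enzyme:7s} {site:9s} {total:8.0f} {pct:5.1f}% {sites_per_island(site):.2f}")
```

Under these assumptions the sketch gives expected sites per island and genome totals close to those tabulated. The percentage of sites in islands agrees well for the class [a] enzymes but comes out somewhat higher for the class [b] enzymes than the tabulated 21%, which may reflect differences between this simplified model and the original calculation.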

The table shows that those rare-cutter enzymes with recognition sequences consisting entirely of C+G and containing two CpG dinucleotides are predicted to be highly diagnostic for CpG islands because the majority of sites for these enzymes should occur in islands (class [a] enzymes; Bird, 1990). Most importantly, it has been demonstrated in practice that sites for class [a] enzymes are indeed clustered in CpG islands and are associated with transcribed sequences (Brown and Bird, 1986; Lindsay and Bird, 1987; Bird, 1990). The mapping and cloning of rare-cutter sites in a genomic region of interest has successfully led to the identification of novel genes in mouse and man (Rappold et al., 1987; Estivill et al., 1987). In contrast, rare-cutter enzymes with recognition sequences containing A+T in addition to C+G (class [b] enzymes, Table 3.1) are predicted in theory and shown in practice to cut more frequently in inter-island DNA (Lindsay and Bird, 1987; Bird, 1990).

The feasibility of using PFGE to map the class II region has been demonstrated by Hardy et al. (1986) and more recently by Dunham et al. (1989) and Inoko et al. (1989). The aim of these studies was to use PFGE to construct detailed physical maps of the region from which the relative order of the different subregions and the distances between them could be established. In the present study it was decided to map the class II region by PFGE with the specific aim of locating class [a] rare-cutter sites which could mark the presence of genes. These sites could then be cloned and analysed in detail for evidence of transcribed sequences.

3.2. Choice of materials for construction of the PFGE map

3.2.1. Restriction endonucleases

As discussed above, class [a] rare-cutter enzymes are predicted from theory and shown in practice to cut far more frequently in CpG islands than in inter-island DNA (Table 3.1.). These were selected as the enzymes of choice for finding CpG islands. However, it was also necessary to use class [b] rare-cutters, which tend to cut between CpG islands, so that complementary, overlapping fragments were generated which could be used to link together the class [a] fragments and build up the map. The class [a] enzymes used in this study were BssHII, EagI and NotI. The class [b] enzymes used were MluI and NruI.

3.2.2. DNA

Most individuals in the population possess two distinct class II haplotypes, the long range maps of which may differ due to polymorphisms at the rare-cutter sites, through variation in DRB gene number, or through differences in the amount of DNA between subregions (Dunham et al., 1989; Lawrance and Smith, 1990). To avoid possible difficulties in interpretation of data which could arise from using the DNA of an MHC heterozygote, PFGE blots were prepared from the DNA of the EBV-transformed B-lymphoblastoid cell line PGF, which is homozygous for the entire MHC as judged by . The haplotype of this cell line is A3 B7 DRw15(2) Dw2 DQw6 DPw4.

3.2.3. PFGE apparatus

PGF DNA digested with rare-cutter enzymes was resolved by pulsed field gradient gel electrophoresis using LKB Pulsaphor apparatus with the electrodes in the double inhomogeneous field configuration (Figure 3.1.).


Figure 3.1. Schematic illustration of a double inhomogeneous pulsed field gel electrophoresis apparatus. Electrodes are represented by black dots. Arrows indicate the direction of motion during electrophoresis of a DNA molecule loaded in the central well under the influence of the electric field which pulses first in the north-south direction and then in the west-east direction. Fractionation of molecules is believed to be dependent on the greater ability of small molecules to re-orientate themselves when the field direction changes (Smith et al., 1987).

A constant pulse time of 45-60sec and a run time of 20-24hr were found to resolve molecules of up to 900kb. One disadvantage of this gel system is that electrophoresis results in tracks which are bowed rather than straight, which can complicate comparison of fragment sizes between the lanes. However, in practice it was found that only a small amount of distortion occurred during short runs (Figure 3.2.a.). Hybridisation of blots generated with this system resulted in sharp bands (Figure 3.2.b.). Blots generated using the LKB hexagonal electrode array, which resolves DNA in straight tracks, were found to give less well defined bands (data not shown).

3.2.4. Probes

The probes for the class II DPB1, DPA1, DOB, DQA2, DRB1 and DRA genes were those described in the report of the Tenth International Histocompatibility Workshop (Marcadet et al., 1989). The probe for the DNA gene was 8bal, a 1.8kb PstI genomic fragment isolated from the cosmid JG8b (Trowsdale and Kelly, 1985).

3.3. Construction of the PFGE map

DNA from the cell line PGF was digested with rare-cutter enzymes, resolved by PFGE, transferred to nylon membranes and hybridised with probes for class II genes. The sizes of the fragments detected by each probe with each enzyme or combination of enzymes are shown in Table 3.2. Representative autoradiographs are shown in Figures 3.2.b., 3.3. and 3.4.

Figure 3.2. (a) Gel resolved by pulsed-field electrophoresis. Size markers are yeast chromosomes (YC). LM, limiting mobility. (b) Autoradiographs of the blot of the same gel hybridised sequentially with the class II gene probes indicated.

Figure 3.3. Autoradiographs of a PFGE filter sequentially hybridised with probes for the class II genes DPB1, DNA and DOB. N, NotI; B, BssHII; M, MluI. The MluI+NotI double digest is partial. Size markers are concatemers of bacteriophage lambda DNA. LM, limiting mobility.

Figure 3.4. (b) Autoradiographs of a PFGE filter hybridised sequentially with DOB and DQA2 probes, illustrating the strong and weak bands obtained with the DQA2 probe as discussed in the text. B, BssHII; E, EagI; M, MluI. Size markers are yeast chromosomes in each case.

          DPB1   DPA1   DNA    DOB    DQA2          DRB    DRA
                                      S      W

BssHII    250    250    250    210    210    500    500    500
EagI      250    250    250    210    210    500    500    500
MluI      130    450    450    450    450    570    570    570
NotI      370    370    370    >900      >900       >900   >900
NruI      >900   >900   >900   >900      >900       >900   >900
B+E       250    250    250    210        NT        500    NT
B+M       130    120    120    210    210    500    500    500
B+N       250    250    250    210    210    500    500    500
M+N       130    240    240    210    210    570    570    570

Table 3.2. Sizes of rare-cutter fragments detected by MHC class II probes on PFGE blots. Fragment sizes are in kilobases. Probes are shown across the top and rare-cutter enzymes are shown down the side (B, BssHII; E, EagI; M, MluI; N, NotI). The DQA2 probe hybridised to two bands in each track: one strong (S), corresponding to the fragment carrying the DQA2 gene, and the other weaker (W), corresponding to the fragment carrying the cross-hybridising DQA1 gene (see text for further discussion). NT, not tested. Fragment sizes were deduced from more than one blot.

The genes of the DP subregion are known from family studies to be at the centromeric end of the class II region (Shaw et al., 1981). From cosmid walking it has been established that the configuration of the genes within the subregion is DPB2-DPA2-DPB1-DPA1 (Trowsdale et al., 1984). It has also been shown by deletion mapping that the cluster is oriented on the short arm of chromosome 6 with DPB2 toward the centromere (Erlich et al., 1986). This information was used as a starting point for the construction of the PFGE map. Since the DPB1 and DPA1 genes are only 2kb apart (Kelly and Trowsdale, 1985), it was not surprising that the DPB1 and DPA1 probes detected similarly sized NotI, BssHII and EagI fragments (Table 3.2.). With MluI, however, the DPB1 probe detected a 130kb fragment while the DPA1 probe detected a 450kb fragment (Figure 3.2.b.). This result indicated that there must be an MluI site in the genomic DNA between the DPA1 and DPB1 probes. These data allowed the immediate positioning of three MluI sites: the first between the DPA1 and DPB1 probes, the second 130kb proximal and the third 450kb distal (Figure 3.5.). The 250kb BssHII fragment detected by both DPB1 and DPA1 probes was cut by MluI to give a 130kb fragment detected by DPB1 and a 120kb fragment detected by DPA1 (Table 3.2.). The 370kb NotI fragment detected by both DPB1 and DPA1 probes was also cut down by MluI to give a 130kb fragment detected by DPB1 and a 240kb fragment detected by DPA1. With additional data from EagI single and double digests (Table 3.2. and Figure 3.4.a.) it was possible to position BssHII, EagI and NotI sites on either side of the HLA-DP subregion as shown in Figure 3.5. It was already apparent that clusters of class [a] rare-cutter enzyme sites were present in the class II region. Cluster 1 contained sites for BssHII, EagI and NotI while cluster 2 contained sites for BssHII and EagI.
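The double-digest reasoning used here rests on simple additivity: when a second enzyme cuts once inside a fragment, the two pieces detected by probes on either side of the new site should add up to the original fragment. As an illustrative sketch only (the function name and the 10kb tolerance are assumptions, not part of the original analysis), the check can be written in Python using the sizes from Table 3.2:

```python
def double_digest_consistent(single_kb, left_kb, right_kb, tolerance_kb=10):
    """True if the two double-digest fragments add up to the single-digest
    fragment, within an assumed sizing tolerance."""
    return abs(single_kb - (left_kb + right_kb)) <= tolerance_kb

# Sizes in kb from Table 3.2:
# (single digest, +MluI piece seen by DPB1, +MluI piece seen by DPA1)
observations = {
    "BssHII": (250, 130, 120),
    "NotI": (370, 130, 240),
}

# Distances are measured from the MluI site lying between DPB1 and DPA1.
for enzyme, (single, dpb1_side, dpa1_side) in observations.items():
    ok = double_digest_consistent(single, dpb1_side, dpa1_side)
    print(f"{enzyme}: {dpb1_side}kb proximal + {dpa1_side}kb distal "
          f"= {dpb1_side + dpa1_side}kb; consistent with {single}kb: {ok}")
```

In both cases the pieces sum exactly to the single-digest fragment, which is what allows the BssHII and NotI sites to be placed relative to the MluI site. The real limit of resolution of the gels is not stated here, so the tolerance should be read as a placeholder.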

The DNA gene probe hybridised to fragments of similar size to those detected by the DPA1 probe in all single and double digest combinations tested (Figure 3.3. and Table 3.2.). It was concluded that the DNA gene mapped distal of the DPA1 gene, but within about 120kb as defined by the position of the distal BssHII site.

The DOB gene probe detected a 450kb MluI fragment, like the DNA and DPA1 probes (Figures 3.2. and 3.3.). However, the DOB probe detected different NotI (>900kb), EagI (210kb) and BssHII (210kb) fragments, indicating that it must map distal of the DNA gene but still within the same 450kb MluI fragment. The double digest data obtained with the DOB probe allowed the positioning of EagI and BssHII sites coincident with the central NotI site. This defined a third cluster of class [a] rare-cutter sites (Cluster 3, Figure 3.5.).

The DQA2 probe hybridised to two fragments in each track, one strongly, corresponding to the fragment carrying the DQA2 gene, and one weakly, corresponding to the fragment carrying the DQA1 gene (Figure 3.4.b.). This pattern of hybridisation has also been reported in other studies with the same probe (Biro et al., 1989; Bontrop et al., 1989). With BssHII, EagI and MluI, the strongly hybridising band (DQA2) was of a similar size to that detected with the DOB probe, while the weaker band (DQA1) was novel (Figure 3.4.b.). It was concluded that there was a fourth cluster of rare-cutter sites between the DQA2 and DQA1 genes (Cluster 4, Figure 3.5.). Because the DQA2 and DOB probes detected similarly sized fragments in each single and double digest tested, it was not possible to determine the positions of these genes relative to one another. However, the relative positions of the DOB and DQA2 genes have been established as cen-DOB-DQA2-tel for a number of different cell lines, including DR2 homozygotes, in other studies (Hardy et al., 1986; Dunham et al., 1989; Inoko et al., 1989; Blanck and Strominger, 1988). Probes for the DQB1, DQB2 and DQB3 (formerly DVB) genes were not used in the present study, since their close proximity to the DQA1 and DQA2 genes is well documented (Okada et al., 1985b; Jonsson et al., 1987; Blanck and Strominger, 1988; Ando et al., 1989; Inoko et al., 1989). The overall order of the genes in the DO and DQ subregions has been established as cen-DOB-DQB2-DQA2-DQB3-DQB1-DQA1-tel by pulsed field mapping in a DR2 homozygous cell line and by chromosome walking in a DR7 homozygous cell line (Inoko et al., 1989; Blanck and Strominger, 1988). This information was used in the construction of the map shown in Figure 3.5.

The DRB1 and DRA gene probes both hybridised to fragments of similar size to the weakly hybridising (DQA1) bands detected by the DQA2 probe (Table 3.2.). It was not possible from these data to determine the relative positions of the DQA1, DRB1 and DRA genes, but it was clear that there were no further clusters of rare-cutter sites in the class II region. For the construction of the map shown in Figure 3.5., the positions of the genes in this region were assumed to be as previously determined for a DR2 homozygous cell line: cen-DQA1-DRB1-DRB2-DRB3-DRA-tel (Inoko et al., 1989; Kawai et al., 1989).

Figure 3.5. Physical map of the class II region of the PGF haplotype, showing the positions of clusters 1 to 4 of class [a] rare-cutter sites.

The PFGE data were used to construct a physical map of the class II region of the PGF haplotype (Figure 3.5.). The novel feature of interest of this map is the presence of four clusters, as defined at the level of resolution of the PFGE technique, of class [a] rare-cutter enzyme sites, which as discussed previously are highly diagnostic for CpG islands. Each cluster contains one or more sites for EagI, BssHII and NotI. These clusters were considered likely to be the sites of novel genes and were selected as targets for cloning and further analysis.

3.4. Summary and discussion

The physical mapping data presented here reveal that the organisation of the class II region of the DR2-homozygous cell line PGF is very similar to that of other cell lines. The class II gene order deduced for PGF (centromere-DPB1-DPA1-DNA-[DOB, DQA2]-[DQA1, DRB, DRA]-telomere) is consistent with that reported for other cell lines in other studies. In another study in which DR2 haplotypes were mapped with some of the same enzymes (MluI, BssHII and NotI), the fragment sizes detected were in good agreement, within experimental error (Dunham et al., 1989). The fragment sizes detected in PGF DNA with DRA and DRB gene probes are consistent with the conclusion of Dunham et al. (1989) that DR2 haplotypes possess more DNA in the DR subregion than DR3, DR5 and DR6 haplotypes, but less than DR4 haplotypes.

The PFGE map of the PGF class II region has revealed the presence of four clusters of sites for rare-cutter enzymes which are highly diagnostic for CpG islands and which could mark the positions of novel genes (Figure 3.5.). The next step towards testing this hypothesis was to clone the regions containing the clusters of rare-cutter sites by walking from previously cloned class II genes. In general, the precise distance from each class II gene to the nearest rare-cutter sites could not be determined from the map. However, the DP subregion is anchored to the physical map because an MluI site occurs between the probes for the DPB1 and DPA1 genes, which map only 2kb apart (Kelly and Trowsdale, 1985). Cluster 1 maps 130kb centromeric of this site and cluster 2 maps 120kb telomeric. The DP subregion has been cloned on overlapping cosmids, which extend 70kb centromeric and 20kb telomeric of the MluI site as shown in Figure 3.5. (Trowsdale et al., 1985). This previously established cosmid walk was potentially a useful starting point for chromosome walking towards clusters 1 and 2, since cluster 1 was estimated to be about 60kb from the proximal end of the DP region cosmid clones, while cluster 2 was about 100kb from the distal end of the cloned region. In fact, the walk towards cluster 1 was accelerated by the serendipitous discovery that the gene for the α2 chain of type XI collagen mapped between cluster 1 and the HLA-DP subregion, as described in Chapter 4.

4. Mapping of the COL11A2 gene to the class II region

This chapter describes the mapping of the gene for the α2 chain of type XI collagen, a component of cartilage fibrils, to the centromeric end of the class II region of the MHC.

4.1. Introduction

The mechanical strength of cartilage is provided by a three-dimensional mesh of rigid, rod-like collagen fibres composed of type II, type IX and type XI collagen molecules (Mendler et al., 1989). Type II and type XI collagens are of the fibrillar class, containing a single uninterrupted triple helical domain as shown in Figure 4.1. (Miller and Gay, 1987). In cartilage collagen fibres, type II and type XI molecules are packed together in a ratio of about 8:1 in a supramolecular staggered array (Mendler et al., 1989).


Figure 4.1. Schematic diagram of a fibrillar collagen molecule. The mature molecule is 300nm long and is composed of three polypeptide chains, α1, α2 and α3. In fibrillar collagens the majority of the length of the molecule is accounted for by the long, uninterrupted triple helical domain (Miller and Gay, 1987).

Type XI collagen is a heterotrimer of α1, α2 and α3 polypeptide chains (Morris and Bachinger, 1987) which are encoded by three separate genes. The α3(XI) chain is believed to be a heavily glycosylated variant of the α1 chain of type II collagen and would therefore be a product of the COL2A1 gene at 12q14.3 (Furuto and Miller, 1983; Law et al., 1986). The α1(XI) chain gene (COL11A1) and the α2(XI) chain gene (COL11A2) have been cloned and sequenced more recently (Bernard et al., 1988; Kimura et al., 1989). COL11A1 has been mapped to human chromosome 1, region p21 (Henry et al., 1988). COL11A2 was first mapped in the Human Cytogenetics Laboratory, ICRF, to human chromosome 6, region p21.3-22, by in situ hybridisation (Figure 4.2.; Hanson et al., 1989). The MHC has previously been shown to map to 6p21.3 (Spring et al., 1985).

The possibility that COL11A2 might map near the MHC was considered extremely interesting because of the involvement of cartilage collagen in a clinically important MHC-associated disease, rheumatoid arthritis (RA; Lotz and Vaughan, 1988). About 75% of RA patients have antibodies autoreactive to type II collagen, and type II collagen-reactive T-lymphocytes are found at high frequency in the synovial infiltrates of chronically affected individuals (Londei et al., 1989). The importance of an autoimmune response to cartilage collagen in the development of the arthritic condition has been demonstrated in rodents and primates where the disease can be induced with injections of type II collagen (Stuart et al., 1984). The disease can also be transferred by injection of antibodies or T-cells from affected individuals. The arthritogenic potential of type XI collagen has not yet been investigated in such detail because type XI collagen was discovered much more recently. However, it has now been reported that inoculation of mice with type XI collagen also induces the development of arthritis (Boissier et al., 1990). In view of these results, it was decided to map the COL11A2 gene in detail relative to the MHC. If the gene was found to be closely linked it was hoped to perform RFLP studies on diseased and normal individuals to test whether certain COL11A2 alleles were associated with the incidence of RA.

Figure 4.2. Histogram showing distribution of silver grains over metaphase human chromosomes following in situ hybridisation with the COL11A2 probe. Analysis of 96 metaphase chromosome spreads revealed a clustering of grains over bands 6p21.3-22 (Hanson et al., 1989).

4.2. Mapping of COL11A2 using somatic cell hybrids

To confirm and refine the COL11A2 map position, it was decided to test for the presence of the gene in the DNA of a pair of complementary human/mouse somatic cell hybrids, MCP-6 and 56-47, which contain portions of chromosome 6 overlapping at band 6p21 as shown in Figure 4.3.a. MCP-6 contains the translocation chromosome t(6:X)(6qter-6p21::Xq13-Xqter) (Goodfellow et al., 1982) and 56-47 contains the translocation chromosome t(6:17)(6pter-6p21::17p13-17qter) along with other human chromosomes (Nagarajan et al., 1986). A Southern blot was prepared from EcoRI-digested DNA from the hybrids, and from total human and mouse genomic DNA as controls. Hybridisation of this blot with the COL11A2 probe, a 4.5kb EcoRI/BamHI fragment from the cosmid cosHcol.11 (Hanson et al., 1989), gave the result shown in Figure 4.3.b. The probe detected a fragment of about 11kb in total human, MCP-6 and 56-47 DNAs but not in mouse genomic DNA at the stringency used for washing (0.1x SSC, 65°C). It was therefore concluded that the COL11A2 gene must map to 6p21.1-3, the region shared by the hybrids (Figure 4.3.a.).

Taken together, the results from in situ hybridisation (6p21.3-22, Figure 4.2.) and somatic cell hybrid mapping (6p21.1-3, Figure 4.3.) suggested that the most likely location of COL11A2 was in band 6p21.3, the map position of the MHC. It was therefore decided to attempt to link the COL11A2 locus with the MHC by PFGE mapping.
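The argument here is an intersection of two map intervals. A minimal sketch in Python, given for illustration only, makes the step explicit; the band list is restricted to the four bands of 6p named in the text, and the helper name is hypothetical.

```python
# Bands of 6p needed for this argument, ordered from centromere to telomere.
BANDS_6P = ["p21.1", "p21.2", "p21.3", "p22"]

def band_range(first, last):
    """Set of bands from `first` to `last` inclusive."""
    i, j = BANDS_6P.index(first), BANDS_6P.index(last)
    return set(BANDS_6P[i:j + 1])

in_situ = band_range("p21.3", "p22")    # in situ hybridisation: 6p21.3-22
hybrids = band_range("p21.1", "p21.3")  # somatic cell hybrids: 6p21.1-3

# Bands compatible with both experiments.
shared = sorted(in_situ & hybrids, key=BANDS_6P.index)
print(shared)
```

Only band 6p21.3 is compatible with both results, which is the band to which the MHC has been mapped.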

4.3. Mapping of COL11A2 by PFGE

When the COL11A2 gene probe was hybridised to PFGE blots it was found that it detected fragments similar in size to those detected by the DPB1 gene probe in all single and double digest combinations tested (Figure 4.4. and Table 4.1.).

Figure 4.3. (a) Intact human chromosome 6 (showing the position of the MHC), and regions of chromosome 6 contained in the human/mouse hybrids 56-47 and MCP-6. (b) Autoradiograph of Southern blot of EcoRI-digested DNA hybridised with the COL11A2 probe and washed to a final stringency of 0.1xSSC, 65°C. Lane 1, total human DNA; lane 2, total mouse DNA; lane 3, MCP-6; lane 4, 56-47.

Figure 4.4. Autoradiographs of a PFGE filter hybridised sequentially with probes for the DPB1 and COL11A2 genes. The filter was stripped and checked for retention of signal by autoradiography between hybridisations. B, BssHII; M, MluI; N, NotI.

                      PROBE
DIGEST          DPB1       COL11A2

BssHII          250kb      250kb
EagI            250kb      250kb
MluI            130kb      130kb
NotI            370kb      370kb
NruI            >900kb     >900kb
BssHII+EagI     250kb      250kb
BssHII+MluI     130kb      130kb
BssHII+NotI     250kb      250kb
MluI+NotI       130kb      130kb

Table 4.1. Sizes of fragments detected by hybridising probes for the COL11A2 and DPB1 genes to PFGE blots of DNA from the cell line PGF cut with the enzymes indicated (Figure 4.4. and Hanson et al., 1989).

It was concluded that the COL11A2 and DPB1 genes were in close physical proximity at the centromeric end of the class II region. The upper limit on the distance between the two probes is given by the length of the shortest common band detected, a 130kb MluI fragment (Figure 4.4.). As shown in Chapter 3, the telomeric end of this 130kb MluI fragment falls between the DPA1 and DPB1 genes, while the centromeric end occurs within cluster 1 (Figure 3.5.).
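The upper limit quoted above is simply the smallest fragment on which both probes are found together. A short Python sketch, included only to illustrate the reasoning, applies this to the sizes in Table 4.1; the data structure and function name are hypothetical, and unresolved fragments (>900kb) are entered as None.

```python
# Fragment sizes in kb from Table 4.1 as (DPB1, COL11A2); None = >900kb.
TABLE_4_1 = {
    "BssHII": (250, 250),
    "EagI": (250, 250),
    "MluI": (130, 130),
    "NotI": (370, 370),
    "NruI": (None, None),
    "BssHII+EagI": (250, 250),
    "BssHII+MluI": (130, 130),
    "BssHII+NotI": (250, 250),
    "MluI+NotI": (130, 130),
}

def max_separation_kb(table):
    """Smallest resolved fragment detected by both probes, taken as the
    upper limit on the distance between them."""
    shared = [a for a, b in table.values() if a is not None and a == b]
    return min(shared)

print(max_separation_kb(TABLE_4_1))
```

Matching band sizes are treated here as evidence that the two probes lie on the same fragment, as in the text; coincidentally co-migrating fragments would not be distinguished by this sketch.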

4.4. Cosmid walking between the DP subregion and COL11A2.

The genes of the DP subregion have previously been isolated on overlapping cosmid clones (MANN 2.3, MANN 3.6 and MANN 2.2, Figure 4.5.; Trowsdale et al., 1985). These extend approximately 70kb proximal of the MluI site between DPA1 and DPB1. However, the COL11A2 probe did not hybridise to blots of DNA from these cosmids (data not shown). The COL11A2 gene was therefore deduced to map centromeric of the previously cloned region but still within the same 130kb MluI fragment, as shown in Figure 4.5.


The DP-region cosmid walk was extended in the centromeric direction to confirm this and in so doing to link the COL11A2 locus to the DP genes.

A 4.5kb ClaI/SalI fragment was subcloned into the Bluescript vector from the end of MANN 2.3, the most centromeric cosmid in the cloned region (Figure 4.5.). The ClaI site was in the genomic insert of MANN 2.3 and the SalI site was at the cloning site of the cosmid vector, pTCF. The subcloned fragment, pCS2, was found to contain many repetitive sequences when used as a probe on Southern blots of human genomic DNA. pCS2 was therefore cleaved with RsaI to generate several smaller fragments, each of which was tested for the presence of repeats by hybridising back to genomic Southern blots. One 500bp RsaI fragment (WP in Figure 4.5.) was found to be single copy and was used to screen the cosmid library. Two new cosmids, HPB.ALL 1 and HPB.ALL 8, were isolated with this probe and were shown to span the gap between MANN 2.3 and cosHcol.11, the cosmid clone carrying the COL11A2 gene (Figure 4.5.). Since the position and orientation of the COL11A2 gene within cosHcol.11 was already known (Dr. Kathy Cheah, personal communication), this result established that the COL11A2 gene was 45kb centromeric to DPB2 and oriented with the 3' end of the gene nearest the MHC.

4.5. Summary and discussion

Using mapping techniques of increasing resolution, the gene for the α2 chain of type XI collagen has been localised to the centromeric end of the class II region of the MHC. This is consistent with two other reports localising COL11A2 to the short arm of chromosome 6, although neither of these studies mapped the gene relative to the MHC (Kimura et al., 1989; Law et al., 1990). This assignment is in keeping with the general trend for diverse map locations of human fibrillar collagen genes. The COL3A1 and COL5A2 loci are both found within the chromosomal region 2q24.3-31, but COL1A1, COL1A2, COL2A1, COL11A1 and COL11A2 map to 17q21-22, 7q21.3-22.1, 12q14.3, 1p21 and 6p21.3 respectively (Vuorio and de Crombrugghe, 1990).

The finding that the COL11A2 locus maps only 45kb proximal to the DPB2 gene is intriguing, given the association of RA with the class II region, and the involvement of cartilage collagens described in section 4.1. However, the major association of RA with the class II region is seen with the DR subregion (Nepom, 1990) and in studies where a possible association of RA with the DP subregion has been investigated, no statistically significant difference in the frequency of DP alleles in patients and controls has been observed (Begovich et al., 1989; Stephens et al., 1989). Thus it is unlikely that the close linkage of the COL11A2 gene to the class II region is significant for the understanding of the association of classical RA with the class II region. However, another MHC-associated disease, pauciarticular juvenile rheumatoid arthritis (PJRA), does demonstrate an association with DP alleles; specifically, the incidence of DPw2 is increased in patients compared to controls (Odum et al., 1986; Begovich et al., 1989; Fugger et al., 1990). Anti-collagen antibodies have been observed in PJRA patients (Lotz and Vaughan, 1988). It remains possible therefore that the association of PJRA with the DP subregion might be at least partly explained by linkage disequilibrium between alleles of the DP genes and alleles of the closely physically linked COL11A2 locus. Consequently it was decided to search for RFLPs at the COL11A2 locus which could be used to extend the disease association studies. This work is described in Chapter 8.

In the course of the work described in this chapter, overlapping cosmid clones were isolated which extended approximately 130kb proximal of the MluI site between DPA1 and DPB1. The genomic insert of cosHcol.11 was mapped and shown not to contain an MluI site. The proximal end of the 130kb MluI fragment (Figure 4.5.) must therefore be located just proximal of cosHcol.11. Since this MluI site was known to map within the most centromeric of the clusters of rare-cutter sites (Cluster 1, Figure 3.5.), the cosmid walk was extended further to encompass this region, as described in Chapter 5.

5. Identification of novel genes associated with clusters of rare-cutter sites

This chapter describes the cloning of three of the clusters of rare-cutter sites identified by PFGE mapping of the class II region (Chapter 3), and the subsequent identification of five novel genes.

5.1. Introduction

The work described in Chapter 4, i.e. the positioning of the COL11A2 gene centromeric of the DP subregion, provided a useful step towards the cloning of the first cluster of rare-cutter sites identified by PFGE (Cluster 1, Figure 3.5.). As shown in Figure 4.5., the centromeric end of the 130kb MluI fragment which carries the COL11A2 and DPB1 genes must fall just beyond the proximal end of cosmid cosHcol.11. Since this MluI site occurs in cluster 1, the cosmid walk was extended in the centromeric direction in order to encompass the rare-cutter sites of cluster 1.

5.2. Cloning of cluster 1 by cosmid walking

A 7.5kb EcoRI fragment (E3, Figure 5.2.) was isolated from the centromeric end of cosHcol.11 and used as a probe to screen the cosmid library. Four new overlapping cosmids, HPB.ALL 25, HPB.ALL 31, HPB.ALL 33 and HPB.ALL 42, were isolated and mapped with restriction enzymes, including the rare-cutters. A complete map of the cloned region is presented in Figure 5.1. and a more detailed restriction map of the new cosmids is shown in Figure 5.2. As discussed above, the PFGE data indicated that there should be an MluI site just centromeric of the cosmid cosHcol.11. This was indeed found to be the case. In addition to the single MluI site, this region was found to contain one BssHII site, five EagI sites and one NotI site (Figure 5.2.). To prove that these were the cluster 1 sites as detected by PFGE, it was necessary to demonstrate that they were unmethylated in genomic DNA.

Figure 5.1. Map of the cloned region, showing the COL11A2, DPB2, DPA2, DPB1 and DPA1 genes.

Figure 5.2. Restriction maps of new overlapping cosmid clones, showing the cluster of sites for the rare-cutter enzymes BssHII, EagI, MluI and NotI. Brackets indicate a portion of cosmid HPB.ALL 33 that was found to be rearranged. Also shown are the positions of various restriction enzyme fragments which were isolated for use as probes.

The methylation status of the BssHII, MluI and NotI sites in genomic DNA was determined by hybridising PFGE blots with probes mapping distal (COL11A2) and proximal (33X1, Figure 5.2.) of these sites as deduced from the cosmid map. The results are shown in Figure 5.3. and summarised in Table 5.1.

               PROBE
DIGEST         33X1      COL11A2

BssHII         100kb     200kb
MluI           230kb     130kb
NotI           440kb     370kb
BssHII+MluI    100kb     130kb
BssHII+NotI    100kb     250kb
MluI+NotI      230kb     130kb

Table 5.1. Sizes of fragments detected by probes distal (COL11A2) and proximal (33X1) of the BssHII, MluI and NotI sites in the new cosmid clones.

The two probes detected differently sized fragments with BssHII, MluI and NotI, showing that these sites were cut and therefore unmethylated in genomic DNA.
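The reasoning applied to Table 5.1. can be sketched in a few lines of Python. This is only an illustration: the sizes are the single-digest values from the table, while the names `TABLE_5_1`, `site_is_cut` and `cut_status` are hypothetical. The sketch assumes that two flanking probes giving different fragment sizes lie on different fragments; equal sizes would be ambiguous, not proof that the site is uncut.

```python
# Single-digest fragment sizes (kb) from Table 5.1., keyed by enzyme, then probe.
TABLE_5_1 = {
    "BssHII": {"33X1": 100, "COL11A2": 200},
    "MluI": {"33X1": 230, "COL11A2": 130},
    "NotI": {"33X1": 440, "COL11A2": 370},
}


def site_is_cut(sizes_by_probe):
    """Return True if the flanking probes detect fragments of different size.

    Different sizes mean the probes lie on different fragments, so the
    site between them must have been cleaved (i.e. it is unmethylated).
    """
    return len(set(sizes_by_probe.values())) > 1


# Apply the test to each rare-cutter enzyme in turn.
cut_status = {enzyme: site_is_cut(sizes) for enzyme, sizes in TABLE_5_1.items()}
```

With the values in the table, all three enzymes are reported as cutting between the two probes, in agreement with the conclusion drawn in the text.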

To determine the methylation status of the EagI sites, Southern blots were prepared from genomic (PGF) DNA and cosmid (HPB.ALL 42) DNA cut with EagI and resolved by conventional gel electrophoresis. These blots were then hybridised with fragments adjacent to or spanning the EagI sites as deduced from the cosmid map. One result is shown in Figure 5.4. In cosmid DNA, which is unmethylated and therefore cut at all EagI sites, the probe 33X3 detected two EagI fragments because it spans the central EagI site. In PGF DNA, however, the same probe detected a single fragment whose length was the sum of the two cosmid fragments. From these data it was deduced that the central EagI site was uncut and therefore methylated in genomic DNA while the two flanking sites were cut and therefore unmethylated. In the same way it was shown that all but one of the EagI sites were unmethylated in the genome.

Figure 5.3. Determination of the methylation status of cloned BssHII, MluI and NotI sites in genomic DNA. At the top is shown a map of the rare-cutter sites in the overlapping cosmid clones. The autoradiographs show a pulsed field blot hybridised sequentially with probes mapping proximal (33X1) and distal (COL11A2) of the BssHII, MluI and NotI sites in the cosmid clones.

Figure 5.4. Determination of the methylation status of cloned EagI sites in genomic DNA. At the top is shown a summary map of the rare-cutter sites in the overlapping cosmid clones. The autoradiographs show blots of cosmid DNA (HPB.ALL 42) and genomic DNA (PGF) cut with EagI and hybridised with the genomic fragment 33X3. Bars represent lambda/HindIII DNA size markers; from the top these are 23.1kb, 9.4kb, 6.6kb, 4.4kb, 2.3kb and 2.0kb.

The conclusion from these studies was that all but one of the rare-cutter sites cloned by extending the cosmid walk were unmethylated in genomic DNA and therefore that these must be the sites in cluster 1 as detected by PFGE mapping. The methylation data are summarised in Figure 5.13.a.

The fact that the unmethylated rare-cutter sites in cluster 1 were spread over 20kb, whereas a typical CpG island spans 1-2kb, suggested that there might be more than one gene in this region. To test this hypothesis, cosmid fragments close to the unmethylated rare-cutter sites were isolated and used to probe northern blots, zoo blots and cDNA libraries to obtain evidence for transcribed sequences as described in section 5.5.
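As an illustration of how a cluster of rare-cutter sites and its span could be found in sequence data, the following Python sketch scans a sequence for the recognition sites of the four enzymes used here (BssHII, GCGCGC; EagI, CGGCCG; MluI, ACGCGT; NotI, GCGGCCGC). No sequence of the cluster 1 region is given in this chapter, so the function names and any input sequence are hypothetical. All four sites are palindromic, so only one strand needs to be searched.

```python
import re

# Recognition sequences of the rare-cutter enzymes used in this chapter.
RARE_CUTTERS = {
    "BssHII": "GCGCGC",
    "EagI": "CGGCCG",
    "MluI": "ACGCGT",
    "NotI": "GCGGCCGC",
}


def find_sites(sequence):
    """Map each enzyme to the 0-based start positions of its sites."""
    seq = sequence.upper()
    hits = {}
    for enzyme, site in RARE_CUTTERS.items():
        # A lookahead pattern reports overlapping occurrences as well.
        hits[enzyme] = [m.start() for m in re.finditer(f"(?={site})", seq)]
    return hits


def cluster_span(hits):
    """Distance between the first and last site start, in nucleotides."""
    positions = [p for plist in hits.values() for p in plist]
    return max(positions) - min(positions) if positions else 0
```

Because every NotI site contains an EagI site, a NotI position is always accompanied by an EagI position one nucleotide further on.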

5.3. Cloning of cluster 2 by chromosome jumping

The cloning of the unmethylated rare-cutter sites in cluster 1 made feasible the use of rare-cutter jumping libraries to clone adjacent clusters (Poustka and Lehrach, 1988). In this technique, the two ends of large genomic DNA fragments generated by cleavage with a rare-cutter enzyme are co-cloned, as shown in Figure 5.5. A probe adjacent to a site for that rare-cutter enzyme in the genome (i.e. at one end of a rare-cutter fragment) can be used to screen the library and obtain cloned DNA from the other end of the same fragment, which may be hundreds of kilobases away. It was decided to use fragments adjacent to, and distal of, the BssHII and NotI sites of cluster 1 as probes to screen BssHII and NotI jumping libraries respectively. In theory, this approach should facilitate the cloning of the nearest distal unmethylated BssHII site (i.e. in cluster 2) and the nearest distal unmethylated NotI site (i.e. in cluster 3).

Figure 5.5. Construction of a rare-cutter jumping library (Poustka and Lehrach, 1988). The steps shown are: cleave genomic DNA with rare-cutter enzyme (R); circularise in presence of selectable marker; cleave with frequently cutting enzyme (F); ligate into phage arms and select for presence of marker. Details of the enzymes and selection system used are given in the text (this chapter and Materials and Methods).

The probe used to screen the NotI jumping library was 31KN2, a 1.9kb KpnI/NotI fragment mapping adjacent to, and distal of, the NotI site in cosmid HPB.ALL 31 (Figure 5.2.). Unfortunately, no positives were obtained with this probe. One explanation for this negative result was that by chance the nearest site for the second enzyme used to construct the library was too far from the NotI site, thereby generating a fragment too large to clone into the phage vector.
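The logic of a jumping clone, and the suggested reason for the failure of the NotI jump, can be sketched as follows. This is a simplified model, not the procedure of Poustka and Lehrach: coordinates are in base pairs along one rare-cutter fragment, the marker plasmid is ignored, and the function name, the `max_insert_bp` parameter and all numbers in the usage note are hypothetical.

```python
def jump_clone(r_left, r_right, f_sites, max_insert_bp=None):
    """Return the two end pieces retained in a jumping clone, or None.

    r_left, r_right : positions of the two rare-cutter (R) sites that
                      bound the circularised genomic fragment.
    f_sites         : positions of sites for the frequent cutter (F).
    max_insert_bp   : optional cloning capacity of the vector.

    After circularisation the two R ends are joined through the marker.
    Cutting with F releases a junction piece running from the last F
    site before r_right, through the marker, to the first F site after
    r_left.
    """
    inner = sorted(f for f in f_sites if r_left < f < r_right)
    if not inner:
        return None  # no F site inside the fragment: nothing is released
    left_len = inner[0] - r_left
    right_len = r_right - inner[-1]
    if max_insert_bp is not None and left_len + right_len > max_insert_bp:
        return None  # junction piece too large for the vector
    return left_len, right_len
```

For example, `jump_clone(0, 250_000, [2_000, 40_000, 248_900])` gives end pieces of 2000bp and 1100bp, whereas with F sites only at 30000 and 120000 and a capacity of 10000bp no clone is recovered, which corresponds to the explanation offered above for the negative result.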

The probe used to screen the BssHII jumping library was jBK, a 3.7kb BssHII/KpnI fragment isolated from cosmid HPB.ALL 31 (Figure 5.2.). This probe detected five positives which were picked and the recombinant phage DNA purified. By restriction enzyme mapping it was shown that four of the positives were identical (clone λj2) while the fifth was different (clone λj3).

The jumping clone λj3 was shown by restriction enzyme mapping with SalI and BamHI to contain two genomic inserts, a 0.7kb SalI/BamHI fragment and a 2.0kb SalI/BamHI fragment. In each case the SalI site was derived from the polylinker of the marker plasmid pMLS-Mlu-Not while the BamHI site was presumably genomic. The inserts were subcloned into the Bluescript plasmid vector. The 2.0kb SalI/BamHI fragment was shown by hybridisation to derive from the starting probe jBK. The 0.7kb fragment, when hybridised to PFGE blots, detected fragments of similar size to those detected by the COL11A2 and DPB1 genes; in particular it detected the 130kb MluI band (data not shown). This result was unexpected because a successful 'jump' to the nearest unmethylated distal BssHII site (in cluster 2) should have resulted in the cloning of a fragment which hybridised to the 450kb MluI band (Figure 3.5.). In order to work out where the 0.7kb insert was derived from, it was hybridised to a blot of DNA from the entire cosmid walk, spanning from DPA1 to cluster 1 (Figure 5.1.). The probe hybridised strongly to cosmid MANN 3.6 and more weakly to MANN 2.2 and MANN 2.3. This suggested that the 0.7kb insert could be derived from the DPB1 gene and was also cross-hybridising to the related DPB2 gene. To test this, a portion of the subcloned insert was sequenced using primers complementary to the Bluescript vector cloning site, and the partial nucleotide sequence obtained was found to be identical over 153 nucleotides (8195-8348 in Kelly and Trowsdale, 1985) to a region of the second intron of the DPB1 gene, except for a single nucleotide mismatch at position 8270. These data confirmed that the 0.7kb insert of λj3 was derived from the DPB1 gene. The sequenced region mapped near a BssHII site in the second exon of the DPB1 gene (position 7736 in Kelly and Trowsdale, 1985).

It was therefore concluded that λj3 contained the ends of an unusual jump from the BssHII site in cluster 1 to the BssHII site in the DPB1 gene (Figure 5.8.). The BssHII site in the DPB1 gene must have been unmethylated, and hence cleaved, in the cell line from which the library was constructed, although in PFGE analysis of this region in PGF DNA (and in other published maps from other cell lines) the cleavage of this BssHII site has never been observed.

The jumping clone λj2 was shown by restriction mapping with SalI, BamHI and HindIII to contain two genomic inserts, a 1.1kb SalI/HindIII fragment and a 2.0kb SalI/BamHI fragment (Figure 5.6.a.). The SalI site in each case was derived from the polylinker of the marker plasmid, while the BamHI and HindIII sites were presumably genomic. The 2.0kb SalI/BamHI insert was shown by hybridisation to be derived from the starting probe, jBK, and was probably identical to the 2.0kb BamHI/SalI insert identified in λj3. The 1.1kb SalI/HindIII fragment, designated λj2SH, was hybridised to PFGE blots and was found to detect a 250kb BssHII fragment, a 450kb MluI fragment and a 370kb NotI fragment. This pattern of hybridisation would be expected for a probe mapping between DPA2 and cluster 2 (Figure 3.5.). λj2SH was then used to screen the cosmid library and a positive clone, HPB.ALL 71, was isolated and characterised by restriction enzyme mapping (Figure 5.6.b.). λj2SH hybridised just centromeric of the most centromeric BssHII site in HPB.ALL 71.

From the PFGE map, cluster 2 contained unmethylated sites for BssHII and EagI (Figure 3.5.). Cosmid HPB.ALL 71 was found to contain five BssHII sites and two EagI sites spread over about 7kb (Figure 5.6.b.). To prove that the sites in HPB.ALL 71 mapped to cluster 2 (and not to cryptic rare-cutter sites, as was found for λj3) it was necessary to demonstrate that at least one of the BssHII sites and one of the EagI sites in HPB.ALL 71 were unmethylated in genomic DNA.

This was done by hybridising cosmid fragment 71Bs1 (3.9kb BssHII fragment, Figure 5.6.b.) to blots of genomic (PGF) DNA cut with BssHII or EagI and resolved by conventional electrophoresis. Probe 71Bs1 detected a 3.9kb BssHII fragment and a 1.6kb EagI fragment in genomic DNA (Figure 5.7.). Thus the two BssHII sites flanking 71Bs1 and the two EagI sites within 71Bs1 were shown to be unmethylated in genomic DNA (Figure 5.13.b.). This fact, coupled with the PFGE mapping data from λj2SH, led to the conclusion that HPB.ALL 71 must map to cluster 2 (Figure 5.8.).

Figure 5.6. (b) Restriction map of the cosmid HPB.ALL 71, which was isolated by screening the library with the 1.1kb HindIII/SalI genomic insert from λj2. The two BssHII fragments, 71Bs3 and 71Bs1, which were subsequently used as probes to screen the cDNA library, are shown. The scale bar applies to both clones.

Figure 5.7. Determination of the methylation status of cloned BssHII and EagI sites in genomic DNA. The autoradiographs show blots of genomic DNA (PGF) cut with BssHII or EagI and hybridised with the 3.9kb BssHII fragment 71Bs1 from cosmid HPB.ALL 71 (Figure 5.6.). All sizes are in kilobases.

To test for the presence of genes at cluster 2, fragments mapping between the BssHII sites in HPB.ALL 71 were isolated and used as probes for zoo blots and cDNA libraries as described in section 5.5.

5.4. Identification of cluster 3 in previously isolated cosmid clones

Cluster 3 was cloned by taking advantage of a previously established cosmid walk around the DOB gene (Figure 5.9.; Blanck and Strominger, 1988). These cosmid clones extended 60kb centromeric of the DOB gene. From the PFGE map shown in Figure 3.5. it was apparent that cluster 3 mapped on the centromeric side of the DOB gene, although its precise position could not be determined. It was decided to test for the presence of rare-cutter sites in the cosmids mapping centromeric of the DOB gene.

Cosmid clones U10 and U15 (a gift from George Blanck) were digested with BssHII and NotI. It was found that cosmid U15, the most proximal clone in the walk of Blanck and Strominger (Figure 5.9.a.), contained two BssHII sites and two NotI sites as shown in Figure 5.9.b. The methylation status of these sites in genomic DNA was not determined at this stage. However, a cDNA clone subsequently isolated with a probe from U15 was hybridised to a PFGE filter of human DNA digested with BssHII and NotI in single and double digest combinations (Figure 5.10.). The cDNA probe detected a fragment of about 10kb in each track (the fragment of about 440kb in the NotI track of Figure 5.10. was a residual signal from a prior hybridisation). This result indicated that both NotI sites and both BssHII sites in U15 are unmethylated in genomic DNA (Figure 5.13.c.).

Figure 5.9. (a) Map of overlapping cosmid clones isolated by Blanck and Strominger (1988). The DXβ, DXα, DVβ, DQβ and DQα genes are now known as DQB2, DQA2, DQB3, DQB1 and DQA1 respectively. (b) Detailed restriction map of cosmid U15 and the overlapping portion of cosmid U10. The position of the probe U15KN is shown.

Figure 5.10. Determination of the methylation status of cloned BssHII and NotI sites in genomic DNA. The autoradiograph shows the result of hybridising the RING4 cDNA probe 2.1 to a PFGE filter of human genomic DNA (PGF) digested with BssHII, BssHII+NotI and NotI. Size markers are concatemers of bacteriophage lambda DNA.

5.5. Analysis of cloned regions for coding sequences

5.5.1. Zoo blots

Transcribed DNA sequences are more highly conserved during evolution than non-coding regions, and genomic fragments which cross-hybridise with the genomic DNA of other species may indicate the presence of genes (Monaco et al., 1986). This principle was used to test for the presence of genes in clusters 1 and 2. Restriction fragments mapping close to the unmethylated rare-cutter sites were isolated from the cosmid clones and probed onto 'zoo-blots' of EcoRI-cut genomic DNA from different vertebrate species. A representative result is shown in Figure 5.11. A summary of the results obtained with all probes tested is given in Table 5.2.

                 PROBE
SPECIES          33X1    31K1    jBK     71Bs3

Rhesus monkey    +       +       +       +
Pig              +       +       +       +
Rat              +       -       +       -
Mouse            +       +       +       +
Whale            +       +       +       +
Chicken          -       -       -       -

Table 5.2. Summary of results obtained by probing cosmid fragments from clusters 1 and 2 onto zoo blots of EcoRI-cut genomic DNA from a variety of vertebrate species. Positions of probes are shown in Figures 5.2. and 5.6. (+), cross hybridisation detected; (-), no cross hybridisation detected. Blots were washed to a final stringency of 2x SSC, 65°C. A representative result is shown in Figure 5.11.

The results obtained with the probes listed in Table 5.2. provide good evidence for the conservation of these sequences in the genomes of other organisms. These fragments may therefore contain exons of genes. To test this hypothesis, the probes were subsequently hybridised to northern blots to obtain evidence for transcription.

Figure 5.11. Zoo blot analysis of genomic fragment 33X1. The autoradiograph shows the result of hybridising probe 33X1 to a Southern blot of EcoRI-cut genomic DNA from different vertebrate species. The final washing stringency was 2x SSC at 65°C.

5.5.2. RNA blots

The conserved fragments identified through zoo blotting were hybridised to northern blots of poly(A)+ RNA from a variety of cell lines to test whether or not they detected transcribed sequences. A representative result with the genomic fragment 33X1 is shown in Figure 5.12. 33X1 detected an RNA species of about 1.6kb in RNA from the cell lines K562 (erythroleukaemia) and U937 (histiocytic leukaemia) but not HL60 (promyelocytic leukaemia) or WEHI (mouse promyelocytic leukaemia). The RNA species detected by 33X1 is probably a polyadenylated mRNA because it is increased in abundance in poly(A)+ RNA (Figure 5.12.). Similar results were obtained using the probe 31K1 from cluster 1 and λj2SH from cluster 2 (data not shown). However, this was not conclusive evidence that these genomic fragments encoded the gene (or part of the gene) giving rise to the transcript; it remained possible that the transcribed locus was elsewhere and that the genomic fragments tested here contained pseudogenes or other sequences related to the true expressed locus which were cross-hybridising with the RNA species under the conditions used. More convincing evidence that these probes were really part of functional genes was obtained by the isolation of cDNA clones, as described below.

No hybridisation to northern blots was detected with the probe jBK. In practice, however, it was often difficult to obtain convincing results by hybridising northern blots with genomic fragments. Negative results were difficult to interpret because a given genomic fragment could contain a very small portion of an exon, which may not hybridise under the conditions used. Furthermore, a given gene may not be expressed in the cell lines tested at a level detectable by the northern blotting technique. A more satisfactory approach to determining whether a genomic fragment encoded transcribed sequences was to hybridise that fragment directly to cDNA libraries.

Figure 5.12. Northern blot analysis of genomic fragment 33X1. The autoradiograph shows the result of hybridising probe 33X1 to a northern blot of total and poly(A)+ RNA from different cell lines as described in the text. The final washing stringency was 2x SSPE at 65°C.

5.5.3. cDNA libraries

cDNA clones were isolated by screening cDNA library filters with each of the probes 33X1, 31K1, 31KN2 (Cluster 1), 71Bs3, 71Bs1 (Cluster 2) and U15KN (Cluster 3). All of these genomic fragments detected positive colonies. In each case, DNA from positive colonies was purified and the longest clone was generally selected for detailed characterisation (Table 5.3.). However, four cluster 2 cDNA clones were characterised because these were isolated using two different genomic fragments, 71Bs3 and 71Bs1, and it was not initially certain that only one gene was present, although this was quickly established by restriction enzyme mapping of the clones. The cDNA clone inserts were hybridised back to pulsed field blots to confirm that the only location in the human genome of sequences related to that cDNA clone was within the MHC. Each cDNA probe was found to give the hybridisation pattern expected from the known location of the corresponding genomic probe. This eliminated the possibility that the genomic fragments contained pseudogenes which were detecting cDNAs derived from transcripts from genes elsewhere.

The cDNA clones were hybridised back onto the relevant cosmid clones to define the approximate positions of the cognate genes (Figure 5.13.). These genes were designated RING (Really Interesting New Gene) 1-5. RING1, RING2 and RING5, in cluster 1, were respectively 95kb, 90kb and 85kb proximal to the DPB2 gene. RING3, in cluster 2, was 110kb distal of the DPA1 gene. RING4, in cluster 3, was 25kb proximal of the DOB gene (Figure 5.16.).

Figure 5.13. New genes in clusters 1, 2 and 3. The positions of the rare-cutter sites within each cluster, together with the methylation status of each site in genomic DNA, are shown (B, BssHII; E, EagI; M, MluI; N, NotI; o, site unmethylated; x, methylated; -, not tested). The positions of the cosmid fragments from each cluster used to probe the cDNA library are indicated. Boxed regions show the positions of the five novel genes, RING1-5, as defined by mapping the cDNA clones back onto the cosmids. The maps are oriented with the centromere at the left.

GENOMIC    cDNA           cDNA     INSERT
PROBE      LIBRARY        CLONE    SIZE      GENE

33X1       CEM (T-LCL)    CEM15    1.5kb     RING1
31K1       CEM (T-LCL)    CEM21    1.5kb     RING2
31KN2      U937+IFNγ      γU5      2.3kb     RING5
71Bs1      CEM (T-LCL)    CEM32    4.0kb     RING3
           CEM (T-LCL)    CEM35    2.3kb     RING3
71Bs3      CEM (T-LCL)    CEM41    3.0kb     RING3
           CEM (T-LCL)    CEM44    1.6kb     RING3
U15KN      U937           2.1      2.6kb     RING4

Table 5.3. cDNA clones isolated by screening cDNA libraries with genomic fragments from clusters 1, 2 and 3. The positions of the probes and the cognate cDNAs within each cluster are summarised in Figure 5.13.

5.6. Expression patterns of RING1-5

Northern blots containing RNA from a range of cell lines were hybridised with the cDNA clones for the novel genes RING1-5 and washed at high stringency (0.1x SSPE, 65°C). Representative results are shown in Figure 5.14. and summarised in Table 5.4.

The RING1 cDNA probe CEM15 detected a transcript of about 1.6kb in poly(A)+ RNA samples from all cell lines tested. The transcript was expressed at particularly high levels in Molt4, a T-lymphoblastoid cell line (LCL), although expression was also significantly higher in Mann (B-LCL) than in K562 (erythroleukaemia) or U937 (macrophage).

The RING2 cDNA probe CEM21 detected a transcript of about 1.1kb in poly(A)+ RNA from T- and B-LCLs. No expression was detected in the macrophage cell line U937 or the erythroleukaemia line K562, even after prolonged autoradiographic exposure.

Figure 5.14. [...] panels) or total (right hand panel) RNA from different cell lines with cDNA probes from the genes RING1-5. Probes are: RING1, CEM15; RING2, CEM21; RING3, CEM41; RING5, γU5; RING4, 2.1. In the first four panels, T is T-LCL (Molt4), B is B-LCL (Mann), E is erythroleukaemia (K562), M is macrophage (U937). In the right hand panel B is B-LCL (Raji), T is T-LCL (Molt4); 1, 2 and 3 are colon carcinoma lines SW1222, SW620 and CC20 respectively in the absence (-) and presence (+) of gamma interferon. Approximate transcript sizes are shown in kilobases.

The RING3 cDNA probe CEM41 detected two large transcripts, a major species of about 3.5kb and a less abundant species of about 4.5kb, in all cell lines tested. The same transcripts were detected in γ-interferon induced and uninduced colon carcinoma cell lines SW1222, SW620 and CC20, indicating that RING3 expression is not γ-interferon inducible in these cell lines.

The RING4 cDNA probe 2.1 detected a transcript of about 2.8kb which was expressed at high levels in the B-LCL tested (Raji), and at lower levels in the T-LCL (Molt4). No RING4 expression was detected in resting colon carcinoma cell lines SW1222, SW620 and CC20, but the transcript was strongly induced in γ-interferon-treated cells. In addition, RING4 expression was found to be both α- and γ-interferon inducible in cells of the fibrosarcoma line HT1080 (J. John and G. Stark, personal communication).

The RING5 cDNA probe γU5 detected two equally abundant RNA species of about 2.0 and 2.3kb in all poly(A)+ samples tested.

         cDNA     RNA SIZE
GENE     PROBE    (APPROX.)      T     B     M     E     H     IFN

RING1    CEM15    1.6kb          +     +     +     +     +     NT
RING2    CEM21    1.1kb          +     +     -     -     NT    NT
RING3    CEM41    3.5+4.5kb      +     +     +     +     +     -
RING4    2.1      2.8kb          +     +     NT    NT    NT    +
RING5    γU5      2.0+2.3kb      +     +     +     +     +     NT

Table 5.4. Summary of expression patterns of the genes RING1-5. +, expression detected; -, no expression detected. T, T-LCL; B, B-LCL; M, macrophage U937; E, erythroleukaemia K562; H, HeLa; IFN, α- and γ-interferon inducible (Figure 5.14.).

5.7. Refinement of the physical map of the class II region

Blanck and Strominger (1990) have recently published a cosmid walk which extends 120kb in the vicinity of the DNA gene (Figure 5.15.a.). These cosmid clones were not linked to any other class II genes, and the precise position and orientation of the DNA gene therefore remained ambiguous. From the PFGE mapping data summarised in Figure 3.5. it was hypothesised that cluster 2 might fall within this cosmid contig. The probe 71Bs1 (Figure 5.6.b.) from the cluster 2 cosmid HPB.ALL 71 was hybridised to DNA from the published cosmids (a gift from George Blanck). This probe cross-hybridised to the overlapping cosmid clones U22, 027 and HA14 (Figure 5.15.b.). From the map shown in Figure 5.15.a., the cross-hybridising region was about 35kb from the DNA gene. When these cosmids were mapped with BssHII, it was found that they contained the same pattern of sites as HPB.ALL 71 (Figure 5.15.b.), providing additional evidence that cluster 2 is indeed contained within the cosmid walk. Since PFGE data show that the DNA gene is on the centromeric side of cluster 2 (Figure 3.5.) it follows that the DNA gene is about 35kb centromeric of RING3. Thus, the cosmid walk of Blanck and Strominger is oriented within the class II region such that the DNA gene has the same transcriptional orientation as DPA1 (i.e. the map shown in Figure 5.15.a. is oriented with the centromere towards the right). From the PFGE map it is estimated that the DNA gene is 75kb from the DPA1 gene.

The mapping of cluster 3 rare-cutter sites in the cosmid U15, which maps just centromeric to the DOB gene, anchors the cosmid walk of Blanck and Strominger (1988; Figure 5.9.a.) within the class II region relative to the PFGE map. It is estimated that the DOB gene is 160kb telomeric of the DNA gene.

A physical map of the class II region showing the positions of RING1-5 and the accurately determined positions of the class II genes relative to the clusters of rare-cutter sites is shown in Figure 5.16.
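The distances quoted in this chapter can be checked for mutual consistency with a short calculation. The sketch below is illustrative only: it treats each locus as a point, takes DPA1 as an arbitrary origin, and assumes that 'distal' means telomeric (positive) and 'proximal' means centromeric (negative). The names `OFFSETS_KB` and `place_loci` are hypothetical.

```python
# (locus, reference locus, offset in kb); positive offsets are telomeric.
# Values are the estimates quoted in sections 5.5.3. and 5.7.
OFFSETS_KB = [
    ("RING3", "DPA1", +110),  # RING3 is 110kb distal of DPA1
    ("DNA", "RING3", -35),    # DNA is about 35kb centromeric of RING3
    ("DOB", "DNA", +160),     # DOB is 160kb telomeric of DNA
    ("RING4", "DOB", -25),    # RING4 is 25kb proximal of DOB
]


def place_loci(offsets, origin="DPA1"):
    """Convert pairwise offsets into positions relative to the origin."""
    positions = {origin: 0}
    for locus, reference, offset in offsets:
        # Each reference must already have been placed by an earlier entry.
        positions[locus] = positions[reference] + offset
    return positions


positions = place_loci(OFFSETS_KB)
```

Under these assumptions the DNA gene falls 75kb from DPA1, which agrees with the PFGE estimate given above.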

Figure 5.15. (a) Map of cosmid clones in the region of the DNA gene (formerly DZA) from Blanck and Strominger (1990). (b) Autoradiographs of cosmid DNA digested with KpnI (i) or BssHII (ii) and hybridised with the 3.9kb BssHII fragment from the RING3 locus in cosmid HPB.ALL 71. '71' is cosmid HPB.ALL 71; other cosmids are shown in (a). Bars represent lambda/HindIII DNA size markers; from the top these are 23.1kb, 9.4kb, 6.6kb, 4.4kb, 2.3kb, 2.0kb and 0.56kb.

Figure 5.16. [...] the novel genes RING1-5 with respect to the previously known class II genes and the clusters of rare-cutter sites. Gene orientations are shown by arrows where known. The orientations of RING1, RING3 and RING4 were determined during nucleotide sequencing (Chapter 6).

5.8. Summary and discussion

The data presented here provide evidence for five new genes in the class II region of the human MHC. RING1, RING2 and RING5 map centromeric of the DP subregion, RING3 maps 35kb telomeric of DNA and RING4 maps 25kb centromeric of DOB (Figure 5.16.). The expression patterns and transcript sizes of these genes are dissimilar to those characterised for the classical class II genes. In addition, sequence data from each cDNA clone (see Chapter 6) indicate that none is related to the class II genes. Thus the discovery of RING3 and RING4 in the middle of the class II region demonstrates for the first time that the classical class II genes are interspersed with non-class II sequences. This finding has implications for the understanding of phenotypes associated with the class II region, since genes closely physically linked to the known class II genes are candidates for involvement in class II-associated diseases (section 1.5.). Furthermore, both RING3 and RING4 map in the interval believed to contain the gene responsible for a defect in antigen presentation by class I molecules (Cerundolo et al., 1990; section 1.6.2.). Elucidation of the function of these novel genes is clearly of importance. As a first step towards this goal, nucleotide sequencing of RING1-5 cDNA clones was undertaken (Chapter 6).

The finding of non-class II genes in the class II region also has implications for the understanding of the evolution of the region. The class II region of the MHC is thought to have evolved from a single primordial α/β gene pair through a series of duplication and divergence events (Klein, 1986; Bodmer et al., 1986). The similarity of the overall organisation of the classical class II genes of mouse and man has led to the hypothesis that these duplication and divergence events occurred before the radiation of rodents and primates (section 1.4.). To test whether the RING3 and RING4 genes have become inserted into the middle of the human class II region after the divergence of rodents and primates, or whether their position was established beforehand, the presence of the novel genes in the mouse class II region was tested. This work is described in Chapter 7.

The strategy used to detect novel genes, i.e. the identification of CpG islands, is likely to influence the type of gene discovered. In a search of the nucleotide sequence databases, CpG-rich regions were found to be associated with all ubiquitously expressed 'housekeeping' genes, but only some tissue-specific genes (Bird, 1986; Gardiner-Garden and Frommer, 1986). Consistent with this, expression of RING1-5 was detected in most cell lines tested (Figure 5.14. and Table 5.4.).
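For illustration, a CpG-rich region can be recognised in sequence data by comparing the observed number of CpG dinucleotides with the number expected from the base composition. The Python sketch below uses the commonly quoted thresholds (length of at least 200 nucleotides, G+C content above 50%, observed/expected CpG above 0.6); these thresholds and the function names are assumptions made for this example and were not applied to the data in this chapter.

```python
def cpg_stats(sequence):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = sequence.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count from base composition is (C x G) / N.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(sequence, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the three thresholds to a candidate region."""
    gc_fraction, obs_exp = cpg_stats(sequence)
    return len(sequence) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp
```

A G+C-rich sequence in which CpG is depleted, such as a repeat of CCCCGGGG, fails the observed/expected test even though its base composition passes.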

It is likely that there are further genes associated with the clusters of rare-cutter sites. Cosmid U15, for example, contains two NotI sites, both of which are unmethylated in genomic DNA. Given that most NotI sites are found in CpG islands (Table 3.1.), it would be surprising if there were no further genes in this region. Experiments to test this are currently underway in the laboratory.

In the present study, the methods used would specifically detect those genes associated with unmethylated CpG islands. CpG islands generally remain stably unmethylated in all tissues of the body, regardless of the site of expression of the associated gene (Bird, 1987). However, it has recently been reported that in some cell lines a proportion of CpG islands become irreversibly methylated and would therefore not be detectable in genomic DNA (Antequera et al., 1990). Such islands would, however, be detectable by clustering of rare-cutter sites in cloned DNA and could be tested for in the recently described yeast artificial chromosome clones which span the entire class II region (Ragoussis et al., 1991b). Such clones could also be used to test for the presence of non-island-associated genes. The recently developed 'exon trapping' technique, to test for the presence of splice junctions in a cloned region, is a potentially powerful method to identify genes regardless of whether or not they are associated with CpG islands (Duyk et al., 1990).

6. Characterisation of novel genes by nucleotide sequencing

6.1. Introduction

One method of obtaining valuable preliminary information about the possible functions of a gene is to determine the nucleotide sequence of that gene. A gene encoding a protein product should contain an open reading frame (ORF), the predicted amino acid sequence of which can be used to search protein sequence databases. Important clues as to the function of the protein may be obtained from significant matches (Doolittle, 1986). Since the RNA species detected by the probes for the novel genes RING1-5 were all found at increased frequency in poly(A)+ selected RNA samples compared to total RNA (Figure 5.14. and data not shown), it was likely that the transcripts of these genes were polyadenylated, and therefore encoded proteins. This chapter describes the complete nucleotide sequencing of cDNA clones for the RING1, RING3 and RING4 genes and the partial nucleotide sequencing of the RING2 and RING5 genes.

6.2. Nucleotide sequence of RING1

The nucleotide sequence of RING1 was obtained from the cDNA clone CEM15. The insert was estimated to be about 1.5kb long, and contained an internal XhoI site. The CDM8 vector contained XhoI sites on either side of the cloning site so that two insert fragments were liberated when CEM15 was digested with XhoI. These were subcloned into the Bluescript vector to give two constructs: pXB, containing the ~0.65kb XhoI insert fragment, and pXT, containing the ~0.85kb XhoI insert fragment. The two ends of each subclone were sequenced using oligonucleotide primers complementary to the Bluescript cloning site. Thereafter, the novel sequence data obtained were used to design new complementary oligonucleotides which were used as primers to extend the sequence further. Finally, primers were designed to sequence across the internal XhoI site of the CEM15 insert.

The CEM15 insert was 1439 nucleotides in length. The first 1116 of these encoded a single uninterrupted ORF of 372 amino acids, ending with a stop codon. The ORF was followed by a 3' untranslated region of 323 nucleotides, terminating in 6 A residues which were potentially the start of a poly(A) tail. The sequence AACAAA, which resembles the consensus poly(A) addition signal AATAAA, started 25 nucleotides upstream of the first of the 6 A residues. The AACAAA variant seems to be relatively inefficient in initiating 3' processing of RNAs, since in an in vitro 3' processing assay it was found to direct message cleavage and polyadenylation with only 4% of the efficiency of the consensus AATAAA (Sheets et al., 1990). Nevertheless, in a search of the nucleotide sequence databases, the AACAAA variant was detected in 0.8% of vertebrate cDNAs, and does therefore seem to be a naturally occurring poly(A)-addition signal, albeit rare (Sheets et al., 1990). It may be that additional nucleotides in the vicinity of the AACAAA sequence increase the efficiency of 3' processing of RING1 mRNAs in vivo. It is also possible that the 6 A residues are not part of the poly(A) tail, in which case the cDNA clone is truncated at the 3' end before the true tail.
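The position of the AACAAA element relative to the terminal A residues can be checked with a short script. The sketch below is a minimal illustration and is not part of the original analysis. The test string is the final 71 nucleotides of the sequence in Figure 6.1 (positions 1459-1529 in that figure's numbering), as transcribed here. The function names are hypothetical.

```python
import re

# Final 71 nucleotides of the RING1 sequence as read from Figure 6.1.
TAIL = ("AGAAGGTGAAGAACCCTCCCATTCACGCCCGCCTACCAACAACAAACGTGCTTT"
        "TTTCCTCTTTGAAAAAA")
OFFSET = 1459  # figure coordinate of TAIL[0]


def find_polya_signals(seq, signals=("AATAAA", "AACAAA")):
    """Return (signal, 0-based start) for every occurrence, including overlaps."""
    hits = []
    for signal in signals:
        for match in re.finditer(f"(?={signal})", seq):
            hits.append((signal, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


def terminal_a_run_start(seq):
    """0-based index of the first A in the run of A residues ending the sequence."""
    return len(seq.rstrip("A"))


if __name__ == "__main__":
    run_start = terminal_a_run_start(TAIL)
    for signal, start in find_polya_signals(TAIL):
        print(signal, "at", OFFSET + start,
              "distance to terminal A run:", run_start - start)
```

On this fragment the only hit is AACAAA, beginning 25 nucleotides upstream of the first of the terminal A residues, in agreement with the figure.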

The ORF began at the first codon of the CEM15 insert and there was no upstream in-frame stop codon. The CEM15 cDNA clone was therefore probably truncated at the 5' end. Since the length of the cDNA insert (1.44kb without a poly(A) tail) corresponded well to the transcript size detected in poly(A)+ RNA (1.6kb with a poly(A) tail), it was concluded that the CEM15 cDNA insert was probably missing only a few nucleotides at the 5' end. It was therefore decided to obtain the sequence of the corresponding genomic region to see whether a starting methionine residue and an in-frame upstream stop codon were present.

The two subclones of CEM15, pXT and pXB, were hybridised back to cosmid DNA to determine the orientation of the RING1 gene. pXT hybridised to the 4.1kb XhoI fragment (33X3) of cosmid HPB.ALL 33 (Figure 5.2.) while pXB hybridised to the adjacent 1.7kb XhoI fragment, 33X1 (data not shown). From the sequence data it was known that pXT encoded the 5' end of RING1, while pXB encoded the 3' end. Thus, the RING1 gene must be oriented with the 3' end towards the centromere. The 4.1kb genomic XhoI fragment, 33X3, containing the 5' portion of the RING1 gene, was subcloned from the cosmid HPB.ALL 33 into the Bluescript vector. A part of the resulting construct, p33X3, was sequenced using a primer complementary to the 5' end of the CEM15 insert and extending further in the 5' direction. The sequence so obtained revealed the presence of a methionine codon just 5 codons upstream of the first codon in the CEM15 insert (Figure 6.1.). The methionine codon was in a sequence context of CCATAATGG, corresponding well to the general initiation consensus of CCACCATGG (Kozak, 1986). An in-frame termination codon was found a further 21 nucleotides upstream. This methionine was judged likely to be the initiating codon of the RING1 gene product, and for the purpose of the following discussion it will be assumed that the sequence shown in Figure 6.1., with an ORF of 377 amino acids, is correct. However, because this sequence was obtained from a genomic clone rather than a cDNA clone, there is no evidence that it is present in a mature RING1 transcript. This region could be part of an intron in the RING1 gene, and subsequently spliced out during transcript processing.

The predicted protein product of the RING1 gene was rich in glycine in the C-terminal two-thirds (27%, compared to a mean of 7% in human proteins; Doolittle, 1986). The N-terminal third, in comparison, was relatively depleted of glycine. The protein contained 10 cysteine residues, 7 of which were clustered together in the N-terminal region. Another feature of interest was the motif KRPR, starting at residue 172. The same sequence is found in the nuclear localisation signal of polyoma virus large-T antigen (Richardson et al., 1986).

The amino acid sequence of the predicted protein product of the RING1 gene was used to search the OWL(9.1) protein sequence database (collaboration with Dr. Paul Freemont, Protein Structure Laboratory, ICRF). When the entire RING1 sequence was used as the query sequence, weak identity was detected with a number of database sequences, all of which were related to RING1 solely on the basis of their high glycine content. However, when the glycine-rich C-terminal two-thirds of the RING1 amino acid sequence were eliminated and the 139 N-terminal amino acids were used as the query sequence, a much more significant match was detected with seven other proteins (Table 6.1.).

GGC TGC TGT TTC TAA AAC CCC TTT CCC TCT AAC CCA CAC CAC CTT TCT ACT CAC  54

TGA TGC CTT CAG GAA GCC ATA ATG GAT GGC ACA GAG | ATT GCT GTT TCC CCT CGG  108
                            M   D   G   T   E   | I   A   V   S   P   R  11

TCA CTG CAT TCA GAA CTC ATG TGC CCT ATC TGC CTG GAC ATG CTG AAG AAT ACG  162
S   L   H   S   E   L   M   C   P   I   C   L   D   M   L   K   N   T  29

ATG ACC ACC AAG GAG TGC CTC CAC AGA TTC TGC TCT GAC TGC ATT GTC ACA GCC  216
M   T   T   K   E   C   L   H   R   F   C   S   D   C   I   V   T   A  47

CTA CGG AGC GGG AAC AAG GAG TGT CCT ACC TGC CGA AAG AAG CTG GTG TCC AAG  270
L   R   S   G   N   K   E   C   P   T   C   R   K   K   L   V   S   K  65

CGA TCC CTA CGG CCA GAC CCC AAC TTT GAT GCC CTG ATC TCT AAG ATC TAT CCT  324
R   S   L   R   P   D   P   N   F   D   A   L   I   S   K   I   Y   P  83

AGC CGG GAG GAA TAC GAG GCC CAT CAA GAC CGA GTG CTT ATC CGC CTG AGC CGC  378
S   R   E   E   Y   E   A   H   Q   D   R   V   L   I   R   L   S   R  101

CTG CAC AAC CAG CAG GCA TTG AGC TCC AGC ATT GAG GAG GGG CTA CGC ATG CAG  432
L   H   N   Q   Q   A   L   S   S   S   I   E   E   G   L   R   M   Q  119

GCC ATG CAC AGG GCC CAG CGT GTG AGG CGG CCG ATA CCA GGG TCA GAT CAG ACC  486
A   M   H   R   A   Q   R   V   R   R   P   I   P   G   S   D   Q   T  137

ACA ACG ATG AGT GGG GGG GAA GGA GAG CCC GGG GAG GGA GAA GGG GAT GGA GAA  540
T   T   M   S   G   G   E   G   E   P   G   E   G   E   G   D   G   E  155

GAT GTG AGC TCA GAC TCC GCC CCT GAC TCT GCC CCA GGC CCT GCT CCC AAG CGA  594
D   V   S   S   D   S   A   P   D   S   A   P   G   P   A   P   K   R  173

CCC CGT GGA GGG GGC GCA GGG GGG AGC AGT GTA GGG ACG GGG GGA GGC GGC ACT  648
P   R   G   G   G   A   G   G   S   S   V   G   T   G   G   G   G   T  191

GGT GGG GTG GGT GGG GGT GCC GGT TCG GAA GAC TCT GGT GAC CGG GGA GGG ACT  702
G   G   V   G   G   G   A   G   S   E   D   S   G   D   R   G   G   T  209

CTG GGA GGG GGA ACG CTG GGC CCC CCA AGC CCT CCT GGG GCC CCC AGC CCC CCA  756
L   G   G   G   T   L   G   P   P   S   P   P   G   A   P   S   P   P  227

GAG CCA GGT GGA GAA ATT GAG CTC GTG TTC CGG CCC CAC CCC CTG CTC GTG GAG  810
E   P   G   G   E   I   E   L   V   F   R   P   H   P   L   L   V   E  245

AAG GGA GAA TAC TGC CAG ACG AGG TAT GTG AAG ACA ACT GGG AAT GCC ACA GTG  864
K   G   E   Y   C   Q   T   R   Y   V   K   T   T   G   N   A   T   V  263

GAC CAC CTC TCC AAG TAC TTG GCC CTG CGC ATT GCC CTC GAG CGG AGG CAA CAG  918
D   H   L   S   K   Y   L   A   L   R   I   A   L   E   R   R   Q   Q  281

CAG GAA GCA GGG GAG CCA GGA GGG CCT GGA GGG GGC GCC TCT GAC ACC GGA GGA  972
Q   E   A   G   E   P   G   G   P   G   G   G   A   S   D   T   G   G  299

CCT GAT GGG TGT GGC GGG GAG GGT GGG GGT GCC GGA GGA GGT GAC GGT CCT GAG  1026
P   D   G   C   G   G   E   G   G   G   A   G   G   G   D   G   P   E  317

GAG CCT GCT TTG CCC AGC CTG GAG GGC GTC AGT GAA AAG CAG TAC ACC ATC TAC  1080
E   P   A   L   P   S   L   E   G   V   S   E   K   Q   Y   T   I   Y  335

ATC GCA CCT GGA GGC GGG GCG TTC ACG ACG TTG AAT GGC TCG CTG ACC CTG GAG  1134
I   A   P   G   G   G   A   F   T   T   L   N   G   S   L   T   L   E  353

CTG GTG AAT GAG AAA TTC TGG AAG GTG TCC CGG CCA CTG GAG CTG TGC TAT GCT  1188
L   V   N   E   K   F   W   K   V   S   R   P   L   E   L   C   Y   A  371

CCC ACC AAG GAT CCA AAG TGA CCC CAC CAG GGG ACA GCC AGA GGA AGG GGA CCA  1242
P   T   K   D   P   K  377

TGG GGT ATC CCT GTG TCC TGG TCT ATC ACC CCA GCT TCT TTG TCC CCC AGT ACC  1296

CCC AGC CCA GCC AGC CAA TAA GAG GAC ACA AAT GAG GAC ACG TGG CTT TTA TAC  1350

AAA GTA TCT ATA TGA GAT TCT TCT ATA TTG TAC AGA GTG GGG CAA AAC ACG CCC  1404

CCA TCT GCT GCC TTT TCC ATT GCC CTG CAA CGT CCC ATC TAT ACG AGG TGT TGG  1458

AGA AGG TGA AGA ACC CTC CCA TTC ACG CCC GCC TAC CAA CAA CAA ACG TGC TTT  1512

TTT CCT CTT TGA AAAAA  1529

Figure 6.1. Nucleotide sequence of the RING1 gene. Nucleotides after the vertical line were obtained from the cDNA clone CEM15. Nucleotides before the line were obtained from a genomic subclone as described in the text. In the original figure the clustered cysteine and histidine residues (C19, C22, C35, H37, C40, C43, C55 and C58) are circled, and the potential nuclear localisation signal (KRPR, residues 172-175) is underlined.

Figure 6.2. [Alignment not recoverable from the source; surviving caption text follows.] ... are conserved, or conservatively substituted, between proteins and constitute the [motif] described in the text. The identities of the homologous genes are shown in Table 6.1. and described in the text.

GENE NAME    ORGANISM                  TOTAL AMINO ACIDS    MOTIF START

RING1        Human                     377                  19
RAG-1        Human                     1043                 290
rfp (ret)    Human                     513                  16
rpt-1        Mouse                     353                  15
RAD18        S. cerevisiae             487                  29
IE110        Herpes simplex virus      775                  116
VZ61         Varicella-zoster virus    467                  19
CG30         Baculovirus               264                  8

Table 6.1. Homologues of RING1 detected by searching the protein databases with the N-terminal 139 amino acids of the putative RING1 product. The total number of amino acids in the protein product of each homologue, and the position of the first cysteine residue of the conserved motif, are shown.

When these proteins were aligned to maximise identity between them, it was apparent that they all shared a previously undescribed cysteine-histidine motif, with a consensus sequence of:

-C-X-(I,V)-C-X(11-30)-C-X-H-X-(F,I,L)-C-(I,L,M)-X(10-18)-C-P-X-C-

(Figure 6.2.). In each protein, the motif was always contained within the N-terminal third, and frequently started very close to the N-terminus (Table 6.1.).
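A consensus of this kind translates directly into a regular expression, which allows it to be checked against the RING1 sequence in Figure 6.1. The sketch below is illustrative only. The consensus as printed above contains six cysteines, whereas the RING1 N-terminal region contains seven clustered cysteines, with the sequence C-S-D-C-I following the histidine. The second pattern therefore assumes an additional -X2-C- element after the fourth cysteine. This is an assumption made for the purpose of the example and not a statement of the consensus intended in the text. The names are hypothetical.

```python
import re

# First 65 residues of the predicted RING1 product, read from Figure 6.1.
RING1_N_TERM = (
    "MDGTEIAVSPRSLHSELMCPICLDMLKNT"
    "MTTKECLHRFCSDCIVTALRSGNKECPTCRKKLVSK"
)

# Consensus exactly as printed in the text.
PRINTED = re.compile(r"C.[IV]C.{11,30}C.H.[FIL]C[ILM].{10,18}CP.C")

# Assumed variant with an extra X2-C element (hypothetical reconstruction).
ASSUMED = re.compile(r"C.[IV]C.{11,30}C.H.[FIL]C.{2}C[ILM].{10,18}CP.C")


def motif_start(pattern, protein):
    """Return the 1-based residue number at which the motif starts, or None."""
    match = pattern.search(protein)
    return None if match is None else match.start() + 1


if __name__ == "__main__":
    print("printed consensus:", motif_start(PRINTED, RING1_N_TERM))
    print("assumed consensus:", motif_start(ASSUMED, RING1_N_TERM))
```

With the assumed pattern the motif begins at residue 19, the position listed for RING1 in Table 6.1. The printed pattern finds no match in this stretch of sequence.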

The motif is reminiscent of those found in zinc-binding transcription factors, in which cysteine residues, or a combination of cysteine and histidine residues, co-ordinate zinc ions, thereby stabilising the structure of a sequence-specific nucleic-acid binding domain (Berg, 1990). One class of zinc-binding transcription factors includes the well-studied archetypal 'zinc-finger' protein TFIIIA from Xenopus laevis, which has zinc-dependent sequence-specific DNA and RNA binding activity. It is now known that there are large families of TFIIIA-related genes in the genomes of both vertebrates and invertebrates which include developmentally important loci such as Krüppel in Drosophila melanogaster (Dressler and Gruss, 1988). The consensus amino acid sequence of the zinc-binding domain of these proteins is:

-(Tyr,F)-X-C-X(2/4)-C-X3-F-X5-L-X2-H-X(3/4)-H-X-

A single protein may contain several of these domains. The pair of cysteine residues and the pair of histidine residues within each domain co-ordinate a single zinc ion, which is thought to induce folding of the domain to form an α-helix which interacts with DNA (Lee et al., 1989b; Berg, 1990).

A second well-characterised class of transcription factors, the steroid-hormone receptors, also have zinc-dependent sequence-specific DNA binding domains. Each receptor molecule contains one zinc-binding domain with a consensus sequence of:

-C-X2-C-X13-C-X2-C-X(15-17)-C-X5-C-X9-C-X2-C-X4-C-

The first two pairs of cysteine residues co-ordinate one zinc ion while the next two pairs co-ordinate a second ion. The structure thus created is believed to contain a short α-helical region around the fourth cysteine residue which makes sequence-specific contacts with the DNA double helix (Hard et al., 1990).

A number of other distinct cysteine-histidine motifs have now been recognised in other proteins and, although these are generally less well characterised, some of them are known to be responsible for a zinc-dependent nucleic acid-binding function (Berg, 1990). It is therefore very tempting to speculate that the RING1 motif may also be involved in metal-dependent nucleic acid binding.

The known functions of some of the RING1 homologues detected in the databases are highly suggestive of a role for the motif in DNA binding and, in some cases, transcriptional activation. The IE110 gene of herpes simplex virus type 1 encodes a nuclear phosphoprotein which is required for transcriptional activation of later virus genes (Perry et al., 1986). The mouse rpt-1 gene is expressed in resting CD4+ T-lymphocytes and encodes a nuclear protein which down-regulates expression of the interleukin-2 receptor α chain gene (Patarca et al., 1988). Normal expression of the human rfp gene is up-regulated during spermatogenesis, but the 5' end of this gene, including the region encoding the cysteine-histidine motif, has also been identified in a fusion gene, ret, which can transform NIH 3T3 cells (Takahashi and Cooper, 1987; Takahashi et al., 1988). The yeast RAD18 gene is required for postreplication repair of DNA damaged by UV light or other agents. The human RAG-1 gene activates DNA recombination at V(D)J sequences found at the T-cell receptor and immunoglobulin gene loci, a process which leads to the generation of diversity in the protein products of these genes (Schatz et al., 1989). The functions of products of the varicella-zoster virus gene VZ61 and the baculovirus gene CG30 are unknown (Davison and Scott, 1986; Thiem and Miller, 1989). Despite the tantalising suggestion of a DNA binding function for these gene products, there is as yet no evidence that any of the RING1 homologues bind DNA or zinc, or even that the cysteine-histidine motif is necessary for the function of these proteins. However, based on the speculation that the phylogenetically conserved cysteine and histidine residues of the RING1 motif are involved in co-ordinating divalent metal ions and creating a DNA-binding domain, a hypothetical structure for such a complex has been proposed and is presented in Figure 6.3. (Dr. Paul Freemont, personal communication).

One approach to establishing the biological function of the RING1 motif involves the PCR-based cloning of human genomic sequences which bind to the cysteine-histidine-rich region (Kinzler and Vogelstein, 1989). This method is currently being used (collaboration with Dr. Ruth Lovering) to determine whether the motif has a sequence-specific DNA binding activity. The possible role of the RING1 gene product in MHC-associated phenotypes is discussed in Chapter 9.

Figure 6.3. (a) Schematic illustration of the co-ordination of zinc ions by two TFIIIA-type zinc fingers. (b) Schematic illustration of the hypothetical co-ordination of metal ions by the RING1-related motif.

RING2 CEM21 5' end

GCC CGG TCC GGC GTG TTC TGT CCT ACC TCA GGA GCG GGG AGC GGC ATC GGC CGA  54
A   R   S   G   V   F   C   P   T   S   G   A   G   S   G   I   G   R  18

GCG GTC AGT GTA CGC CTG GCC GGA GAG GGG GCC ACC GTA GCT GCC TGC GAC CTG  108
A   V   S   V   R   L   A   G   E   G   A   T   V   A   A   C   D   L  36

GAC CGG GCA GCG GCA CAG GAG ACG GTG CGG CTG CTG GGC GGG CCA GGG AGC AAG  162
D   R   A   A   A   Q   E   T   V   R   L   L   G   G   P   G   S   K  54

GAG GGG CCG CCC CGA GGG AAC CAT GCT GCC TTC CAG GCT GAC GTG TCT GAG GCC  216
E   G   P   P   R   G   N   H   A   A   F   Q   A   D   V   S   E   A  72

AGG GCC GCC AGG TGC CTG CTG GAA CAA GTG CAG GCC TGC TTT TCT CGC CCA CCA  270
R   A   A   R   C   L   L   E   Q   V   Q   A   C   F   S   R   P   P  90

TCT GTC GTT GTG TCC TGT GCG GGC ATC ACC CAG GAT GAG TTT CTG CTG CAC ATG  324
S   V   V   V   S   C   A   G   I   T   Q   D   E   F   L   L   H   M  108

TCT GAG GAT GAC TGG GAC AAA GTC ATA GCT GTC AAC CTC AAG GTG GCG ATC TCT  378
S   E   D   D   W   D   K   V   I   A   V   N   L   K   V   A   I   S  126

GAA CCT CGC GAG TTT GGC CCC TTA GCT GGG AGG AGT TGG AGG AGG GCT GTC  429
E   P   R   E   F   G   P   L   A   G   R   S   W   R   R   A   V  143

Figure 6.4. Nucleotide and predicted amino acid sequence of the 5' end of the RING2 cDNA clone CEM21. The sole methionine codon within the open reading frame is shown in bold type. The termination codons in the other two reading frames are underlined.

6.3. Partial nucleotide sequence of RING2

The two ends of the RING2 cDNA clone CEM21 were sequenced using primers complementary to the cDNA vector CDM8 on either side of the cloning site. One end (primed with the oligonucleotide CDM8-F) clearly contained a long tract of A residues, characteristic of a poly(A) tail, but could not be read further, presumably due to secondary structure created by the homopolymeric tail. The nucleotide sequence at the other end of CEM21, presumably near the 5' end of the gene, was obtained by priming the sequencing reaction with the oligonucleotide CDM8-B. This sequence was extended further in the 3' direction by priming the reaction with an oligonucleotide containing nucleotides 86-107 (Figure 6.4.). A total of 429 nucleotides were sequenced in this way (Figure 6.4.). This region potentially encoded a single long ORF of 143 amino acid residues, including a single methionine at position 108. This methionine codon was not in the optimal initiation consensus sequence (CCACCATGG: Kozak, 1986). Both other reading frames contained termination codons throughout the region sequenced. It is not possible at this stage to tell whether the ORF detected at the 5' end of the cDNA clone is that encoding the RING2 protein product (in which case the CEM21 cDNA clone is truncated at the 5' end), or whether it occurs by chance in the 3' untranslated region of the RING2 gene. This question will be resolved once the complete nucleotide sequence of CEM21 has been determined. The complete amino acid sequence of the ORF was used to search the PIR and Swiss-Protein sequence databases but no significant match was found.
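The reading-frame analysis described above can be reproduced with a three-frame translation. The sketch below is a minimal illustration using the standard genetic code and the 429 nucleotides of Figure 6.4. Several bases are illegible in the source at the positions of the underlined termination codons. They are read here as T or G, on the assumption that they complete the TGA and TAG codons of the other two frames and encode the amino acids shown in the figure. The names are hypothetical.

```python
# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# 5' end of CEM21 as read from Figure 6.4 (illegible bases assumed, see text).
CEM21_5P = "".join("""
GCC CGG TCC GGC GTG TTC TGT CCT ACC TCA GGA GCG GGG AGC GGC ATC GGC CGA
GCG GTC AGT GTA CGC CTG GCC GGA GAG GGG GCC ACC GTA GCT GCC TGC GAC CTG
GAC CGG GCA GCG GCA CAG GAG ACG GTG CGG CTG CTG GGC GGG CCA GGG AGC AAG
GAG GGG CCG CCC CGA GGG AAC CAT GCT GCC TTC CAG GCT GAC GTG TCT GAG GCC
AGG GCC GCC AGG TGC CTG CTG GAA CAA GTG CAG GCC TGC TTT TCT CGC CCA CCA
TCT GTC GTT GTG TCC TGT GCG GGC ATC ACC CAG GAT GAG TTT CTG CTG CAC ATG
TCT GAG GAT GAC TGG GAC AAA GTC ATA GCT GTC AAC CTC AAG GTG GCG ATC TCT
GAA CCT CGC GAG TTT GGC CCC TTA GCT GGG AGG AGT TGG AGG AGG GCT GTC
""".split())


def translate(seq, frame=0):
    """Translate complete codons of seq starting at offset frame (0, 1 or 2)."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    for frame in range(3):
        protein = translate(CEM21_5P, frame)
        print(f"frame {frame + 1}: {len(protein)} codons, "
              f"{protein.count('*')} stop codon(s)")
```

Under these assumptions the first frame is open for all 143 codons with a single methionine at position 108, while each of the other two frames contains termination codons.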

6.4. Nucleotide sequence of RING3

Four RING3 cDNA clones were isolated (Figure 5.13.). CEM32 and CEM35 were isolated by screening the library with the genomic BssHII fragment 71Bs1, while CEM41 and CEM44 were isolated by screening the library with an adjacent BssHII fragment, 71Bs3. Theoretically the two genomic probes could encode part of two separate genes. However, when the cDNA clones were mapped with restriction enzymes it was apparent that they were derived from the same gene because they all shared several sites in common (Figure 6.5.). Together, the four cDNAs provided a useful nested set of clones within the RING3 gene. The nucleotide sequence of the two ends of each cDNA was determined, using CDM8-F and -B primers, in order to obtain the sequence of RING3 at several points.

The sequence data from the truncated 3' ends of the CEM44 and CEM41 cDNA clones revealed potential ORFs whose predicted amino acid sequences were used to search the databases. Both sequences identified the product of fsh, a Drosophila homeotic gene, as the top match. Because of the striking homology between fsh and these partial sequences from RING3, and the extremely interesting phenotypes of mutations in the fsh locus (see below), it was decided to determine the complete nucleotide sequence of the longest RING3 cDNA clone, CEM32.

The nucleotide sequence of CEM32 was obtained by sonicating the CEM32 insert and blunt-end ligating the resulting short fragments into M13. The M13 clones were then sequenced and aligned to build up the sequence of the CEM32 insert (collaboration with Dr. Stephan Beck). The nucleotide sequence of the 4004-nucleotide CEM32 insert is shown in Figure 6.6. The sequence at the 5' end of the CEM32 insert exactly matched that previously determined at the 5' end of the CEM35 insert, beginning at the same nucleotide. Therefore it is very likely that this is the true 5' end of the RING3 transcript. The CEM32 insert contained a single long ORF. The first potential methionine residue within this ORF was not in a good context for initiation of translation, while the second was in the context CCACAATGG, which differs by only one nucleotide from the consensus of CCACCATGG (Kozak, 1986). The second methionine was therefore considered most likely to be the initiating codon of the RING3 protein. Use of the second methionine would give a protein product of 754 amino acids and a 5' untranslated region of 1153 nucleotides (Figure 6.6.). The ORF was followed by a 3' untranslated region of 589 nucleotides which was truncated before the poly(A) tail but which contained a sequence exactly matching the consensus poly(A) addition signal (AATAAA) just before the end (starting at nucleotide 3991).
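The comparison of initiation contexts with the consensus CCACCATGG amounts to a position-by-position mismatch count. The sketch below applies it to the two contexts quoted in this chapter, CCATAATGG for RING1 and CCACAATGG for RING3. The additional check on positions -3 and +4 reflects the commonly cited view that a purine at -3 and a G at +4 are the most influential positions. That check is included here as an assumption and is not a criterion stated in the text. The function names are hypothetical.

```python
KOZAK_CONSENSUS = "CCACCATGG"   # the ATG occupies 0-based indices 5-7


def kozak_mismatches(context, consensus=KOZAK_CONSENSUS):
    """Return the 0-based indices at which a 9-nucleotide context differs."""
    context = context.upper()
    if len(context) != len(consensus):
        raise ValueError("context must be the same length as the consensus")
    return [i for i, (a, b) in enumerate(zip(context, consensus)) if a != b]


def has_strong_core(context):
    """Assumed rule of thumb: purine at -3 (index 2) and G at +4 (index 8)."""
    context = context.upper()
    return context[2] in "AG" and context[8] == "G"


if __name__ == "__main__":
    for gene, context in (("RING1", "CCATAATGG"), ("RING3", "CCACAATGG")):
        print(gene, context, "mismatches at", kozak_mismatches(context),
              "strong core:", has_strong_core(context))
```

The RING1 context differs from the consensus at two positions and the RING3 context at one. Both retain the A at -3 and the G at +4.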


primers which were used to obtain the nucleotide sequence at the end of each cDNA clone as described in the text. G 1 CGG ACC ACA OCT CAC CCA ATT QGC TTG GAG ATG TGG CGG GTT GCC ACT TCC CTO TGG GTC TCT GCC GCA CTC TTC TGC CTG GTG ACT GAC ACC TTG GAA ATG AAG TTT 1 03 ATG ACG TCA TCG TTC CGG CTG OCC AAT ACA AAA AGC TCC CGC OCA GAG GTG TTC CTT CCC CTT CGA CTC AGC TTC TTC ACC CGC GTG AGC GAG CGC GCC CGC GCG GAG 217 GCG GTC QGC AAA ATC TCA AGC AGG GTG GCG CGC ATG AGC QGC GAA GCT CCT CCT OCC CGC CTA TAT ATA AAG GGC TGG CGC GGG GCT CGG OGG CGC CAT TTC CTG CTG 32 5 GAG TGG ACC AGC CTC TAG AAC GAG CTG GAG GAT TCT OCC TAC CGA TAC ACA OCC TTC GAG TCG TCC OGG OCC GCC ATT ACA ATC CAC CTC CAT CCG CTT GGA AAT GGC 433 CTT CGT CCC GCC CTA TGA CTG GTC OCA GCG QGC ACT ACA GAC CCC TTA GAA OCC OCT OGA OCT OCC CTT TTT CGG OCC CCG CCC AAT CCT CGG AGT CTG TCC ACC CCC 541 TCT ACT CCG CCC TCA AGA OGA TTT CAA ACA TGG AGG CGG CGG CTC CCT AAA CCA CTT TTC GTG TTC ATC GCC CTC CAT CCG AGA TCG AAA CGG GAC CTC GTC QGC CCC € 4 3 GTA OGA OCC CGA CAA GAA GAG OCA ATC CCT OCA GAC CAA CAG CGG OCT ATA TTG ACG ACG GTG TCT GAG ATC OGG GAC CGT CTT TTG AAG AGT CAG TCC CTC CTT ACT 757 TGC CCC CCT CAG CTG AGG CCG CCG CCA TTT TCT TGC TGT GCG CCG TCT OCA GAG OCC OCC AAG CTG CCC OGA OCT CTC CGA GAG OCC CCA AAG AGA CTC CTT TCG TOC i€ 5 CGG OCA OCC ACG GGG TTT GTC OCC TGG ACG OCC AAG AGG AAC QGC CTC OCC CCA ACT TAG OGG GTT ATG CTG GAC CGG GCG GTG AX GCA ACC GAG GCC M X CGG ACT 973 TTC CGC o o c TCA QGO CAG COC OGG TTC CTT GCG GTC AAC|lATClCTC CAA AAC GTG ACT CCC CAC AAT AAG CTC CCT OGG GAA OGG AAT OCA GGG TTC CTG OGG CTG GOC 10S1

CCA GAA OCA GCA OCA OCA OOG AAA AGG ATT CCA AAA OCC TCT CTC TTG TAT GAG QQC TTT GAG AGC OCC ACA ATG OCT TCG GTG OCT OCT TTG CAA CTT ACC CCT GCC 1 1 1 3 M A a V P A 1 Q LTPA 12

AAC CCA OCA CCC CCG GAG GTG TCC AAT CCC AAA AAC OCA OGA CGA GTT ACC AAC CAG CTG CAA TAC CTA CAC AAG CTA GTG ATG AAG OCT CTG TGG AAA CAT CAG TTC 1 2 3 7 II r r r r S V S ■ P XXPGA V TX 0 L Q Y 1* a XVVMXA L XX a Q F 4 3

OCA TGG OCA TTC CGG CAG OCT GTG GAT OCT GTC AAA CTG OCT CTA CCG GAT TAT CAC AAA ATT ATA AAA CAG CCT ATG GAC ATG OCT ACT ATT AAC AGG AGA CTT GAA 1 4 0 5 A M P r a Q P V DA VKLGLP DY a X I IX Q P M 0 M G T X X AA L E • 4

AAC AAT TAT TAT TCG OCT OCT TCA GAG TCT ATG CAA GAT TTT AAT ACC ATG TTC ACC AAC TCT TAC ATT TAC AAC AAG OCC ACT GAT GAT ATT GTC CTA ATG GCA CAA 1 5 1 3 X a Y Y X A A SEC M Q D r X T M FTXCYX YX X p T DD I VLHA Q 120

ACG CTC GAA AAG ATA TTC CTA CAG AAG GTT OCA TCA ATG OCA CAA GAA GAA CAA GAG CTG GTA GTG ACC ATC CCT AAG AAC ASC CAC AAC AAG GGG GCC AAG TTC GCA l € 2 l T LEKIFL Q XV ASM P Q EE Q ELVV T I P XX 3 a XX G AXL A ISC

GCG CTC CAG GGC ACT CTT MX AGT OCC CAT CAG GTG OCT OCC CTC TCT TCT GTG TCA CAC K A OCC CTG TAT ACT CCT CCA OCT GAG ATA CCT ACC ACT GTC CTC AAC 1 7 2 3 AL Q G S V T 3 A a Q V P A V 3 a V a a T ALYTPPPEIPTT V L X 192

ATT CCC CAC CCA TCA GTC ATT TCC TCT OCA CTT CTC AAG TCC TTG CAC TCT OCT OGA CCC CCG CTC CTT OCT GTT ACT OCA OCT CCT CCA OCC CAG CCC CTT GCC AAG 1 1 3 7 I p I PS V I s a P LLX s L a a AG p PLLAVT AAPP A Q p LA X 2 20

AAA AAA QGC CTA AAG CGG AAA OCA GAT ACT M X ACC OCT ACA OCT ACA GCC ATC TTG OCT CCT OCT TCT OCA OCT AGC CCT OCT OGG ACT CTT GAC OCT AAG GCA GCA 194 5 KKGVK AX ADT TTPTPTA I 1* A PG 3 P A a P P G a 1> E P X A A 2 €4

CGG CTT OCC CCT ATG CGT AGA GAG AGT GGT CGC OCC ATC AAG OCC CCA CGC AAA GAC TTG CCT GAC TCT CAG CAA CAA CAC CAG AGC TCT AAG AAA GGA AAG CTT TCA 2 0 5 3 A L p P M A A E S G A p I X p P A X D L P 0 a QQQ a 0 3 3 XXGX 1. 3 3 0 0

GAA CAC TTA AAA CAT TCC AAT OCC ATT TTG AAG GAG TTA CTC TCT AAG AAG CAT OCT GCC TAT GCT TGG CCT TTC TAT AAA CCA CTG GAT OCT TCT GCA CTT GGC CTG 21C1 E 0 L K a CNG I L X EL 1* a X X a AA YA X PFYXPVDA3ALGL 33C

CAT GAC TAC CAT GAC ATC ATT AAG CAC OCC ATG GAC CTC AGC ACT CTC AAG CGG AAG ATG GAG AAC OCT GAT TAC OGG GAT OCA CAG GAG TTT GCT GCT GAT GTA COG 22C 9 a D Y a DII K a P MD L S T V XAXM EX A 0 Y A D A Q E F A A D V A 3 72

CTT ATG TTC TCC AAC TGC TAT AAG TAC AAT CCC CCA GAT CAC GAT CTT GTG OCA ATG OCA CGA AAC CTA CAG GAT CTA TTT GAG TTC OGT TAT GCC AAG ATG CCA GAT 2 3 7 7 L M F 3 X CY X Y X PPD a D VV A M A a X L Q 0 V F EF a YA KMP 0 4 o a

GAA CCA CTA GAA CCA OGG CCT TTA OCA GTC TCT ACT OCC ATG CCC OCT QGC TTG OCC AAA TCG TCT TCA GAG TCC TCC ACT GAG GAA AGT AGC AGT GAG M X TCC TCT 2 4 1 5 EPLEPGP L P V 3 T A H p P OLAX a a a E 3 a 3 E E 3 3 3 E 3 3 3 444

GAG GAA GAG GAG GAG GAA GAT GAG GAG GAC GAG GAC GAA GAA GAG ACT GAA MX TCA GAC TCA GAG GAA GAA ACG OCT CAT CGC TTA OCA GAA CTA CAG GAA CMS CTT 2 5 9 3 EEEEEEDEEO EEEE E a E a a 0 3 EE E a A a a L A E L Q E 0 L 410

CGG OCA CTA CAT GAA CAA CTG OCT OCT CTC TCC CAG OCT CCA ATA TCC AAG OCC AAG ACG AAA AGA GAG AAA AAA GAG AAA AAG AAG AAA CGG AAC GCA GAG AMS CAT 2 701 AAV a E Q L A A X. 3 Q G P 1 a K p X a X a EXX E K I X X X a A E X B 5 1 €

CGA QGC CGA OCT OGG OCC GAC GAA GAT GAC AAC GGG CCT M X OCA OCC CCC OCA CCT CAA CCT AAG AAG TCC AAG AAA GCA AGT OCC ACT OGG OCT O X AGT GCT GCT 1 1 0 9 AGAA OADEDD XGP a A p X P P Q PXX a XXA 3 G 3 G G G 3 A A 5 52

TTA GGC OCT TCT QGC TTT OGA OCT TCT OGA OGA AGT OCC ACC AAG CTC CCC AAA AAC OCC MCA AAG ACA OCC CCA CCT OCC CTC CCT MCA GGT TAT GAT TCA GAG GAG 2 9 1 7 L G P 3 G F G P a G G a GTXL p XXATXTAPPALPTG Y 0 3 E E 59 3

GAG GAA GAC AGC M X CCC ATG ACT TAC GAT GAG AAC OGG CAC CTG AGC CTG GAC ATC AAC AAA TTA CCT OGG GAC AAG CTG QGC CGA CTT CTG CAT ATA ATC CAA G X 3 0 2 5 E E E 3 A p M a YD EXX Q L a L D 1 a XLPGEXLGaVV B I 1 Q A €24

AGG GAG CCC TCT TTA CCT GAT TCA AAC CCA GAA GAG ATT GAG ATT GAT TTT GAA ACA CTC AAG CCA TCC MCA CTT AGA GAG CTT GAG CGC TAT CTC CTT TCC TGC CTA 3 1 3 3 A ‘E p 3 L A D a a P E E I E 1 D r ET 1. X P 3 T L a E LE a Y V L 3 C L €€0

OCT AAG AAA CCC CCG AAG OCC TAC M X ATT AAG AAG OCT GTG OGA AAC MCA AAG GAG GAA CTG GCT TTG GAG AAA AAC CCG GAA TTA GAA AAG CGG TTA CAA GAT GTC 3 2 4 1 A X X p A X P YTI XXPVG XT X EELA L E X X a E LE X a L Q DV € 9 €

AGC GGA CAG CTC AAT TCT ACT AAA AAG CCC CCC AAG AAA GCG AAT GAG AAA ACA GAC TCA TCC TCT OCA CAG CAA CTA OCA CTG TCA CGC CTT MX OCT TCC A X T X 3 3 4 9 3 G O L X 3 T X X p p X XA X E X T E 3 3 3 A QQ VAV 3 a L 3 A 3 3 3 7 3 2

AGC TCA GAT TCC AGC TCC TCC TCT TCC TCG TCG TCG TCT TCA GAC AC C AGT GAT TCA GAC TCA OCC TAA OGG CTC AGG CCA GAT GGG OCA OCA AGG CTC CGC A X A X 3 4 5 7 3 3 D 3 a a a a a 3 3 3 3 3 0 T 3 D 3 D a G 7 54

OGA OCC CTA GAC CAC CCT OCC OCA CCT OCC OCT TCC CCC TTT OCT CTG a c a CTT CTT CAT CTC ACC CCC CCC CTG CCC CCC TCT AGG MCA GCT O X TCT OCA GTG G X 35C 5 GAG OGA TGC MX GAC ATT TAC TGA AGG M X GAC ATG GAC AAA ftCA ACA TTC AAT TCC CAG CCC CAT TGG OGA CTG ATC TCT TGG ACA CAG M X CCC CAT TCA AAA T X 3C 73 GCC AGG OCA AGG CTG GGA GTG TCC AAA OCC CTG ATC TGG ACT TAC CTG M X CCA TAG CTG OCC TAT TCA CTT CTA ACG GCC CTG TTT TGA GAT TGT TTG TTC TAA TCT 3 7 3 1 ATT TTA M X TAG CTA ACG CTC OGG GCA OGG MX QGC CCT GCT OCC CTC JXC CTC CAT OGG GAG OGA ACA AGG OGG AGC TCT TTT CTT ACG TTG ATT TTT TTT TTT CTA 3499 CTC TGT TTT CCC TTT TTC CTT CCG CTC CAT TTG OGG OCC TGG GCG TTT CAG TCA TCT CCC CAT TTG GTC OCC TGG ACT CTC TTT GTT GAT TCT AAC TTG TAA ATA AAG 3 9 9 7 AAA ATA T 4 0 0 4

Figure 6.6. Complete nucleotide sequence of the RING3 cDNA clone CEM32. The predicted amino acid sequence is shown, starting at the methionine residue in the best context for initiation. An additional upstream methionine codon is boxed and the in-frame upstream termination codon is underlined. The potential nuclear localisation signal within the predicted amino acid sequence is boxed. The two regions of internal homology are indicated by bold underlining.

The predicted protein product contained a potential nuclear localisation signal, KKKRK, starting at residue 508. In addition it contained regions which were enriched in one particular amino acid. Thus, a highly acidic glutamate-rich region was present between residues 445-468, a basic lysine-rich region was present between residues 496-516 and a serine-rich region was present between residues 729-753. A further feature of interest was an internal duplication in the amino acid sequence (Figure 6.6.). Two stretches of 26 amino acids starting at positions 50 and 322 shared 19 residues in common (73% identity).

When the amino acid sequence of the predicted protein product of the RING3 gene was used to search the protein sequence databases a highly significant match was obtained with the 1106-amino acid product of the 5.9kb transcript of the Drosophila melanogaster gene female sterile homeotic (fsh; Haynes et al., 1989). The amino acid sequence identity between the two proteins is shown in Figure 6.7. This alignment reveals three domains which are strikingly conserved. In one stretch of 120 amino acids (positions 22-142 in RING3), 91 residues (76%) were identical in the two proteins and a further 9 were highly conservative substitutions. The human protein seems to have two large regions deleted relative to the Drosophila protein, which accounts for its smaller size. Interestingly, the two regions which appear to have been deleted from RING3 contained proposed transmembrane regions in fsh. It had previously been suggested that the fsh protein may be an integral membrane protein (Haynes et al., 1989). However, the absence of these regions in the human protein raises the possibility that they are not transmembrane domains in fsh. It may be significant that many of the stretches of amino acids which were proposed to be membrane-spanning domains were rich in alanine, the hydrophobicity of which is controversial (Haynes et al., 1989).
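The identity figures quoted in this section are simple ratios of identical positions to aligned positions. The sketch below shows one way of computing them from a pair of gapped sequences and of drawing the colon line used to mark identities in an alignment. The two short sequences are invented for illustration and are not taken from RING3 or fsh. The function names are hypothetical.

```python
def identity(a, b, gap="-"):
    """Return (identical positions, aligned positions) for two gapped strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    identical = sum(x == y for x, y in aligned)
    return identical, len(aligned)


def match_line(a, b, gap="-"):
    """Colon under each identical, non-gap column; space elsewhere."""
    return "".join(":" if x == y and x != gap else " " for x, y in zip(a, b))


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


if __name__ == "__main__":
    top, bottom = "MKV-LA", "MRVQLA"      # invented example sequences
    same, total = identity(top, bottom)
    print(top)
    print(match_line(top, bottom))
    print(bottom)
    print(f"{same}/{total} identical ({percent(same, total)}%)")
    # Ratios quoted in the text: 91 of 120 residues and 19 of 26 residues.
    print(percent(91, 120), percent(19, 26))
```

Gapped columns are excluded from the denominator in this sketch. Other conventions, such as dividing by the length of the shorter sequence, would give slightly different values.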

A further similarity between RING3 and fsh is that both give two major transcripts. The RING3 transcripts are 3.5kb and 4.5kb while the fsh transcripts are 5.9kb and 7.6kb. The fsh transcripts arise through alternative RNA processing and the predicted product of the 7.6kb transcript contains the same 1106 amino acids encoded by the 5.9kb transcript with an additional 946 residues at the C-terminal end (Haynes et al., 1989). The CEM32 insert is thus related to both fsh gene products but seems to be analogous to the 5.9kb transcript because the ORF terminates at the same position (Figure 6.7.). It is not clear at this stage whether CEM32 is derived from the 3.5kb or 4.5kb RING3 transcript. It will be necessary to characterise additional cDNA clones and the genomic structure of the RING3 locus in order to determine the relationship between the RING3 transcripts and the fsh transcripts.

Figure 6.7. Alignment of the amino acid sequences of the human RING3 and Drosophila 5.9-fsh gene products. Residues which are identical in the two proteins are indicated by colons. Dashes indicate gaps which have been introduced to maximise the homology. The alignment starts with residue 1 of the RING3 protein and residue 8 of the fsh protein.

A role for the fsh locus in Drosophila development has been suggested by the observation that fsh expression is required both maternally during oogenesis and then later in embryogenesis for normal embryonic pattern formation to occur. Furthermore, certain mutant fsh alleles interact synergistically with mutant alleles at other developmentally important loci (such as the homeotic gene Ultrabithorax) to increase the frequency of abnormalities in the segmentation pattern of the embryo (Haynes et al., 1989). It has also been directly shown, using immunofluorescence-tagged antibodies against the product of the Kruppel gene (another developmentally important locus), that Kruppel expression is altered in fsh-mutant embryos (Huang and Dawid, 1990). These observations suggest that the fsh gene product(s) plays a role in the complex interaction between gene products which is known to be critical for normal pattern formation in the developing Drosophila embryo. It is not yet known how this interaction is mediated.

Mammals, including humans, are known to have embryonically-expressed homologues of many Drosophila developmental genes and it is possible that a similar system of interacting gene products is involved in establishing the body pattern of mammals (Dressler and Gruss, 1988). This raises the exciting possibility that RING3 may be critical in human embryonic development. It will be of great interest to test the expression of RING3 in embryonic tissues.

The region which is duplicated within the predicted protein product of the RING3 gene (Figure 6.6.) shows homology to a duplicated domain in the human CCG1 protein (Sekiguchi et al., 1988). CCG1 was identified in a transfection assay as the gene which complements temperature-sensitive mutant baby hamster kidney cell lines which are arrested in the G1 phase of the cell cycle. The mechanism by which the CCG1 gene product overcomes the block in the cell cycle is unknown, and the function of the duplicated domain therefore remains obscure. The homology between the CCG1 and RING3 gene products is confined to the region of internal homology. This duplicated motif may define a novel protein domain.

6.5. Nucleotide sequence of RING4

The complete nucleotide sequence of the ~2.6kb RING4 cDNA clone 2.1 was determined by Ian Mockridge and Adrian Kelly in this laboratory. The cDNA clone contained a single long ORF which started at the 5' end. A methionine codon occurred at nucleotide 84 but this was not preceded by an in-frame upstream stop codon. A portion of the cosmid clone U15 containing the 5' end of the RING4 gene was sequenced (collaboration with Dr. Stephan Beck, ICRF), and the ORF was found to extend a further 180 nucleotides in the 5' direction before a stop codon was encountered. The sequence presented in Figure 6.8. is a composite of the cDNA sequence and the genomic sequence. As with RING1, there is as yet no evidence that this genomic sequence is part of the transcribed region, but at present it will be assumed that the sequence shown in Figure 6.8. is correct. The composite ORF contained two potential initiating methionine residues (arrowed in Figure 6.8.), one immediately following the upstream in-frame stop codon in the genomic sequence and a second at amino acid position 61 (nucleotide 211) in the cDNA clone. Use of the first methionine would give a protein containing 808 amino acids, while use of the second would give a product of 748 amino acids. The ORF was followed by a 3' untranslated region of 370 nucleotides and a poly(A) tail. Starting 19 nucleotides upstream of the tract of A residues was the sequence AATAAA, which is identical to the consensus signal for 3' processing of mRNAs (Sheets et al., 1990).
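The relationship between alternative initiating methionines and the predicted product length can be illustrated with a short scan of an open reading frame. This is a minimal sketch under stated assumptions: the input is already in frame, the function name is hypothetical, and the toy sequence is invented for illustration and is not RING4 sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def protein_lengths_from_starts(orf: str):
    """List (nucleotide_position, protein_length) for each in-frame ATG.

    `orf` is read in frame from its first base. Reading stops at the
    first in-frame stop codon, or at the end of the sequence if there
    is none. Positions are 1-based.
    """
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    # Index of the first stop codon; the whole frame is open if absent.
    stop_index = len(codons)
    for idx, codon in enumerate(codons):
        if codon in STOP_CODONS:
            stop_index = idx
            break
    results = []
    for idx, codon in enumerate(codons[:stop_index]):
        if codon == "ATG":
            # Protein length is the number of codons from this ATG to the stop.
            results.append((idx * 3 + 1, stop_index - idx))
    return results


# Hypothetical toy ORF with two in-frame methionine codons.
print(protein_lengths_from_starts("ATGGCTGCTATGGCTTAA"))
```

The difference between the two predicted lengths equals the number of codons separating the two start sites, which is why products of 808 and 748 residues place the second methionine at amino acid position 61.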

[Figure 6.8: sequence listing, nucleotides 1-2842 with the predicted amino acid translation.]

Figure 6.8. Nucleotide sequence of the RING4 gene. Nucleotide sequences after the vertical line were obtained from the cDNA clone 2.1. Sequences before the vertical line were obtained from a genomic subclone. Two potential initiating methionine residues are arrowed. Three consensus N-linked glycosylation sites are indicated by squares. Potential transmembrane regions in the ORF are underlined. The polyadenylation sequence in the 3' untranslated region is underlined. The region encoding the ATP-binding domain is boxed.

Figure 6.9. Hydropathicity analysis of the predicted RING4 product, using the algorithm of Rao and Argos (1986) to predict the location of transmembrane regions. Short horizontal lines indicate the positions of putative membrane-spanning domains.

The N-terminal two-thirds of the predicted protein product of the RING4 gene was highly hydrophobic in character. The potential membrane-spanning regions indicated in Figure 6.8. were revealed in a hydropathicity analysis by Dr. Michael Sternberg (Figure 6.9.). The C-terminal third, in contrast, was more hydrophilic. When the predicted amino acid sequence of the RING4 product was used to search the protein sequence databases, significant homology was found between a stretch of amino acids within the C-terminal region (boxed in Figure 6.8.) and the ATP-binding site of members of the 'ABC' (ATP-binding cassette) superfamily of energy-dependent transport proteins (Figure 6.10.), which includes P-glycoprotein, the plasma-membrane-associated pump responsible for the multi-drug resistance phenotype found in some human tumour cells (Juranka et al., 1989; Hyde et al., 1990). The physiological substrate of P-glycoprotein is uncertain, but other members of the ABC superfamily are each specialised for the transport of a particular substrate. Known substrates include sugars, inorganic ions, amino acids, peptides and proteins.
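ATP-binding sites of the kind discussed above commonly contain the Walker A (P-loop) consensus, usually written G-x(4)-G-K-[ST]. The text does not say how the homology was scored, so the pattern search below is offered only as an illustrative sketch. The function name is hypothetical and the fragment is a short made-up example rather than a verified extract of the RING4 sequence.

```python
import re

# Walker A (P-loop) consensus: G, any four residues, G, K, then S or T.
WALKER_A = re.compile(r"G.{4}GK[ST]")


def find_walker_a(protein: str):
    """Return (1-based start position, matched residues) for each match."""
    return [(m.start() + 1, m.group()) for m in WALKER_A.finditer(protein.upper())]


# Illustrative fragment only.
print(find_walker_a("LVGPNGSGKSTVAALL"))
```

A pattern match of this kind is much less sensitive than a database similarity search, and is best treated as a quick check that a candidate nucleotide-binding fold is present.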

In general, ABC transporters have a similar overall organisation to P-glycoprotein, shown in Figure 6.11., with two highly hydrophobic integral membrane domains and two ATP-binding domains (Hyde et al., 1990; Juranka et al., 1989). The four domains may be present in one polypeptide encoded by a single gene, or they may be encoded by more than one gene so that the functional transporter is a multi-subunit complex. P-glycoprotein, for example, is encoded by a single gene, while the oligopeptide permease of bacteria is the product of four separate genes. The RING4 product, which contains one hydrophobic domain and one ATP-binding domain, falls between these two extremes, resembling one-half of a functional ABC transporter molecule, like the hlyB haemolysin transporter of E. coli (Hyde et al., 1990; Felmlee et al., 1985). If the RING4 product is part of an ABC transporter, the functional molecule may be a RING4 homodimer, or a heterodimer with a RING4-related gene product.
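Hydrophobic membrane domains such as those described above are usually located with a sliding-window hydropathy scan. The analysis in this chapter used the algorithm of Rao and Argos (1986); the sketch below substitutes the simpler Kyte-Doolittle scale, so it illustrates the general approach and not the method actually applied to RING4. The window of 19 residues and the threshold of 1.6 are conventional Kyte-Doolittle settings, and the function names and test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_windows(protein: str, window: int = 19, threshold: float = 1.6):
    """1-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i + 1 for i, score in enumerate(profile) if score >= threshold]


# Artificial sequence: a leucine stretch followed by a lysine stretch.
toy = "L" * 19 + "K" * 19
print(hydrophobic_windows(toy))
```

Consecutive windows above the threshold mark a candidate membrane-spanning segment; the boundaries depend on the scale and window chosen, which is one reason why different algorithms can disagree over regions such as the alanine-rich stretches in fsh.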

Figure 6.11. Model of the transmembrane topology of P-glycoprotein, illustrating the structure of the members of the ABC superfamily of transporters. Each circle represents an amino acid; branched structures represent N-linked carbohydrate. The tandemly duplicated structure is clearly evident. The predicted product of the RING4 gene contains one membrane-associated domain and one ATP-binding domain and would therefore have a structure similar to one half of P-glycoprotein.

The possibility that the RING4 product is part of an ABC transporter may be particularly relevant for the understanding of the class II-associated defect in antigen presentation by class I molecules, as discussed in section 1.6.2. Studies of mutant cell lines in man and rat have implicated a class II-encoded function in normal class I antigen presentation in these species (Livingstone et al., 1989; Cerundolo et al., 1990). The RING4 gene maps within the interval known to contain this function in the human mutant B-cell line LBL 721.174. This cell line expresses normal class I molecules but these are unable to present intracellular viral antigens and do not progress from the endoplasmic reticulum (ER), which is thought to be the site of class I/peptide interaction. The defect can be relieved by the addition of exogenous viral peptides, which induce folding of the class I molecules, association with β2-microglobulin, and transport of the complex to the cell surface where normal antigen presentation occurs. This observation has been interpreted as showing that LBL 721.174 is defective in the transport of intracellularly derived peptides from their site of generation in the cytosol to their site of binding to class I, in the ER (Cerundolo et al., 1990). It is known that some members of the ABC family are specialised for the transport of peptides and proteins; thus, the product of the STE6 gene in S. cerevisiae is responsible for export of the 12-amino acid a-factor mating pheromone (McGrath and Varshavsky, 1989), the product of the E. coli hlyB gene transports the protein haemolysin (Mr 107kD; Felmlee et al., 1985), and the product of the opp (oligopeptide permease) operon of S. typhimurium transports peptides of up to 5 amino acids in length (Hiles et al., 1987). These members of the ABC superfamily are specialised for the transmembrane transport of polypeptides which do not have signal sequences, by a mechanism independent of the normal secretory pathway.
An attractive hypothesis is that the RING4 product is part of a signal-independent peptide transporter which is associated with the ER membrane and which pumps peptides from the cytosol into the lumen of the ER, where binding to class I molecules then occurs (Trowsdale et al., 1990). The role of the RING4 product in class I antigen presentation is currently being tested by transfecting the RING4 gene into mutant cell lines and testing for complementation of the defect (collaboration with Dr. Alain Townsend).

The RING4 gene is an exciting candidate for explaining some class II-associated autoimmune diseases. It can be speculated that different allelic forms of this gene could have a profound influence on the immune response by determining which peptides ultimately become associated with class I molecules and presented to cytotoxic T-lymphocytes. Thus, in theory, the RING4 protein could promote the presentation of certain autoantigens to autoreactive T-cells, and participate in the initiation of an autoimmune response.

6.6. Partial nucleotide sequence of RING5

The two ends of the insert of the 2.3kb RING5 cDNA clone yU5 were sequenced using the two synthetic oligonucleotide primers (CDM8-F and CDM8-B) complementary to the CDM8 vector on either side of the cloning site. The sequence primed by CDM8-F extended for 162 nucleotides and contained a single ORF of 54 amino acid residues (Figure 6.12.). This ORF contained a stretch of hydrophobic amino acid residues followed by a region unusually rich in histidine residues. The partial amino acid sequence of the putative RING5 protein product was found to be very similar to the N-terminal region of the predicted product of a mouse gene, KE4 (St.-Jacques et al., 1990). When the RING5 and KE4 gene sequences were aligned, 145 out of 162 nucleotides (89.5%) were identical. At the amino acid level, 45 out of 54 (83%) residues were identical (Figure 6.12.). The yU5 sequence primed by the CDM8-B oligonucleotide contained stop codons in all three reading frames. When this sequence was compared with the sequence of the KE4 cDNA, significant identity was detected with the 3' untranslated region of KE4. It was therefore concluded that RING5 and KE4 are homologues of the same gene. Consistent with this conclusion are the observations that the RING5 cDNA clone yU5 detects cross-hybridising sequences 45kb proximal of the mouse Pb gene, which is known to be the map position of KE4 (see Chapter 7 for further details), and that KE4 and RING5 were both found to be expressed in all tissues tested (St.-Jacques et al., 1990; Table 5.4.). It is interesting, however, that RING5 appears to give rise to two major RNA species, while KE4 produces only one (Figure 5.14.; St.-Jacques et al., 1990). Furthermore, the KE4 transcript was estimated to be 2.8kb in length, while the RING5 transcripts are 2.0 and 2.3kb. The explanation for these differences will await the complete sequencing of cDNA clones for the two forms of RING5 and comparison of these sequences to KE4. The protein coding region of the 2.7kb KE4 cDNA which was sequenced in the study of St.-Jacques et al. (1990) spanned only 1.3kb of the total and it is therefore possible that the differences in length observed between the mouse and human transcripts occur in the untranslated regions.
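The observation that the CDM8-B read contained stop codons in all three reading frames is the usual indication that a sequence lies in an untranslated region. A minimal sketch of that check on the forward strand is given below; the function names and the toy sequences are hypothetical and are not taken from the yU5 clone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stops_by_frame(seq: str):
    """Map each forward frame (1-3) to the 1-based positions of its stop codons."""
    seq = seq.upper()
    result = {}
    for frame in range(3):
        result[frame + 1] = [
            i + 1
            for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS
        ]
    return result


def closed_in_all_frames(seq: str) -> bool:
    """True if every forward reading frame contains at least one stop codon."""
    return all(stops_by_frame(seq).values())


# Hypothetical toy sequences.
print(stops_by_frame("ATGTAAGTGATTAGC"))
print(closed_in_all_frames("ATGGCTGCT"))
```

For a short read this test is only suggestive, since a coding sequence read in the wrong orientation or containing a sequencing error can also appear closed in every frame.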

[Figure 6.12. Alignment of the partial RING5 sequence with KE4. Only the first row of the RING5 sequence is legible:]

RING5 CCC CAC TGG GTG GCG GTG GGA CTG CTG ACC TGG GCG ACC TTG GGG CTT CTG GTG (54)
RING5 P H W V A V G L L T W A T L G L L V (18)