De Novo Genome Assembly and Analysis of Non-Allelic Recombination in Pathogenic Yeast Candida glabrata

by Zhuwei Xu

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland
November 2019

© 2019 Zhuwei Xu
All Rights Reserved

Abstract

Candida glabrata is an opportunistic pathogen in humans, responsible for approximately 20% of disseminated candidiasis. C. glabrata's ability to adhere to host tissue is mediated by GPI-anchored cell wall proteins (GPI-CWPs); the corresponding genes contain long tandem repeat regions and form large gene families. These tandem repeats cause mis-assemblies of GPI-CWP genes in the C. glabrata genome. Subtelomeres of C. glabrata are particularly rich in GPI-CWP genes and share homology with each other. Consequently, the subtelomeres are mis-assembled in genome sequences assembled from short sequencing reads.

In this thesis, we used long single-molecule real-time (SMRT) reads to perform de novo assembly of the C. glabrata genome and establish the correct structure of the GPI-CWP genes and the subtelomeres. We assembled the genomes of six C. glabrata strains: the type strain, CBS138; our lab strain, BG2; and four serial clinical isolates, BG3993-96, which we used to assess genome changes during infection. With high-quality sequences in hand, we then assessed recombinational exchange between GPI-CWP genes by non-allelic mitotic recombination. This question is difficult to address with standard aligners, so we developed a k-mer based method to identify recombination.

Our assembly established the correct subtelomere structure of Candida glabrata and provides the correct structure of the GPI-CWP gene families. Our analysis of the clinical isolates showed a very modest level of genetic change during the period of infection. Two of the four isolates are hyperadherent, and we identified a mutation in the gene encoding the transcription factor Yap6 as a likely cause of this phenotypic change. Our k-mer based method was applied genome-wide to identify non-allelic mitotic recombination events, including those in complex repeat regions. We documented a higher apparent recombination rate between subtelomeric genes and, overall, between GPI-CWP genes, independent of their location. In addition, we could document mitotic exchange between non-subtelomeric, non-GPI-CWP genes.

Thesis Advisor: Brendan Cormack
Thesis Reader: Michael Schatz

Acknowledgement

I am grateful to my parents, Zhongyuan Xu and Xiuhua Yang, for their assistance and support during my graduate career. I am grateful to Jef Boeke, in whose lab I worked for the first part of my graduate career, for the opportunity to do research with him on the synthesis of yeast chromosomes.

I thank Brendan for his tremendous help and guidance. He welcomed me in my transition from benchwork to computational biology. He introduced me to the computational analysis of the C. glabrata genome and gave me an interesting and complicated problem to study: mitotic recombination events in the complex C. glabrata GPI-CWP genes. I would like to thank Cindy Rogers for all her help during my time in the Cormack lab.

I would also like to thank my thesis committee members, Jeff Corden, Sarah Wheelan, and Michael Schatz, for their guidance. Michael and Sarah are also co-mentors for my thesis and made many invaluable intellectual contributions to my thesis work. I also thank Michael for reading my thesis, and I am grateful for his expeditious review, which allowed me to finish my thesis in time.

Finally, I would like to thank the BCMB program for giving me the chance to do research at Hopkins. I am particularly grateful to Carolyn Machamer for her help during and after my transition to the Cormack Lab.
Lastly, I want to thank Arhonda Gogos, who is the heart of the program, for all her help during my time at Hopkins.

Table of Contents

Abstract
Acknowledgement
Table of Contents
List of Tables
List of Figures
Chapter 1 Background and Significance
  Cell wall structure
  Structure and function of adhesin genes
  Adhesins in Candida albicans
  Adhesins in Saccharomyces cerevisiae
  Adhesins in Candida glabrata
  Subtelomeres in Candida glabrata
  Subtelomeric silencing in Saccharomyces cerevisiae
  Subtelomeric silencing in Candida glabrata
  Genome assembly of Candida glabrata
  De novo genome assembly using long sequencing reads
  Reference
Chapter 2 De novo assembly of the Candida glabrata type strain reveals cell wall protein complement and structure of dispersed tandem repeat arrays
  Introduction
  Experimental Procedures
  Results
  Reference
Chapter 3 De novo assembly of C. glabrata BG2 strain
  Introduction
  Experimental Procedures
  Results
  Discussion
  Reference
Chapter 4 De novo assembly of four clinical serial isolates
  Introduction
  Experimental Procedures
  Results
  Discussion
  Reference
Chapter 5 Non-allelic mitotic recombination in Candida glabrata
  Introduction
  Experimental Procedures
  Results
  Discussion
  Reference
Conclusion
  De novo assembly of the type strain CBS138
  De novo assembly of strain BG2
  De novo assembly of serial clinical isolates, BG3993-96
  Analysis of non-allelic mitotic recombination in C. glabrata
  Future directions
  Reference
Appendices
  Supplementary Tables
  Supplementary Figures
Curriculum Vitae

List of Tables

Table 1 Metrics of PacBio sequencing
Table 2 Chromosome assembly metrics

List of Figures

Figure 1 Illumina read coverage over the assembled genome
Figure 2 Structure of the C. glabrata subtelomeres
Figure 3 Gene length difference between two assemblies
Figure 4 Phylogenetic tree of GPI-anchored adhesins
Figure 5 Chromosome rearrangements
Figure 6 Illumina read coverage
Figure 7 Subtelomere structure
Figure 8 Gene comparison between BG2 and CBS138
Figure 9 Phylogenetic tree of CWP-GPIs
Figure 10 Chromosome rearrangements between BG3993 and BG2
Figure 11 Subtelomere structure of BG3993
Figure 12 Phylogenetic tree of GPI-CWPs
Figure 13 Diagram of k-mer based recombination analysis
Figure 14 Example of one conflict recombination event
Figure 15 Optimization of k-mer size in the first step
Figure 16 Optimization of k-mer size in the second step
Figure 17 Recombination of BG2 CAGL0A01284g (EPA10) relative to CBS138
Figure 18 Multiple sequence alignment of one recombination event
Figure 19 Structure of CAGL0B05061g ORF sequence in strain BG2 and CBS138
Figure 20 Recombination of BG2 CAGL0B05061g and BG2 CAGL0F09251g
Figure 21 Megablast alignment of CBS138 CAGL0B05061g to BG2 ORFs
Figure 22 Recombination of BG2 CAGL0J11968g (EPA15) and BG2 CAGL0M00132g (EPA12)
Figure 23 Genome-wide recombination map in BG2 relative to CBS138
Figure 24 Histogram of weighted average of bitscores
Figure 25 Ratio of recombined pairs to homologous pairs
Figure 26 Influence of distance between gene pairs in recombination

Chapter 1 Background and Significance

Candida glabrata is an opportunistic pathogen of humans that causes both superficial mucosal infection and severe disseminated infection (Gonçalves et al., 2016; Pappas et al., 2018). C. glabrata accounts for up to 30% of Candida bloodstream infections in hospitalized patients in Europe and the United States (Andes et al., 2016; Astvad et al., 2018; Chapman et al., 2017; Cleveland et al., 2015). The contributing factors of C.
glabrata virulence are not completely understood, but secretion of hydrolytic enzymes, the ability to evade phagocytic killing, and adherence to host tissue are all thought to play a role (Kumar et al., 2019). Adherence is mediated by a subset of the C. glabrata surface proteins (Timmermans et al., 2018). The GPI-anchored cell wall proteins (GPI-CWPs) are the main class of surface proteins and the main class of proteins mediating adherence. The GPI-CWPs form large gene families, and their ORFs are characterized by the presence of long tandem repeats. In C. glabrata, the repetitive nature of the GPI-CWP genes and the homology between members of GPI-CWP families complicate genome assembly and have resulted in gross mis-assemblies of C. glabrata genome sequences. Since these mis-assemblies fall within the GPI-CWP class of genes, a major impact of these errors is to complicate genetic investigation of the GPI-CWPs.

This thesis applies de novo genome assembly with long Single-Molecule Real-Time (SMRT) sequencing reads to generate high-quality C. glabrata genome assemblies that permit further study of the GPI-CWPs. In addition, using the very high-quality genome sequences generated for C. glabrata, this thesis studies whether the GPI-CWP encoding genes exchange information through mitotic recombination, and investigates general mitotic recombination events between C. glabrata strains. In this introductory chapter, I review the current knowledge of cell adherence, subtelomeric localization of the GPI-CWP encoding genes, transcriptional regulation of subtelomeric GPI-CWP genes, and the current status of the C. glabrata genome assembly.

Cell wall structure

The fungal cell wall is composed of proteins and polysaccharides.
The cell wall has a general structure consisting of an alkali-insoluble layer close to the plasma membrane, an alkali-soluble layer close to the outer face, and an outermost layer where most cell wall proteins are located. The alkali-insoluble layer consists of β1,3-glucans cross-linked to chitin, forming a complex that binds other polysaccharides (Latgé and Beauvais, 2014). In contrast, the alkali-soluble layer is composed of amorphous polysaccharides (Latgé and Beauvais, 2014). The polysaccharide components of the two layers vary among pathogens. For instance, β1,3- and β1,4-glucans and galactomannans bind to the glucan-chitin complex in Aspergillus fumigatus, while β1,6-glucans are the primary component in Candida albicans (Latgé, 2010).
