Structure and Function of KH Domain in DDX43 and DDX53 Cancer Antigen

A Thesis Submitted to the College of Graduate and Postdoctoral Studies In Fulfillment of the Requirements For the Degree of Master of Science In the Department of Biochemistry, Microbiology and Immunology University of Saskatchewan Saskatoon

By Manisha Yadav

© Copyright Manisha Yadav, April, 2020. All rights reserved.

PERMISSION OF USE STATEMENT

I hereby present this thesis in partial fulfilment of the requirements for a postgraduate degree from the University of Saskatchewan and agree that the Libraries of this University may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, either in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised this thesis or, in their absence, by the Head of the Department or the Dean of the College in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts of it for any financial gain will not be allowed without my written permission. It is also understood that due recognition shall be given to me and to the University of Saskatchewan in any scholarly use which may be made of any material in my thesis.

Requests for permission to copy or to make other use of material in this thesis in whole or part should be addressed to:

Head of the Department of Biochemistry, Microbiology and Immunology

University of Saskatchewan

Saskatoon, Saskatchewan, S7N 5E5

Dean

College of Graduate and Postdoctoral Studies

University of Saskatchewan

116 Thorvaldson Building, 110 Science Place

Saskatoon, Saskatchewan, S7N 5C9

Canada

ABSTRACT

DDX43 and DDX53 are two cancer/testis antigens that are overexpressed in various tumors but absent or expressed at very low levels in normal tissues; both are therefore proposed to be valid candidates for therapeutic targets. DDX43 and DDX53 belong to the DEAD-box RNA helicase family, and contain a K-homology (KH) domain in their N-terminal regions and a helicase core domain in their C-terminal regions. The KH domain was first characterized in the human heterogeneous nuclear ribonucleoprotein K (hnRNP K). The typical function of the KH domain is to bind single-stranded (ss) RNA or ssDNA in a sequence-specific manner. The helicase core domain contains two conserved RecA-like domains and unwinds double-stranded nucleic acids in a sequence-nonspecific manner; therefore, we hypothesize that the KH domain may play a critical role in the sequence-specific function of DDX43 and DDX53. In this study, we expressed and purified DDX43-KH and DDX53-KH domains using Ni-NTA column chromatography and size exclusion chromatography. Using electrophoretic mobility shift assay (EMSA), we found that the DDX43-KH domain binds ssDNA and ssRNA substrates efficiently but not blunt-end double-stranded (ds) DNA or dsRNA substrates. We applied 1H-15N HSQC NMR spectroscopy to determine the binding affinity of DDX43-KH for dT5, dA5, dC5, dG5 and rU5 oligonucleotides, and found significant chemical shift changes in the HSQC spectra for pyrimidines (dT5, dC5, rU5), but not for purines (dA5 and dG5). Titrations with dT10 and dT5 oligonucleotides revealed that the GRGG loop in the DDX43-KH domain is involved in nucleic acid interactions. In addition, we found that Ala81, adjacent to the GRGG loop, is involved in the binding. The A81G and A81S mutations reduced protein stability and binding affinity.
To further identify the specific nucleotides bound by the DDX43-KH and DDX53-KH domains, we used SELEX (systematic evolution of ligands by exponential enrichment) and ChIP-seq (chromatin immunoprecipitation sequencing)/CLIP-seq (cross-linking immunoprecipitation sequencing), and found that CT-rich sequences are commonly occurring motifs for the DDX43-KH domain. However, GTTGA, TG/TTG/TG/T motifs are also common in the SELEX and CLIP samples. Finally, NMR spectroscopy showed that GTTGT has the lowest Kd value (0.0435 mM), suggesting that DDX43-KH binds to TG-rich sequences with high affinity and might partially dictate the substrate specificity of the full-length DDX43 protein.

ACKNOWLEDGEMENTS

I wish to express my deepest appreciation to my supervisors, Dr. Yuliang Wu and Dr. Miroslaw Cygler, for giving me the opportunity to carry out this project in their labs, for their endless support of my academic studies and for their helpful guidance on my career. Their resolute enthusiasm for science kept me constantly engaged with my research. I would also like to express gratitude to my committee members, Dr. Jeremy Lee and Dr. Oleg Dmitriev, for their insightful suggestions and guidance throughout my project. Dr. Dmitriev guided and supported me throughout the NMR spectroscopy experiments, and his input in developing this project kept me motivated. Dr. Lee always suggested very important ideas related to the project. My work is a result of constant support and suggestions from this team of professors, and I am highly grateful to have worked under their supervision. My sincere thanks also go to Dr. Anthony Kusalik and Daniel Hogan for processing and analyzing the sequence data. I would like to thank Dr. Ivy Yeuk Wah Chung, who was involved with me in all the protein purification and crystallization experiments. I am grateful to Dr. Venkatasubramanian Vidhyasagar for his guidance in the ChIP/CLIP-seq experiments. I would like to thank Mrs. Eva Ulhemann for her suggestions and support in sharing the protocols for 15N-labelled protein purification. All members of the Dr. Wu and Dr. Cygler labs deserve recognition for their constant advice and friendship: Dr. Ravi Shankar Singh, Kingsley Ekumi, Dr. Michal Boniecki, Shizhuo (Sarah) Yang, Aanchal Aggrawal. I would like to extend my thanks to the Saskatchewan Structural Sciences Centre for the 600 MHz NMR spectrometer. I would like to thank all members of the Cancer Cluster and the faculty of the Biochemistry Department.

I must express my profound gratitude to my parents, my sister and my brother for providing me with continuous encouragement to pursue my dreams and for their patience throughout my years of research. I am also thankful to all my friends for their help and support through my master's program.

TABLE OF CONTENTS

PERMISSION OF USE STATEMENT
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

1. INTRODUCTION
1.1 Helicases are classified into six superfamilies
1.2 DEAD-box helicases
1.3 DDX43 (HAGE) and DDX53 (CAGE) helicases
1.4 The KH domain and KH domain-containing human proteins
1.4.1 Two types of KH domain
1.4.2 Interaction of Type I KH domains with their specific nucleic-acid sequences
1.4.3 Structure of KH domains bound to specific nucleic acid sequences

2. HYPOTHESIS AND OBJECTIVES
2.1 Hypothesis
2.2 Objectives

3. MATERIALS AND METHODS
3.1 Reagent list
3.2 Plasmid DNA and mutagenesis
3.3 RNA and DNA substrates
3.4 Expression and purification of recombinant DDX43-KH and DDX53-KH domain proteins
3.5 Western blot
3.6 Electrophoretic mobility shift assay (EMSA)
3.7 Helicase unwinding assays
3.8 15N-Isotopic labelling of recombinant DDX43-KH proteins
3.9 Systematic evolution of ligands by exponential enrichment (SELEX)
3.10 Cell culture
3.11 Transfection in HEK293T cells and immunoprecipitation
3.12 ChIP-DNA and CLIP-RNA library construction
3.13 Bioinformatics analysis of SELEX, ChIP-DNA, and CLIP-RNA sequences

4. RESULTS
4.1 Protein overexpression and purification
4.1.1 Purification of DDX43-KH and DDX53-KH domain proteins
4.1.2 DDX43-KH+RecA1 and DDX53-KH+RecA1 proteins are unstable and difficult to purify
4.2 DDX43-KH-74 and DDX43-KH-126 grow into small needle-like crystals and DDX53-KH proteins do not crystallize
4.3 Biochemical characterization of DDX43-KH and DDX53-KH domains using electrophoretic mobility shift assay (EMSA)
4.3.1 DDX43-KH and DDX53-KH domains bind ssDNA and RNA but not blunt-end dsDNA and dsRNA substrates
4.3.2 DDX43-KH and DDX53-KH domains prefer pyrimidines over purines by EMSA
4.4 DDX43-KH and DDX53-KH domains alone have no helicase unwinding activity
4.5 Nuclear Magnetic Resonance (NMR) spectroscopy analysis of DDX43-KH domain
4.5.1 Expression and purification of 15N-labelled DDX43-KH-89 protein
4.5.2 Sequential assignment of DDX43-KH-89 protein and identifying the amino acids interacting with oligonucleotides
4.5.3 DDX43-KH domain prefers binding pyrimidines over purines by NMR spectroscopy
4.6 Role of Ala81 and Gly87 amino acid(s) in DDX43-KH protein’s stability and function
4.6.1 DDX43-KH G87D and A81I mutants are unstable and insoluble for purification
4.6.2 1H-15N HSQC NMR spectroscopy of DDX43-KH-89-A81S and DDX43-KH-89-A81G
4.6.3 DDX43-KH-89-A81S and DDX43-KH-89-A81G have reduced protein stability and binding activity
4.7 Identification of nucleic acids bound by KH domain using SELEX and ChIP/CLIP-Seq methods
4.7.1 Sequences identified by SELEX
4.7.2 Sequences identified by ChIP/CLIP-seq
4.7.3 Validation of specific nucleic acid sequences bound by DDX43-KH-89 using 1H-15N HSQC NMR spectroscopy

5. DISCUSSION
5.1 DDX43-KH and DDX53-KH domains bind to single-stranded DNA and RNA substrates
5.2 DDX43-KH domain prefers pyrimidines
5.3 Role of the GXXG loop in DDX43-KH domain
5.4 Oligomeric status of DDX43-KH and DDX53-KH domain
5.5 Role of KH domain in DDX43 and DDX53 helicases
5.6 Potential biological role of KH domain-containing helicases DDX43 and DDX53 in cancer

6. CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
6.2 Future work

7. REFERENCES

LIST OF FIGURES

Figure 1. Schematic of the core helicase domain in SF1-6 family members

Figure 2. Motif structure of DEAD- and DEAH-box in SF2 RNA helicases with their potential functions

Figure 3. Domain diagram and primary protein sequence of human DDX43 protein

Figure 4. Domain diagram and primary protein sequence of human DDX53 protein

Figure 5. Comparing sequences of DDX43 and DDX53 proteins

Figure 6. Summary of KH domain containing human proteins

Figure 7. Secondary structure arrangement of Type I and Type II KH domain

Figure 8. Multiple sequence alignment of DDX43 KH domain with other protein KH domains

Figure 9. Structure of KSRP KH3 domain in complex with AGGGU (PDB: 4B8T)

Figure 10. Structure of KH domains bound to specific nucleic acid sequences

Figure 11. Cartoon diagram depicting the DDX43 and DDX53 KH domain constructs along with their specific primary sequences

Figure 12. Purification of DDX43-KH and DDX53-KH domain proteins

Figure 13. Optimization of expression and purification of DDX43-KH+RecA1 and DDX53-KH+RecA1 proteins

Figure 14. DDX43-KH protein in the form of needle crystals

Figure 15. Treatment of DDX43-KH-126 with different proteases

Figure 16. DDX43 and DDX53 KH proteins prefer to bind single stranded DNA and RNA

Figure 17. DDX43 and DDX53 KH-domain proteins prefer binding to pyrimidines over purines

Figure 18. Unwinding assays of DDX43 and DDX53 KH proteins

Figure 19. Purification of 15N-labelled DDX43-KH-89 protein for NMR analysis

Figure 20. Amino acids peak assignments of DDX43-KH-89 protein’s NMR spectra

Figure 21. NMR characterization of the interaction between DDX43-KH-89 with dT10 and dT5 oligonucleotides

Figure 22. 1H-15N HSQC NMR spectrum showing DDX43 KH domain prefers pyrimidine nucleotides

Figure 23. Multiple sequence alignment of DDX43-KH domain for determining the conservation of A81 and G87 amino acids

Figure 24. Purification of DDX43-KH-126-G87D and DDX43-KH-89-A81I mutant proteins

Figure 25. Purification of DDX43-KH-89 A81S and A81G mutant proteins

Figure 26. NMR spectral analysis of DDX43-KH-89 A81S and A81G mutant proteins

Figure 27. Analysis of interaction between DDX43-KH-89 A81S and A81G with oligonucleotides using NMR spectroscopy

Figure 28. Analysis of interaction between DDX43-KH-89 A81S and A81G proteins with oligonucleotides using EMSA

Figure 29. Identification of specific sequences by SELEX method

Figure 30. Quality check by Bioanalyzer for the DNA library constructed using SELEX method

Figure 31. Motif sequences identified for DDX43-KH-89 protein by SELEX method

Figure 32. Identification of specific sequences bound by KH domain using ChIP/CLIP-Seq methods

Figure 33. Quality check by Bioanalyzer for the DNA/RNA library constructed using ChIP/CLIP-Seq methods

Figure 34. Motif sequences identified for DDX43-KH-89 protein by ChIP/CLIP-seq method

Figure 35. Overlay of 1H-15N HSQC titration spectrum of DDX43-KH-89 protein with indicated oligonucleotide (ranged 40-1000 μM)

Figure 36. Combined chemical shift (Δδ) vs. oligonucleotide concentration (mM) for residues G84 and G87

Figure 37. A working model showing the role of KH domain in helicase

LIST OF TABLES

Table 1. KH domains with their specific sequences and Kd values

Table 2. List of reagents, catalog number and suppliers

Table 3. List of primers used in the study

Table 4. List of RNA and DNA substrates used in the study

Table 5. Summary of DDX43 and DDX53 KH domains protein expression and their respective molecular weight and pI

Table 6. Summary of DDX43 and DDX53 KH domains protein expression, concentration and crystal screening

Table 7. Affinities of DDX43-KH-89 domain with indicated oligonucleotide sequences determined by NMR spectroscopy

LIST OF ABBREVIATIONS

ATP        Adenosine triphosphate
bp         Base pair
BSA        Bovine serum albumin
CML        Chronic myeloid leukemia
CV         Column volume
DMEM       Dulbecco’s modified Eagle’s medium
DNA        Deoxyribonucleic acid
DTT        Dithiothreitol
EDTA       Ethylenediaminetetraacetic acid
EMSA       Electrophoretic mobility shift assay
ER         Estrogen receptor
FBS        Fetal bovine serum
FMRP       Fragile X mental retardation protein
FPLC       Fast protein liquid chromatography
HAGE       Helicase antigen
HD         Helicase domain
His        Hexa-histidine
hnRNP K    Heterogeneous nuclear ribonucleoprotein K
HRP        Horseradish peroxidase
IHC        Immunohistochemistry
IPTG       Isopropyl β-D-1-thiogalactopyranoside
KH         K-homology
LB         Lysogeny broth
MDS        Myelodysplastic syndrome
MEK        Mitogen-activated protein kinase kinase
MOPS       3-(N-morpholino)propanesulfonic acid
NE         No enzyme
Ni-NTA     Nickel-nitrilotriacetic acid
NTP        Nucleoside triphosphate
PAGE       Polyacrylamide gel electrophoresis
PBS        Phosphate-buffered saline
PCR        Polymerase chain reaction
PEI        Polyethylenimine
Pi         Inorganic phosphate
PMSF       Phenylmethanesulfonyl fluoride
PML        Progressive multifocal leukoencephalopathy
PVDF       Polyvinylidene difluoride
Sam68      Src-associated substrate in mitosis of 68 kDa
SDS        Sodium dodecyl sulfate
SDS-PAGE   Sodium dodecyl sulfate polyacrylamide gel electrophoresis
SEC        Size exclusion chromatography
SF         Superfamily
TBE        Tris/Borate/EDTA
TCEP       Tris(2-carboxyethyl)phosphine
TEMED      N,N,N′,N′-tetramethylethylenediamine
TLC        Thin-layer chromatography
Tris       Tris(hydroxymethyl)aminomethane
WT         Wild type

1. INTRODUCTION

1.1 Helicases are classified into six superfamilies

Helicases are nucleic acid-dependent ATPases that bind and unwind DNA or RNA duplexes or DNA-RNA hybrid substrates (Cordin et al., 2006; Singleton et al., 2007). They are ubiquitous and occur in highly diverse forms in both eukaryotic and prokaryotic cells. Helicases move unidirectionally along one nucleic acid strand, in either the 5′→3′ or the 3′→5′ direction; however, some helicases display bidirectional activity (Patel and Picha, 2000; Tuteja et al., 2014). They use the chemical energy released by ATP hydrolysis to unwind the DNA helix, resulting in strand separation. Helicases therefore belong to the general class of “motor proteins” (Lohman and Bjornson, 1996; Patel and Picha, 2000; Cordin et al., 2006). They play important roles in cellular processes including DNA replication and repair, transcription, protein synthesis, RNA maturation, mRNA splicing, chromosome segregation, and telomere maintenance (Brosh, Jr. and Bohr, 2007; Singleton et al., 2007; Bernstein et al., 2010; Dillingham, 2011; Linder and Jankowsky, 2011). Based on their substrates, helicases can be classified as DNA helicases or RNA helicases, although some helicases can unwind both DNA and RNA molecules. Based on their conserved motifs, helicases can be classified into six superfamilies (SF1-6) (Figure 1). SF3-6 helicases contain only a single domain that resembles the bacterial recombinase A (RecA) domain, and they form toroidal hexameric ring-like structures (Fairman-Williams et al., 2010). SF1 and SF2 helicases consist of two domains, each resembling RecA, and each domain contains several characteristic conserved sequence motifs. Unlike SF3-6 proteins, proteins from the SF1 and SF2 families do not form toroidal hexameric ring-like structures. The SF1 and SF2 families contain both RNA and DNA helicases and are present in prokaryotes, eukaryotes and viruses.
The SF3 family contains DNA helicases found in viruses, while the SF4 and SF5 families include DNA helicases of bacterial origin (Singleton et al., 2007). SF2 helicases form the largest superfamily and function as monomers or dimers. SF2 is further divided into several subfamilies, including DEAD-box RNA helicases, the RecQ-like family, and Snf2-like enzymes (Singleton et al., 2007; Pyle, 2008; Fairman-Williams et al., 2010; Byrd and Raney, 2012).

[Figure 1 image annotations: "Do not form toroidal hexameric ring structures"; "Largest subfamily: DEAD-box helicases"; "Forms toroidal hexameric ring structures".]

Figure 1. Schematic of the core helicase domain in SF1-6 family members. The N-terminal RecA domain (RecA1) is represented by a white cylinder and the C-terminal RecA domain (RecA2) is shown as a red cylinder. Motifs in yellow are involved in NTP binding/hydrolysis, green are associated with translocation, and blue interact with nucleic acid. Motifs that are unique to specific superfamilies are highlighted with a red oval. The Walker A (motif I), Walker B (motif II) and arginine finger (R) are conserved across all helicase superfamilies. Modified from (Jackson et al., 2014).

1.2 DEAD-box helicases

The DEAD-box helicases form the largest subfamily within the SF2 superfamily and are characterized by the presence of nine conserved motifs, designated Q, I, Ia, Ib, II, III, IV, V, and VI (Figure 2) (Cordin et al., 2006). Motif II contains the sequence Asp-Glu-Ala-Asp (D-E-A-D), from which the name ‘DEAD-box family’ was derived (Linder and Jankowsky, 2011). These motifs are usually clustered within a region of 200-700 amino acids called the helicase core domain. Depending on their interaction with partner proteins and on their distinct N- and C-terminal domains, DEAD-box proteins can act as RNA chaperones, as ATP-dependent RNA helicases, or as co-activators and co-repressors of transcription (Schutz et al., 2010; Jarmoskaite and Russell, 2014). Numerous reports indicate that DEAD-box proteins are involved in processes that are essential for cellular proliferation and/or neoplastic transformation (Fuller-Pace, 2013). Therefore, deregulation of the expression or function of these proteins is one of the major causes of cancer development and/or progression (Abdelhaleem, 2004; Fuller-Pace, 2013).

Figure 2. Motif structure of DEAD- and DEAH-box in SF2 RNA helicases with their potential functions. Modified from (Parsyan et al., 2011).

The simultaneous presence of all nine motifs is a criterion for classifying proteins as SF2 DEAD (DEAH)-box helicases (Figure 2) (Linder, 2006). Motif II, also known as the Walker B motif, together with motif I, the Q motif and motif VI, is required for ATP binding and hydrolysis (Pause and Sonenberg, 1992; Pause et al., 1994; Fairman-Williams et al., 2010). The Q motif can act as a sensor of the bound nucleotide state and as a regulator of ATPase activity (Tanner et al., 2003). It activates the hydrolysis of ATP when the substrates are correctly bound to the protein. Motif I, also known as the Walker A motif, is involved in NTP binding by forming hydrogen bonds with the pyrophosphates of the nucleotide in the presence of a Mg2+ ion. In the ligand-free form, the first conserved lysine residue in motif I interacts with the aspartate of motif II; this residue also interacts with the nearby serine residue of motif III (Cordin et al., 2006). Motif III participates in linking the ATPase and helicase activities (Pause and Sonenberg, 1992). The functions of motifs IV and V are not fully understood. Motif VI, located at the interface between domains 1 and 2, is involved in the interaction with RNA (Cordin et al., 2006). Along with motif VI, motifs Ia, Ib, IV and V are involved in the remodeling activity of RNA helicases (Cordin et al., 2006; Linder, 2006).

1.3 DDX43 (HAGE) and DDX53 (CAGE) helicases

DDX43, DEAD-box polypeptide 43, also known as HAGE (helicase antigen gene), is an RNA helicase. It was first identified as a cancer antigen gene in a human sarcoma cell line; the gene is located on chromosome 6 (6q12-q13) and encodes a protein of 648 amino acids with a molecular weight of 73 kDa (Martelange et al., 2000; Talwar et al., 2017). DDX43 mRNA is expressed in a wide range of solid and hematologic tumors at levels at least 100-fold higher than in normal tissues, with the exception of testis (Martelange et al., 2000; Lin et al., 2014). Overexpression of DDX43 in uveal melanoma cell lines increases the expression of RAS proteins by unwinding NRAS mRNA secondary structures, thereby making these cell lines resistant to MEK inhibitors (Linley et al., 2012; Ambrosini et al., 2014). In contrast, downregulation of DDX43 expression represses KRAS, NRAS and HRAS expression, inhibiting the ERK and AKT pathways (Ambrosini et al., 2014). DDX43 is therefore proposed to be a valid candidate for designing a broad-spectrum vaccine against cancer and a suitable target for immunotherapy (Mathieu et al., 2010). DDX43 belongs to the SF2 family and contains all the expected conserved signature motifs, including Q, I, Ia, Ib, II, III, IV, V and VI (Figure 3A). In addition, DDX43 has the conserved motifs Ic, Va and Vb that are present in certain DEAD-box proteins (Figure 3B) (Fairman-Williams et al., 2010). Biochemical characterization of the DDX43 protein demonstrates that it is an ATP-dependent dual helicase: it can unwind both dsDNA and dsRNA substrates (Talwar et al., 2017). The full-length DDX43 protein is purified as a homogeneous monomer and exhibits a binding preference for nucleotides in the order G > T > C > A (Wu et al., 2019). DDX43 binds single-stranded DNA/RNA preferentially and binds blunt-end dsDNA/dsRNA substrates poorly.
It has a polarity preference of 3′ to 5′ on DNA (Talwar et al., 2017; Wu et al., 2019), but this is reversed to 5′ to 3′ on RNA substrates (Talwar et al., 2017). In addition, DDX43 contains a KH domain in its N-terminal region (Talwar et al., 2017). KH domains have a characteristic pattern of highly conserved hydrophobic residues and a GXXG segment that interacts with nucleic acids. The KH domain in DDX43 has “GRGG” as its conserved GXXG sequence, and it plays an important role in the protein’s binding and unwinding activity; however, all domains are necessary to achieve high binding affinity for substrates and their successful unwinding (Talwar et al., 2017; Wu et al., 2019).

[Figure 3 image: (A) domain schematic of DDX43 (residues 1-648) showing the KH domain and the helicase domain with motifs Q, I, Ia, Ib, Ic, II, III, IV, V, Va, Vb and VI; (B) annotated primary sequence, including a K292R label. The sequence text is not recoverable from the extracted figure.]

Figure 3. Domain diagram and primary protein sequence of human DDX43 protein. (A) A schematic representation of the domain arrangement in DDX43. The conserved SF2 helicase motifs are indicated by yellow boxes, and the corresponding ATPase/helicase core domain is indicated. The KH domain is indicated with pink. (B) Primary protein sequence of human DDX43 protein. KH domain and conserved SF2 helicase motifs are indicated.

DDX53, DEAD-box helicase polypeptide 53, also known as CAGE (cancer-associated antigen gene), was identified by serological analysis of a recombinant cDNA expression library (SEREX) of human testis and gastric cancer cell lines with sera from patients with gastric cancer (Kim and Jeoung, 2008). DDX53 encodes a protein of 630 amino acids with possible helicase activity (Kim and Jeoung, 2008; Schutz et al., 2010). Similar to DDX43, the expression of DDX53 in normal tissues is restricted to testis, whereas it is highly expressed in various tumor tissues and cancer cell lines (Kim et al., 2010; Kim et al., 2017). Bioinformatic analysis suggests that DDX53 also contains a potential KH domain in its N-terminal region, in which the GXXG motif has the sequence GYSG (Figure 4A and B). Similar to DDX43, DDX53 also contains the additional conserved motifs Ic, Va and Vb along with the Q, I, Ia, Ib, II, III, IV, V and VI motifs (Figure 4B).
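The GXXG loops described above (GRGG in DDX43, GYSG in DDX53) can be located in a sequence with a simple pattern search. The following is only an illustrative sketch, not a method used in this thesis: the function name `find_gxxg` and the variable names are hypothetical, and the two short fragments are copied from the DDX43/DDX53 alignment in Figure 5B. A lookahead pattern is used so that overlapping candidates are not skipped.

```python
import re

# G, any two residues, G. The lookahead lets overlapping matches be reported.
GXXG = re.compile(r"(?=(G..G))")


def find_gxxg(sequence):
    """Return (0-based start, motif) for every GXXG match, overlaps included."""
    return [(m.start(), m.group(1)) for m in GXXG.finditer(sequence.upper())]


# Short fragments around the KH-domain loop, copied from the Figure 5B alignment.
ddx43_fragment = "FVGAVIGRGGSKIKNIQ"
ddx53_fragment = "MVGVVIGYSGSKIKDLQ"

print(find_gxxg(ddx43_fragment))
print(find_gxxg(ddx53_fragment))
```

A pattern match alone does not establish that a GXXG segment lies in a KH domain; it only flags candidate positions for inspection against the domain annotation.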

[Figure 4 image: (A) domain schematic of DDX53 showing the KH domain and the helicase domain with motifs Q, I, Ia, Ib, Ic, II, III, IV, V, Va, Vb and VI; (B) annotated primary sequence. The sequence text is not recoverable from the extracted figure.]

Figure 4. Domain diagram and primary protein sequence of human DDX53 protein. (A) A schematic representation of the domain arrangement in DDX53. The conserved SF2 helicase motifs are indicated by yellow boxes, and the corresponding ATPase/helicase core domain is indicated. The KH domain is indicated with pink. (B) Primary protein sequence of human DDX53 protein. KH domain and the conserved SF2 helicase motifs are indicated.

Phylogenetic analysis reveals that DDX43 and DDX53 helicases are closely related to each other and likely originated from a common ancestor (Figure 5A) (Umate et al., 2011). Aligning the DDX43 and DDX53 sequences using BLAST (Basic Local Alignment Search Tool) (https://blast.ncbi.nlm.nih.gov/Blast.cgi) shows that they have 62% sequence identity (Figure 5B). Furthermore, their KH domains are even more similar, showing 88% sequence identity (Figure 5C). These results suggest that DDX43 and DDX53 share a similar structure and likely have similar functions, as both are highly expressed in cancer tissues but not in normal tissues (discussed above).
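As a rough illustration of what a percent identity figure measures, the sketch below counts identical residues between two pre-aligned sequences. It is not the BLAST calculation reported above: the function name `percent_identity` is hypothetical, and the choice to ignore gapped columns in the denominator is an assumption (alignment tools differ on this point). The 16-residue fragments are copied from the gap-free region around the GXXG loop in the Figure 5B alignment, so the value printed applies to that fragment only.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Gap-free fragment around the GXXG loop, copied from the Figure 5B alignment.
ddx43_kh_fragment = "VGAVIGRGGSKIKNIQ"
ddx53_kh_fragment = "VGVVIGYSGSKIKDLQ"

# 11 of the 16 columns are identical in this fragment.
print(percent_identity(ddx43_kh_fragment, ddx53_kh_fragment))  # 68.75
```

Because the result depends on which region is aligned and how gaps are counted, values from a fragment like this are not comparable with the whole-protein or whole-domain figures quoted in the text.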

[Figure 5 panels: (A) phylogenetic tree; (B) pairwise alignment of full-length DDX43 (648 aa) and DDX53 (631 aa), with the KH domain and the helicase motifs Q, I, Ia, Ib, Ic, II, III, IV, V, Va, Vb and VI marked; (C) pairwise alignment of the DDX43 and DDX53 KH domains.]

Figure 5. Comparing sequences of DDX43 and DDX53 proteins. (A) Phylogenetic tree showing close relationship between these two proteins (inside red square). Modified from (Leitao et al., 2015). (B) Alignment of the primary sequences of DDX43 and DDX53 proteins. (C) Alignment of the KH domains of DDX43 and DDX53.

1.4 The KH domain and KH domain-containing human proteins

The K-homology (KH) domain was first biochemically characterized in the human heterogeneous nuclear ribonucleoprotein K (hnRNP K) (Grishin, 2001). The KH domain is the most commonly occurring RNA-binding module and consists of approximately 70 amino acid residues with a compact βααββα topology (Backe et al., 2005). The typical function of KH domains is to bind ssRNA or ssDNA in a sequence-specific manner. For example, NusA KH1 and KH2 and Nova KH3 bind ssRNA in adenine-rich regions (Beuth et al., 2005). hnRNP K and poly(C)-binding proteins (PCBPs) bind through their KH domains to single-stranded cytosine-rich RNA or DNA sequences with high specificity and affinity (Du et al., 2005; Yoga et al., 2012). The KH domain can exist as a single copy, as in Mer1p (Siomi et al., 1993) and Sam68 (Chen et al., 1997), or as multiple copies, as in the fragile X mental retardation protein, which contains two KH domains (Valverde et al., 2007), and hnRNP K, which contains three KH domains (Braddock et al., 2002).

[Figure 6 schematic: human KH domain-containing proteins drawn with their length, KH domain positions and function or associated disease, grouped by number of KH domains. 14 × KH: Vigilin. 4 × KH: FUBP1, IGF2BP1. 3 × KH: NOVA1, hnRNP K, PCBP1. 2 × KH: BICC1, FMRP, TDRKH, MEX3A. 1 × KH: ANKHD1, AKAP1, PNPT1, DDX43, SF1, DDX53, Sam68, ASCC1, KRR1, QKI, PNO1, DPPA5.]

Figure 6. Summary of KH domain-containing human proteins. A total of 39 human proteins contain KH domain(s). For paralogs such as MEX3A, MEX3B, MEX3C, and MEX3D; PCBP1, PCBP2, PCBP3, and PCBP4; IGF2BP1, IGF2BP2, and IGF2BP3, only one of them is shown.

Multiple copies of the KH domain are thought to provide greater affinity and specificity and to stabilize the protein-nucleic acid complex (Chen et al., 1997). Vigilin, a high-density lipoprotein-binding protein, has 14 KH domains (Dodson and Shapiro, 1997).


Similarly, Nova1 and PCBP1 each have three KH domains. Other KH domain-containing human proteins are listed in Figure 6. Mutations in KH domains have been found to severely impair the RNA-binding activity of the full-length proteins. A missense mutation of the conserved isoleucine (Ile304) to asparagine in one of the two KH domains of the FMR1 protein severely impairs the RNA-binding properties of the protein, possibly because a hydrophobic residue is replaced by a hydrophilic one (Siomi et al., 1994). In DDX43, site-directed mutagenesis of the first glycine in the conserved GXXG motif (G84D) reduced binding and unwinding activities on both RNA and DNA substrates compared with the wild-type full-length DDX43 protein (Talwar et al., 2017). Similarly, Gly227Asp or Gly227Ser mutations in the KH domain of the GLD-1 protein in C. elegans resulted in a loss of function due to electrostatic repulsion or steric interference between the substituted residue and the nucleic acid backbone (Lewis et al., 2000). A T-to-A transversion in the DNA coding for the KH domain of the Quaking RNA-binding protein (QKI) changes Val to Glu (Chénard and Richard, 2008), which results in the loss of RNA binding in quaking viable mouse models and causes defective yolk sac vascular remodeling, an enlarged heart and pericardial effusion. The in-frame deletion of exon 8, which encodes the KH1 domain of the fragile X mental retardation protein (FMRP), produces transcriptional alterations and attentional deficits in rat models (Golden et al., 2019).

1.4.1 Two types of KH domain

There are two versions of the KH domain, type I and type II. Type I is mostly found in eukaryotic proteins and type II is mostly found in prokaryotic proteins. The three-dimensional arrangement of the secondary structural elements differs between the two types of KH domain fold (Figure 7) (Grishin, 2001). The sequence and length of the variable loop vary among KH domains irrespective of whether they are type I or type II. All typical KH domains have the signature GXXG sequence, although this is altered or interrupted in divergent KH domains. KH domains of eukaryotic proteins that associate with nucleic acid ligands are mostly of type I, for example, KH1 and KH2 in FMRP (PDB entry 2QND) and hnRNP K (PDB entry 1KHM) (Grishin, 2001; Valverde et al., 2007). The bacterial protein NusA is the only currently known protein with a type II KH domain for which the structure in complex with its cognate nucleic acid ligand has been determined (PDB entries 2ATW and 2ASB) (Beuth et al., 2005). The single KH domain of the Sam68 protein is essential for its self-association into dimers or trimers; specifically, KH domain loops 1 and 4 are required for its oligomerization (Chen et al., 1997). The KH domain of Sam68 is also essential for homopolymeric RNA binding activity and mediates protein-RNA and protein-protein interactions. Generally, recognition of nucleic acids by KH domains is sequence specific, and a KH domain can accommodate four to five nucleic acid bases (Chen et al., 1997; Valverde et al., 2008; Dickey et al., 2013).

Figure 7. Secondary structure arrangement of type I and type II KH domains. Stylized representations of (A) the type I KH domain (eukaryotic) and (B) the type II KH domain (prokaryotic). The dotted line connecting the β2-strand and the β′-strand represents the variable loop. The white line connecting the α1-helix and the α2-helix represents the GXXG loop. Modified from (Valverde et al., 2007).

1.4.2 Interaction of type I KH domains with their specific nucleic-acid sequences

Since our study focuses on the human DDX43 and DDX53 helicases (Section 1.3), here we summarize what is known about the interactions between type I KH domains and nucleic acids. Multiple sequence alignment of type I KH domains reveals high sequence similarity in a small region spanning helix 1 and helix 2 (VIGXXGXXI) (Figure 8A). ssDNA or ssRNA binding occurs at the hydrophobic cleft formed by these two consecutive helices (1 and 2), the connecting GXXG loop, and the β-sheet at the inner surface together with the variable loop of the KH domain (Figure 9A) (Hollingworth et al., 2012). The type I KH domain fold has its three α-helices packed against the surface of a three-stranded antiparallel β-sheet and favors pyrimidines over purines (Braddock et al., 2002; Braddock et al., 2002; Hollingworth et al., 2012).
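The conserved VIGXXGXXI region can be located in a sequence with a simple pattern search. The sketch below is illustrative only: it uses the N-terminal portions of the DDX43 (from residue 47) and DDX53 (from residue 28) construct sequences given in Figure 11, and the function and variable names are hypothetical. The pattern is written strictly as V-I-G-X-X-G-X-X-I; other type I KH domains deviate from this consensus, so a stricter or looser pattern may be needed elsewhere.

```python
import re

# N-terminal portions of the construct sequences shown in Figure 11.
DDX43_FROM_47 = "RGGRWRGTSRPPEAVAAGHEELPLCFALKSHFVGAVIGRGGSKIKN"
DDX53_FROM_28 = "RGSGWSGPFGHQGPRAAGSREPPLCFKIKNNMVGVVIGYSGSKIKD"

# Strict reading of the VIGXXGXXI consensus ("." matches any residue).
TYPE_I_CORE = re.compile(r"VIG..G..I")


def locate_gxxg(sequence, first_residue_number):
    """Return (GXXG tetrapeptide, residue number of its first Gly), or None."""
    match = TYPE_I_CORE.search(sequence)
    if match is None:
        return None
    gly_index = match.start() + 2  # skip the leading "VI"
    return sequence[gly_index:gly_index + 4], first_residue_number + gly_index


print(locate_gxxg(DDX43_FROM_47, 47))  # ('GRGG', 84)
print(locate_gxxg(DDX53_FROM_28, 28))  # ('GYSG', 65)
```

The residue number obtained for DDX43 agrees with the G84 position of the first glycine of the GXXG motif mentioned in Section 1.4.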

[Figure 8 panels: (A) multiple sequence alignment of type I KH domains from hnRNP K (KH1-KH3), FMR-1 (KH1-KH2), NOVA (KH1-KH3), SF1, DDX43, KHDR1, DDX53, TDRKH, KSRP, PCBP2, AKAP149, ANKHD1, RS3, FUBP1, IF2B3, M3XC, Vigilin and ANKRD17, with the GXXG motif and the consensus line marked; (B) multiple sequence alignment of KSRP KH1, KSRP KH4 and DDX43-KH-89.]

Figure 8. Multiple sequence alignment of DDX43 KH domain with other protein KH domains. (A) Multiple sequence alignment of type-I KH domains in different proteins showing sequence similarity and conserved VIGXXGXXI region. (B) Multiple sequence alignment between KSRP KH1, KSRP KH4 and DDX43 KH showing conserved GRGG-loop region.


The KH domains recognize a sequence of 4-5 unpaired bases with different affinity and specificity (Braddock et al., 2002; Valverde et al., 2008; Hollingworth et al., 2012). KSRP KH1 and KH4 carry the same conserved "GRGG" loop residues as the DDX43 KH domain. Multiple sequence alignment shows that the DDX43 KH domain has 32% and 33% sequence identity with the KH1 and KH4 domains of KSRP, respectively (Figure 8B). These KH domains in KSRP bind to AU-rich sequences (Hollingworth et al., 2012).

[Figure 9 panels A-C. Labels: KSRP KH3 domain hydrophobic groove; AGGGU RNA; G3 forming an H-bond with L368; G4 forming H-bonds with F358, I356 and Q349; legend: hydrophobic bonds, hydrogen bonds.]

Figure 9. Structure of KSRP KH3 domain in complex with AGGGU (PDB: 4B8T). (A) The hydrophobic amino acids binding groove is in yellow and the bound RNA is color coded. (B) Structure highlighting the hydrogen bonds between RNA G3 (top) and G4 (bottom) and the protein. (C) Protein-RNA contacts observed in the KH3– AGGGU structure. Hydrogen bonds are in red. Modified from (Nicastro et al., 2012).

Together, the KSRP KH domains bind an RNA ligand with much higher affinity than any one of them alone (Valverde et al., 2008). Each KH domain shows a different sequence preference despite high structural similarity and conserved amino acid sequences. A key determinant of this sequence specificity is the set of five amino acid residues located N-terminal to the GXXG motif, which discriminate the first two bases of the ssDNA or ssRNA tetrad (Braddock et al., 2002). In addition, the ssDNA or ssRNA substrate adopts a conformation suitable for protein-ssDNA/RNA interactions (Braddock et al., 2002).


Nucleotide recognition and binding by the KH domain involve H-bonding of the bases and the phosphate backbone with the side chains of the protein. The GXXG loop docks the phosphate groups of the first two nucleotides by means of electrostatic interactions, intermolecular van der Waals interactions and H-bonding (Figure 9B and C). The Watson-Crick edges of the bases point towards the second β-strand of the type I KH domain (the third β-strand for type II domains) and form two H-bonds with its backbone amide and carbonyl groups (Hollingworth et al., 2012; Nicastro et al., 2015). This type of binding holds the nucleic acid in an extended conformation along the hydrophobic groove of the KH domain (Figure 9) (Braddock et al., 2002; Hollingworth et al., 2012).

1.4.3 Structure of KH domains bound to specific nucleic acid sequences

The number of known KH domain structures in complex with their cognate nucleic acid ligands is relatively small. Examples include the solution structure of KH3 of hnRNP K bound to ssDNA (Figure 10A), the X-ray structure of the KH3 domain of Nova-2 bound to an in vitro-selected stem-loop RNA (SELEX RNA) (Figure 10B), the crystal structure of the KH1 domain from a poly(C)-binding protein (PCBP) with ssDNA (Figure 10C and D), the solution structure of the FBP KH3-KH4 domains bound to ssDNA (Figure 10E and F), and several others (Sidiqi et al., 2005; Valverde et al., 2008). Little or no structural change is observed in KH domains between the ligand-free state and the ligand-bound complex (Valverde et al., 2008). The binding Kd of KH domains for ssDNA or ssRNA ranges from nanomolar to micromolar, and a Kd value of >>1 mM indicates very poor binding and is considered functionally irrelevant in protein-RNA interaction assays (Valverde et al., 2008; Hollingworth et al., 2012). For example, the hnRNP K KH3 domain binds to 5′-dTCCC with a Kd of 3 µM, and the wild-type KSRP KH domains together bind to a 42-mer TNF ARE (AU-rich element) RNA with a Kd of 3.6 ± 1.4 nM (Braddock et al., 2002; Valverde et al., 2008; Hollingworth et al., 2012). A few other examples of KH domains, their specific nucleic acid sequences and their Kd values are summarized in Table 1. The KH domain is an evolutionarily conserved domain with the GXXG motif, which is present in many RNA-binding proteins but has not been reported in helicases. Moreover, the mechanism and biological function of the KH domain in both DDX43 and DDX53 remain unknown.
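To give a feel for what these Kd values mean in practice, the sketch below evaluates the fraction of protein bound for a simple 1:1 binding model, fraction bound = [L] / (Kd + [L]). This is an illustrative assumption, not the fitting procedure used in the cited studies: it presumes a single binding site and free ligand approximately equal to total ligand. The 10 µM ligand concentration and the function name are hypothetical choices for the example.

```python
def fraction_bound(ligand_molar: float, kd_molar: float) -> float:
    """Fraction of protein bound for 1:1 binding with ligand in excess.

    Assumes free ligand is approximately equal to total ligand.
    """
    if ligand_molar < 0 or kd_molar <= 0:
        raise ValueError("ligand must be >= 0 and Kd must be > 0")
    return ligand_molar / (kd_molar + ligand_molar)


LIGAND = 10e-6  # hypothetical 10 µM oligonucleotide concentration

# Kd values quoted in the text, plus the 1 mM "very poor binding" threshold.
for label, kd in [
    ("KSRP KH1-4 / TNF ARE, Kd = 3.6 nM", 3.6e-9),
    ("hnRNP K KH3 / dTCCC, Kd = 3 µM", 3e-6),
    ("poor binder, Kd = 1 mM", 1e-3),
]:
    print(f"{label}: {fraction_bound(LIGAND, kd):.3f} bound")
```

Under these assumptions, a nanomolar binder is essentially saturated at 10 µM ligand, whereas a millimolar binder is about 1% occupied, which is why the latter is treated as functionally irrelevant.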


Figure 10. Structure of KH domains bound to specific nucleic acid sequences. (A) Solution structure of the third KH domain of hnRNP K recognizing the tetrad sequence 5′-dTCCC (purple sticks; PDB: 1J5K). (B) Crystal structure of Nova-2 KH3 bound to the tetranucleotide sequence 5′-UCAC (blue sticks; PDB: 1EC6). Crystal structure of the first KH domain from PCBP-2 in complex with the tetrad sequence 5′-dCCCT (C) and 5′-dACCC (D) (PDB: 2PQU and 2AXY, respectively). Solution structure of the FBP KH3-KH4 domains, with KH3 bound to 5′-dTTTT (E) and KH4 recognizing ssDNA 5′-ATTC (F). Modified from (Valverde et al., 2008).

Table 1. KH domains with their specific sequences and Kd values

KH domain | GXXG sequence | Specific nucleic acids bound by the KH domain | Kd value | Method used to determine Kd | Reference
KH-QUA2, Quaking (QKI) proteins | GPRG | 5´-UAUACUAACAA | 1 µM | Isothermal titration calorimetry (ITC) | Liu, Z. (2001)
hnRNP K KH3 | GKGG | 5´ d-TCCC | ∼3 ± 1.5 µM | NMR spectroscopy | Braddock et al., 2002
FBP (FUSE binding protein) KH3 | GRNG | 5´ d-TTTT; 5´ d-ATTTTT; 3´-AUUUUU and 3´-ATTUTT | ∼10 nM; ∼3 µM | NMR spectroscopy | Braddock et al., 2002
FBP KH4 | GKGG | 5´ d-ATTC; 5´ d-TATTCCC; 5´-UAUUCCC; 5´-TAUUCCC | ∼10 nM; ∼12 µM | NMR spectroscopy | Braddock et al., 2002
αCP1, member of poly(C) binding protein, KH3 | GRQG | 5´-CUCUCCUUUCUUUUUCUUCUUCCCUCCCUA-3´ | 4.37 µM | Surface plasmon resonance (SPR) | Sidiqi et al., 2005
KSRP KH1 | GRGG | 5´-UAUUUA | >1000 µM | NMR spectroscopy | Hollingworth et al., 2012
KSRP KH2 | GKGG | 5´-UAUUUA | 390 ± 50 µM | NMR spectroscopy | Hollingworth et al., 2012
KSRP KH3 | GRSG | 5´-UAUUUA | 140 ± 20 µM | NMR spectroscopy | Hollingworth et al., 2012
KSRP KH4 | GRGG | 5´-UAUUUA | 350 ± 30 µM | NMR spectroscopy | Hollingworth et al., 2012

2. HYPOTHESIS AND OBJECTIVES

2.1 Hypothesis: We hypothesize that the KH domains are not only essential for the binding of DDX43 and DDX53 to nucleic acids, but also regulate their substrate specificity.

2.2 Objectives:
1) To analyze the structure of the KH domain by X-ray crystallography, NMR spectroscopy and molecular modeling.
2) To characterize nucleic acid binding by the KH domain using biochemical methods.
3) To identify the nucleic acids bound by the KH domain using systematic evolution of ligands by exponential enrichment (SELEX) and ChIP/CLIP-Seq methods.


3. MATERIALS AND METHODS

3.1 Reagent list

Table 2. List of reagents, catalog numbers and suppliers

Reagent, Cat. No. | Supplier | Address
2-Log DNA ladder, N3200 | NEB | Lawrenceville, Georgia, USA
3XFLAG peptide, A6001 | APExBIO | Houston, Texas, USA
Acrylamide, 0314 | AMRESCO | North York, Ontario, Canada
N,N'-Methylenebisacrylamide, AC164790250hrp | Fisher Scientific | Madison, Wisconsin, USA
Adenosine Triphosphate (ATP), A1852 | Sigma-Aldrich | Oakville, Ontario, Canada
Adenosine Diphosphate (ADP), A2754 | Sigma-Aldrich | Oakville, Ontario, Canada
Adenylyl-imidodiphosphate (AMP-PNP), A2647 | Sigma-Aldrich | Oakville, Ontario, Canada
Clarity Max Western ECL substrate, 1705062 | Bio-Rad Laboratories, Inc. | Mississauga, Ontario, Canada
Adenosine 5′-[γ-thio] triphosphate (ATP-γ-S), ALX-480-066-M005 | Enzo Life Sciences | Brockville, Ontario, Canada
Agarose, 5510UB | Gibco | Burlington, Ontario, Canada
Ampicillin, A1593 | Sigma-Aldrich | Oakville, Ontario, Canada
Ammonium persulfate (APS), A3678 | Sigma-Aldrich | Oakville, Ontario, Canada
BME Vitamins 100X, B6891 | Sigma-Aldrich | Oakville, Ontario, Canada
Boric Acid, SLBC5554V | Sigma-Aldrich | Oakville, Ontario, Canada
Bovine serum albumin, A2058 | Sigma-Aldrich | Oakville, Ontario, Canada
Bradford protein assay reagent, 500-0013 | Bio-Rad | Hercules, California, USA
Bromophenol blue, B0126 | Sigma-Aldrich | Oakville, Ontario, Canada
Chloramphenicol, AC227920250 | Fisher Scientific | Madison, Wisconsin, USA
Dithiothreitol (DTT), 10197777001 | Sigma-Aldrich | Oakville, Ontario, Canada
DNase I, 11284932001 | Sigma-Aldrich | Oakville, Ontario, Canada
dNTPs, N0447 | NEB | Mississauga, Ontario, Canada
Dulbecco's Modified Eagle Medium (DMEM), SH30022.01 | Sigma-Aldrich | Oakville, Ontario, Canada
EDTA, 17892 | Fisher Scientific | Madison, Wisconsin, USA
ANTI-FLAG® M2 Affinity Gel, A2220 | Sigma-Aldrich | Oakville, Ontario, Canada
Fetal Bovine Serum (FBS), F6178 | Sigma-Aldrich | Oakville, Ontario, Canada
Formic Acid, F0507 | Sigma-Aldrich | Oakville, Ontario, Canada
Glycine, BP381-5 | Fisher Scientific | Madison, Wisconsin, USA
Glycerol, 123170 | Fisher Scientific | Madison, Wisconsin, USA
Imidazole, O3196-500 | Fisher Scientific | Madison, Wisconsin, USA
Isopropyl β-D-1-thiogalactopyranoside (IPTG), 15528019 | Fisher Scientific | Madison, Wisconsin, USA
Kanamycin, 60615 | Fisher Scientific | Madison, Wisconsin, USA
Lithium chloride (LiCl), AM9480 | Fisher Scientific | Madison, Wisconsin, USA
Lysogeny Broth (LB) with agar, L2897 | Sigma-Aldrich | Oakville, Ontario, Canada
Methanol, 154246 | Sigma-Aldrich | Oakville, Ontario, Canada
Magnesium Chloride (MgCl2), M8266 | Sigma-Aldrich | Oakville, Ontario, Canada
Monoclonal ANTI-FLAG M2 Peroxidase (HRP) antibody, A8592 | Sigma-Aldrich | Oakville, Ontario, Canada
Ammonium Chloride (15NH4Cl), NLM-467-5 | Cambridge Isotope Laboratories, Inc. | Tewksbury, Massachusetts, USA
3-(N-morpholino)propanesulfonic acid (MOPS), 1132-61-2 | Fisher Scientific | Madison, Wisconsin, USA
Nickel-NTA Affinity beads, 70666 | Sigma-Aldrich | Oakville, Ontario, Canada
Non-fat dry milk, 170-6404 | Sigma-Aldrich | Oakville, Ontario, Canada
Nonidet P-40, 74385 | Sigma-Aldrich | Oakville, Ontario, Canada
N,N,N',N'-Tetramethylethylenediamine (TEMED), 87689 | Bio-Rad | Hercules, California, USA
Penicillin-Streptomycin, 084M4778V | Sigma-Aldrich | Oakville, Ontario, Canada
Polyethyleneimine (PEI) MAX, 24765 | Polysciences, Inc. | Warrington, Pennsylvania
Phenylmethylsulfonyl fluoride (PMSF), P7626 | Sigma-Aldrich | Oakville, Ontario, Canada
Protein standard, 161-0374 | Bio-Rad | Hercules, California, USA
Protease inhibitor, 05892791001 | Roche | Mannheim, Germany
Polyvinylidene difluoride (PVDF), 10600023 | GE Healthcare | Mississauga, Ontario, Canada
RNase A, R6513 | Sigma-Aldrich | Oakville, Ontario, Canada
Sodium dodecyl sulfate (SDS), L3771 | Sigma-Aldrich | Oakville, Ontario, Canada
Sodium chloride (NaCl), S671-10 | Fisher Scientific | Madison, Wisconsin, USA
sparQ PureMag Beads, 95196 | Quantabio | Beverly, Massachusetts, USA
T4 DNA ligase, M0202 | NEB | Lawrenceville, Georgia, USA
T4 polynucleotide kinase, M0201S | NEB | Mississauga, Ontario, Canada
Thiamine Hydrochloride, T4625 | Sigma-Aldrich | Oakville, Ontario, Canada
Thrombin Protease, GE27-0846-01 | GE Healthcare | Mississauga, Ontario, Canada
Tris, 0826 | AMRESCO | North York, Ontario, Canada
Tris(2-carboxyethyl)phosphine, C4706 | Sigma-Aldrich | Oakville, Ontario, Canada
Triton™ X-100, X100 | Sigma-Aldrich | Oakville, Ontario, Canada
Trypsin-EDTA, T4049 | Sigma-Aldrich | Oakville, Ontario, Canada
Trypsin, 85450C | Sigma-Aldrich | Oakville, Ontario, Canada
Tryptone, TRP402.205 | BioShop | Burlington, Ontario, Canada
TWEEN 20, BP337 | Fisher Scientific | Madison, Wisconsin, USA
Q5 Hot Start High-Fidelity DNA Polymerase, M0493 | NEB | Mississauga, Ontario, Canada
Western Blotting detection reagent, RPN2232 | GE Healthcare | Little Chalfont, UK
Xylene cyanol, X4126 | Sigma-Aldrich | Oakville, Ontario, Canada

3.2 Plasmid DNA and mutagenesis

Human DDX43 and DDX53 cDNA clones were purchased from the SPARC BioCentre, the Hospital for Sick Children, Toronto, Canada. The DDX43 gene (full-length, N-terminal domain, C-terminal helicase domain, or KH domain) was PCR amplified and cloned into the NdeI and XhoI sites of a pET28a vector (Novagen). DDX43 KH domain constructs (74 aa (Pro69-Glu142), 80 aa (Thr54-Lys133), 89 aa (Thr54-Glu142) and 126 aa (Arg47-Asp172)) (Figure 11) and DDX43-KH-RecA1 were PCR amplified using the primers listed in Table 3 and cloned into the NdeI and XhoI sites of a pET28a vector. Similarly, DDX53 KH domain constructs (73 aa (Pro50-Glu122) and 125 aa (Arg28-Asn152)) (Figure 11) and DDX53-KH-RecA1 were cloned into the NdeI and XhoI sites of a pET28a vector from the full-length DDX53 gene plasmid. The mutants A81I, A81D, A81S, A81G, G87I and G87D were generated with the QuikChange Site-Directed Mutagenesis Kit (Agilent Technologies) using the primers listed in Table 3.
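As a sanity check on mutagenic primers of this kind, the encoded substitution can be read back by translating the primer and comparing it with the wild-type protein segment. The sketch below does this for three forward primers from Table 3 (DDX43-A81I-F-new, DDX43-A81G-F and DDX43-A81S-F). The reading frame (offset 1) and the wild-type segment Lys75-Gly87 were inferred from the DDX43 sequence in Figure 11, and all function names are hypothetical; this is not part of the QuikChange protocol itself.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate complete codons of `dna` starting at offset `frame`."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3)
    )


def substitution(wild_type: str, mutant: str, first_residue: int) -> str:
    """Name the single amino acid change, e.g. 'A81I'."""
    changes = [
        f"{w}{first_residue + n}{m}"
        for n, (w, m) in enumerate(zip(wild_type, mutant))
        if w != m
    ]
    if len(changes) != 1:
        raise ValueError(f"expected exactly one change, found {changes}")
    return changes[0]


WILD_TYPE_75_87 = "KSHFVGAVIGRGG"  # DDX43 Lys75-Gly87 (from Figure 11)
PRIMERS = {
    "DDX43-A81I-F-new": "GAAGAGCCACTTTGTTGGCATCGTAATCGGTCGTGGTGGGTC",
    "DDX43-A81G-F": "GAAGAGCCACTTTGTTGGCGGGGTAATCGGTCGTGGTGGGTC",
    "DDX43-A81S-F": "GAAGAGCCACTTTGTTGGCTCGGTAATCGGTCGTGGTGGGTC",
}

for name, seq in PRIMERS.items():
    peptide = translate(seq, frame=1)  # frame inferred, see lead-in
    print(name, peptide, substitution(WILD_TYPE_75_87, peptide, 75))
```

Each primer should differ from the wild-type segment at residue 81 only, which is a quick way to catch a primer that has been mislabelled or ordered in the wrong frame.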

DDX43 KH domain constructs

47-RGGRWRGTSRPPEAVAAGHEELPLCFALKSHFVGAVIGRGGSKIKNIQSTTNTTIQIIQEQPESLVKIFGSKAMQTKAKAVIDNFVKKLEENYNSECGIDTAFQPSVGKDGSTDNNVVAGDRPLID-172

KH-74: His-DDX43-KH (74 a.a.), residues 69-142
KH-126: His-DDX43-KH (126 a.a.), residues 47-172
KH-89: His-DDX43-KH (89 a.a.), residues 54-142
KH-80: His-DDX43-KH (80 a.a.), residues 54-133

DDX53 KH domain constructs

28-RGSGWSGPFGHQGPRAAGSREPPLCFKIKNNMVGVVIGYSGSKIKDLQHSTNTKIQIINGESEAKVRIFGNREMKAKAKAAIETLIRKQESYNSESSVDNAASQTPIGRNLGRNDIVGEAEPLSN-152

KH-73: His-DDX53-KH (73 a.a.), residues 50-122
KH-125: His-DDX53-KH (125 a.a.), residues 28-152

Figure 11. Cartoon diagram depicting the DDX43 and DDX53 KH domain constructs along with their specific primary sequences. The color of the square brackets corresponds to the fragment length of each construct.

DDX43-KH-89aa and DDX53-KH-73aa were PCR amplified using the primers listed in Table 3 and cloned into the HindIII and XhoI sites of the pcDNA3.0 vector (Invitrogen) containing 3xFLAG. All plasmids were verified by DNA sequencing.
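The cloning primers can be checked in a similar way before ordering. The minimal sketch below takes the DDX43-KH-89aa primer pair from Table 3, confirms that each primer carries the intended restriction site, and reads the reverse primer on the coding strand to show the stop codon placed immediately upstream of the XhoI site. The recognition sequences CATATG (NdeI) and CTCGAG (XhoI) are the standard ones; the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
SITES = {"NdeI": "CATATG", "XhoI": "CTCGAG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def codon_before_site(reverse_primer: str, site: str) -> str:
    """Codon directly upstream of `site` on the coding strand."""
    coding = reverse_complement(reverse_primer)
    position = coding.index(site)  # raises ValueError if site is absent
    return coding[position - 3:position]


# Primer pair for DDX43-KH-89aa in pET28a (Table 3).
forward = "ATATCATATGACCTCTAGGCCCCCGGAGGCC"  # DDX43-KH89aa-NdeI-F
reverse = "CATGCTCGAGTCATTCTGAATTGTAATTTTC"  # DDX43-KH89-XhoI-R

print("NdeI site in forward primer:", SITES["NdeI"] in forward)
print("XhoI site in reverse primer:", SITES["XhoI"] in reverse)
print("Coding-strand view of reverse primer:", reverse_complement(reverse))
print("Codon before XhoI site:", codon_before_site(reverse, SITES["XhoI"]))
```

XhoI's recognition sequence is palindromic, so it reads the same on either strand, which is why the site can be searched for directly in the reverse primer as written.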

3.3 RNA and DNA substrates

PAGE-purified oligonucleotides were purchased from Integrated DNA Technologies (IDT) and are listed in Table 4. RNA and DNA substrates were prepared as described (Guo et al., 2016). Briefly, for each substrate, a single oligonucleotide was 5′-end-labeled with [γ-32P]ATP using T4 polynucleotide kinase (NEB) at 37°C for 1 h. Unincorporated radionucleotides were removed with a G25 chromatography column (GE Healthcare). Single-stranded DNA or RNA substrates were stored at 4°C until use. For the double-stranded DNA substrate, a [γ-32P]ATP-labeled oligonucleotide was annealed to a 2.5-fold excess of the unlabeled complementary strand in annealing buffer (10 mM Tris–HCl, pH 7.5, and 50 mM NaCl) by heating at 95°C for 6 min and then cooling slowly to room temperature. For double-stranded RNA, a [γ-32P]ATP-labeled oligonucleotide was annealed to a 2.5-fold excess of the unlabeled complementary strand in annealing buffer (10 mM MOPS, pH 6.5, 1 mM EDTA, and 50 mM KCl) by heating at 95°C for 6 min and then cooling slowly to room temperature. All double-stranded substrates were purified by PAGE isolation, and their concentrations were determined by liquid scintillation counting before use.

Table 3. List of primers used in the study

Primer | Sequence (5′-3′) | Use in this study
DDX43-KH (74)-F-NdeI | ACGTCATATGCCGCTGTGTTTTGCTTTGAAG | Forward primer to PCR amplify DDX43-KH-74aa for cloning in NdeI site of pET28a vector
DDX43-KH (74)-R-XhoI | GCATCTCGAGTTCTGAATTGTAATTTTCTTC | Reverse primer to PCR amplify DDX43-KH-74aa for cloning in XhoI site of pET28a vector
DDX43-KH-74-End-F | GAAAATTACAATTCAGAATGACTCGAGCACCACCAC | Forward primer to PCR amplify DDX43-KH-74aa with stop codon insertion before C-terminus 6xHis-tag
DDX43-KH-74-End-R | GTGGTGGTGCTCGAGTCATTCTGAATTGTAATTTTC | Reverse primer to PCR amplify DDX43-KH-74aa with stop codon insertion before C-terminus 6xHis-tag
DDX43-KH126aa-NdeI-F | AGTCCATATGAGAGGTGGTCGCTGGAGAGGC | Forward primer to PCR amplify DDX43-KH-126aa for cloning in NdeI site of pET28a vector
DDX43-KH126aa-XhoI-R | ACTGCTCGAGTCAATCTATCAATGGCCGATC | Reverse primer to PCR amplify DDX43-KH-126aa for cloning in XhoI site of pET28a vector
DDX43-KH89aa-NdeI-F | ATATCATATGACCTCTAGGCCCCCGGAGGCC | Forward primer to PCR amplify DDX43-KH-89aa for cloning in NdeI site of pET28a vector
DDX43-KH89-XhoI-R | CATGCTCGAGTCATTCTGAATTGTAATTTTC | Reverse primer to PCR amplify DDX43-KH-89aa for cloning in XhoI site of pET28a vector
DDX43-KH80aa-XhoI-R_new | CATGCTCGAGTCATTTAACAAAATTGTCTAT | Reverse primer to PCR amplify DDX43-KH-80aa for cloning in XhoI site of pET28a vector
DDX43-KH-Elastase-XhoI-R | CGTACTCGAGTCATTGGAATGCAGTATCAAT | Reverse primer to PCR amplify DDX43-KH domain fragment identified after treatment with elastase protease and mass spectrometry for cloning in XhoI site of pET28a vector
DDX53-KH(73aa)-NdeI-F-new | CTGACATATGCCACTCTGCTTTAAAATAAAG | Forward primer to PCR amplify DDX53-KH-73aa for cloning in NdeI site of pET28a vector
DDX53-KH(73aa)-XhoI-R-New | ACGTCTCGAGTCATTCTGAGTTGTAGCTTTCTTG | Reverse primer to PCR amplify DDX53-KH-73aa for cloning in XhoI site of pET28a vector
DDX53-KH-125aa-NdeI-F | CATCCATATGAGAGGCAGTGGCTGGAGTGGC | Forward primer to PCR amplify DDX53-KH-125aa for cloning in NdeI site of pET28a vector
DDX53-KH-125aa-XhoI-R | CATGCTCGAGTCAATTTGACAATGGCTCAGC | Reverse primer to PCR amplify DDX53-KH-125aa for cloning in XhoI site of pET28a vector
DDX43-KH-RecA1-XhoI-R | CGTACTCGAGTCAAACATAGACAATCATTGG | Reverse primer to PCR amplify DDX43 KH and RecA1 domains for cloning in XhoI site of pET28a vector
DDX53-KH-RecA1-XhoI-R | CATGCTCGAGTCAAACATAAACAATCATAGG | Reverse primer to PCR amplify DDX53 KH and RecA1 domains for cloning in XhoI site of pET28a vector
DDX43_72-138-NdeI-F | CATCCATATGTTTGCTTTGAAGAGCCAC | Forward primer to PCR amplify DDX43-KH-66aa for cloning in NdeI site of pET28a vector
DDX43_72-138-XhoI-R | CGTACTCGAGTCAATTTTCTTCTAGCTTTTT | Reverse primer to PCR amplify DDX43-KH-66aa for cloning in XhoI site of pET28a vector
DDX43_72-452-XhoI-R | CGTACTCGAGTCATGTACCAACATAGACAAT | Reverse primer to PCR amplify DDX43 KH and RecA1 domain (380 aa) for cloning in XhoI site of pET28a vector
DDX43_72-635-XhoI-R | CGTACTCGAGTCATCTTTCCATTTCCCTTTT | Reverse primer to amplify DDX43 KH, RecA1 and RecA2 domain (563 aa) for cloning in XhoI site of pET28a vector
DDX43_72-612-XhoI-R | CGTACTCGAGTCAACTCTGATTTGCTCTTTC | Reverse primer to amplify DDX43 KH, RecA1 and RecA2 domain (540 aa) for cloning in XhoI site of pET28a vector
DDX53_38-118-NdeI-F | CATCCATATGCATCAGGGACCGAGAGCA | Forward primer to PCR amplify DDX53-KH-80aa for cloning in NdeI site of pET28a vector
DDX53_38-118-XhoI-R | CGTACTCGAGTCAGCTTTCTTGTTTTCTAAT | Reverse primer to PCR amplify DDX53-KH-80aa for cloning in XhoI site of pET28a vector
DDX53_38-434-XhoI-R | CGTACTCGAGTCAATTCAGATTACCAACATA | Reverse primer to PCR amplify DDX53 KH and RecA1 for cloning in XhoI site of pET28a vector
DDX53_38-609-XhoI-R | CGTACTCGAGTCATTGTTGATTTAACTTGTA | Reverse primer to PCR amplify DDX53 KH, RecA1 and RecA2 for cloning in XhoI site of pET28a vector
3xFLAG-F | AGCTTATGGACTACAAAGACCATGACGGTGATTATAAAGATCATGACATCGATTACAAGGATGACGATGACAAGTGAG | 3xFLAG coding sequence. Annealed with 3XFLAG-R and cloned into HindIII and EcoRI sites of pcDNA3.0 vector
3XFLAG-R | AATTCTCACTTGTCATCGTCATCCTTGTAATCGATGTCATGATCTTTATAATCACCGTCATGGTCTTTGTAGTCCATA | 3xFLAG coding sequence. Annealed with 3xFLAG-F and cloned into HindIII and EcoRI sites of pcDNA3.0 vector
pcDNA3-DDX43-KH89-Hind3-F | ACTGAAGCTTGCCGCCACCATGACCTCTAGGCCCCCGGAG | Forward primer to PCR amplify DDX43-KH-89aa for cloning in HindIII site of pcDNA3 vector
pcDNA3-FLAG-DDX43-KH89-Xho1-R | ACTACTCGAGTCACTTGTCATCGTCATCCTTGTAATCGATGTCATGATCTTTATAATCACCGTCATGGTCTTTGTAGTCTTCTGAATTGTAATTTTCTTC | Reverse primer to PCR amplify DDX43-KH-89aa for cloning in XhoI site of pcDNA3 vector
pcDNA3DDX53-KH73-Hind3-F | ACTGAAGCTTGCCGCCACCATGCCACTCTGCTTTAAAATA | Forward primer to PCR amplify DDX53-KH-73aa for cloning in HindIII site of pcDNA3 vector
pcDNA3-FLAG-DDX53-KH73-XhoI-R | ACTACTCGAGTCACTTGTCATCGTCATCCTTGTAATCGATGTCATGATCTTTATAATCACCGTCATGGTCTTTGTAGTCTTCTGAGTTGTAGCTTTCTTG | Reverse primer to PCR amplify DDX53-KH-73aa for cloning in XhoI site of pcDNA3 vector
DDX43-A81I-F | AGCCACTTTGTTGGCATCGTAATCGGTCGTGGT | Forward primer for site-directed mutagenesis of DDX43-A81I
DDX43-A81I-R | ACCACGACCGATTACGATGCCAACAAAGTGGCT | Reverse primer for site-directed mutagenesis of DDX43-A81I
DDX43-G87D-F | GTAATCGGTCGTGGTGACTCAAAAATAAAGAAT | Forward primer for site-directed mutagenesis of DDX43-G87D
DDX43-G87D-R | ATTCTTTATTTTTGAGTCACCACGACCGATTAC | Reverse primer for site-directed mutagenesis of DDX43-G87D
DDX43-A81I-F-new | GAAGAGCCACTTTGTTGGCATCGTAATCGGTCGTGGTGGGTC | Forward primer for site-directed mutagenesis of DDX43-A81I
DDX43-A81I-R-new | GACCCACCACGACCGATTACGATGCCAACAAAGTGGCTCTTC | Reverse primer for site-directed mutagenesis of DDX43-A81I
DDX43-A81G-F | GAAGAGCCACTTTGTTGGCGGGGTAATCGGTCGTGGTGGGTC | Forward primer for site-directed mutagenesis of DDX43-A81G
DDX43-A81G-R | GACCCACCACGACCGATTACCCCGCCAACAAAGTGGCTCTTC | Reverse primer for site-directed mutagenesis of DDX43-A81G
DDX43-A81S-F | GAAGAGCCACTTTGTTGGCTCGGTAATCGGTCGTGGTGGGTC | Forward primer for site-directed mutagenesis of DDX43-A81S
DDX43-A81S-R | GACCCACCACGACCGATTACCGAGCCAACAAAGTGGCTCTTC | Reverse primer for site-directed mutagenesis of DDX43-A81S

Restriction digestion sites are underlined.

Table 4. List of RNA and DNA substrates used in the study

Substrate name | Structure or description | Nucleotide sequence (5′→3′)
ssDNA (30 mer) | Used as ssDNA in EMSA | DNA 30 mer: GAGCTACCAGCTACCCCGTATGTCAGAGAG
Fork dsDNA (30 bp) | 30-bp duplex with 15-nt single-stranded arms (schematic) | Fork 30/15-T: TTTTTTTTTTTTTTTGGTGATGGTGTATTGAGTGGGATGCATGCA; Fork 30/15-B: TGCATGCATCCCACTCAATACACCATCACCTTTTTTTTTTTTTTT
Blunt-end dsDNA (30 bp) | DNA 30 mer annealed to DNA 30 mer comp (schematic) | DNA 30 mer: GAGCTACCAGCTACCCCGTATGTCAGAGAG; DNA 30 mer comp: CTCTCTGACATACGGGGTAGCTGGTAGCTC
ssRNA (30 mer) | Used as ssRNA in EMSA | RNA-30-mer: GAGCUACCAGCUACCCCGUAUGUCAGAGAG
30-bp blunt-end dsRNA | RNA 30 mer annealed to its complement (schematic) | RNA-30-mer: GAGCUACCAGCUACCCCGUAUGUCAGAGAG; RNA-30mer-comp: CUCUCUGACAUACGGGGUAGCUGGUAGCUC
Forked dsRNA (16 bp) | 16-bp duplex (16A/16B) with 25-nt single-stranded arms (schematic) | RNA-41B: AAAACAAAACAAAACAAAACAAAAUAGCACCGUAAAGACGC; RNA-41A: GCGUCUUUACGGUGCUUAAAACAAAACAAAACAAAACAAAA
dT8 | Used as ssDNA in setting crystallization screen with protein | TTTTTTTT
dT10 | Used as ssDNA in NMR | TTTTTTTTTT
dTn | Used as ssDNA in EMSA | n = 5, 6, 7, 10, 15, 20, 25 and 30
dA30 | Used as ssDNA in EMSA | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
dC30 | Used as ssDNA in EMSA | CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
rU30 | Used as ssRNA in EMSA | UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
dT5 | Used as ssDNA in NMR | TTTTT
dA5 | Used as ssDNA in NMR | AAAAA
dC5 | Used as ssDNA in NMR | CCCCC
dG5 | Used as ssDNA in NMR | GGGGG
rU5 | Used as ssRNA in NMR | UUUUU
MY_CTCTC | Used as ssDNA in NMR | CTCTC
MY_CACAC | Used as ssDNA in NMR | CACAC
MY_ATATA | Used as ssDNA in NMR | ATATA
MY_CGCGC | Used as ssDNA in NMR | CGCGC
MY_ACCAC | Used as ssDNA in NMR | ACCAC
MY_ATTAT | Used as ssDNA in NMR | ATTAT
MY_AGAGA | Used as ssDNA in NMR | AGAGA
MY_GTTGT | Used as ssDNA in NMR | GTTGT
MY_GTTTG | Used as ssDNA in NMR | GTTTG
MY_GGTTG | Used as ssDNA in NMR | GGTTG
MY_TGTGT | Used as ssDNA in NMR | TGTGT


3.4 Expression and purification of recombinant DDX43-KH and DDX53-KH domain proteins

The plasmid pET28a-DDX43-KH or pET28a-DDX53-KH was transformed into E. coli BL21(DE3)pLysS cells. Recombinant 6xHis-tagged proteins were purified in two steps using Nickel Affinity beads (Sigma) and a Sephacryl S-100 HR 16/60 gel filtration column (GE Healthcare). The transformed E. coli BL21(DE3)pLysS cells harboring the recombinant gene were grown at 37°C in LB medium containing 30 μg/mL kanamycin and 34 μg/mL chloramphenicol until the A600 reached 0.8-1.0, and expression was then induced with 0.5 mM IPTG overnight at 18°C. The cells were harvested by centrifugation at 5000 g for 10 min at 4°C and stored at -80°C until use. The cell suspension was lysed by sonication in buffer A (25 mM Tris pH 8.0, 0.25 M NaCl and 10% glycerol) containing 1 mM phenylmethylsulfonyl fluoride (PMSF) and protease inhibitor (Roche Applied Science) at 4°C, with 5 short bursts of 10 sec at intervals of 5 min. The lysate was centrifuged at 45,000 g for 30 min at 4°C. The supernatant was applied to Ni-NTA beads equilibrated with buffer A. The beads were then washed with 10 column volumes (CV) of wash buffer B (25 mM Tris pH 8.0, 0.5 M NaCl and 10% glycerol) containing 25 mM imidazole, and the proteins were eluted with 5 CV of elution buffer B containing 250 mM imidazole. The protein fractions were checked by SDS-PAGE, and fractions with high protein yield were pooled and subjected to size-exclusion chromatography on a Sephacryl S-100 HR 16/60 column equilibrated with buffer A. Fractions were collected at a flow rate of 0.6 mL/min in the same buffer (buffer A). The protein was confirmed by SDS-PAGE, and the peak fractions were pooled and concentrated. Protein concentration was determined by the Bradford method using bovine serum albumin (BSA) as the standard.

DDX43-KH-RecA1 and DDX53-KH-RecA1 fragments in pET28a were transformed into E. coli BL21(DE3)pLysS cells. The expression and purification protocol was the same as described above. DDX53-KH-RecA1 protein was further subjected to S-100 size exclusion chromatography, SEC650 size exclusion chromatography (BioRad) and HiTrap SP HP cation exchange chromatography to optimize the purification.

3.5 Western blot


For bacterially overexpressed and purified recombinant proteins, equal amounts of protein (5 µg) were denatured at 100°C for 5 min, resolved on 10% polyacrylamide Tris-glycine SDS gels, and transferred to PVDF membranes. The membrane was blocked in PBS containing 5% skimmed milk at room temperature for 1 h, followed by probing with a mouse anti-His-tag monoclonal antibody (1:5000, cat# MA1-80218, Invitrogen). HEK293T cells transfected with pcDNA3.0 vector containing 3xFLAG DDX43-KH-89 or DDX53-KH-73 were lysed in RIPA buffer, and equal amounts of protein (30 µg) were resolved on polyacrylamide Tris-glycine SDS gels and transferred to PVDF membrane. The membrane was blocked in PBS containing 5% skimmed milk at room temperature for 1 h, followed by probing with a mouse monoclonal ANTI-FLAG M2 Peroxidase (HRP) antibody (cat# A8592, Sigma-Aldrich). The membranes were washed with 1X PBS and developed using Clarity Max Western ECL substrate (cat# 1705062, Bio-Rad). The membranes were visualized using a ChemiDoc Imaging System (Bio-Rad).

3.6 Electrophoretic mobility shift assay (EMSA)

Protein/DNA or RNA binding mixtures (20 µL) contained the indicated concentrations of DDX43 KH domain and 0.5 nM of the specified 32P-end-labeled DNA or RNA substrate in the reaction buffer (40 mM Tris (pH 8.0), 0.5 mM magnesium chloride (MgCl2), 0.15 mM sodium chloride (NaCl), 0.01% Nonidet P-40, 0.1 mM DTT and 1 mg/mL bovine serum albumin (BSA)) without ATP. The binding mixtures were incubated at room temperature for 30 min after the addition of DDX43 KH or DDX53 KH proteins. After incubation, 3 µL of loading dye (74% glycerol, 0.01% xylene cyanol, 0.01% bromphenol blue) was added to each mixture, and samples were loaded onto native 5% (19:1 acrylamide/bisacrylamide) polyacrylamide gels and electrophoresed at 200 V for 2 h at 4°C using 1×TBE as the running buffer. The resolved radiolabeled species were visualized using a PhosphorImager and analyzed with Quantity One software (Bio-Rad).

3.7 Helicase unwinding assays

The helicase assay reaction mixtures (20 μL) contained 40 mM Tris (pH 8.0), 0.5 mM magnesium chloride (MgCl2), 0.15 mM sodium chloride (NaCl), 0.01% Nonidet P-40, 0.1 mM DTT, 1 mg/mL bovine serum albumin (BSA), equimolar mixture of 2 mM ATP and MgCl2, 0.5 nM of the specified duplex RNA and DNA substrate, and the indicated concentrations of full length DDX43, DDX43 KH, or DDX53 KH proteins. Helicase reactions were initiated by the addition of DDX43 or DDX53 KH proteins and incubated at 37°C for 15 min unless otherwise indicated. Reactions were quenched by addition of 20 μL of 2×stop buffer (17.5 mM EDTA, 0.3% SDS, 12.5% glycerol, 0.02% bromphenol blue, 0.02% xylene cyanol). For standard duplex RNA and DNA substrates, a 10-fold excess of the unlabeled oligonucleotide with the same sequence as the labeled strand was included in the quench to prevent reannealing. Reaction products of duplex RNA substrates were resolved on non-denaturing 15% (19:1 acrylamide:bisacrylamide) polyacrylamide gels and products of DNA unwinding reactions were resolved on non-denaturing 12% (19:1 acrylamide:bisacrylamide) polyacrylamide gels. Radiolabeled RNA and DNA species in polyacrylamide gels were visualized using a PhosphorImager and quantitated using Quantity One software (Bio-Rad).

3.8 15N-Isotopic labeling of recombinant DDX43-KH proteins

15N-isotopic labelling of the DDX43-KH-89 and DDX43-KH-126 proteins and of the A81S and A81G mutants for 1H-15N HSQC NMR was performed as described previously (Dmitriev et al., 2006). Briefly, the pET28a plasmid containing DDX43-KH-WT, DDX43-KH-A81S or DDX43-KH-A81G was transformed into E. coli BL21(DE3)pLysS cells. The cells were grown at 37°C in LB medium containing 30 μg/mL kanamycin and 34 μg/mL chloramphenicol until the A600 reached 0.6-0.8, then pelleted and washed with M9 wash solution (M9 salts without NH4Cl, pH 7.4). The cells were pelleted again at 6,000 g for 30 min and gently resuspended in 1 L of 15N-M9 growth medium containing M9 growth medium, 15NH4Cl (Cambridge Isotopes), 30 μg/mL kanamycin and 34 μg/mL chloramphenicol, and expression was induced with 0.5 mM IPTG overnight at 18°C. The purification procedure was the same as for the unlabelled DDX43 KH recombinant protein. Additionally, the 6xHis-tag was removed with thrombin protease (GE Healthcare) and the cleaved protein was passed through a Ni-NTA affinity column. The flowthrough was subjected to Sephacryl S-100 size-exclusion chromatography, and 0.6 mL fractions were collected in a 96-well plate using buffer containing 25 mM HEPES (pH 7.4) and 0.25 M NaCl. Protein concentration was determined by the Bradford method using bovine serum albumin (BSA) as the standard. The oligonucleotides used in the experiment were in the same buffer (25 mM HEPES (pH 7.4) and 250 mM NaCl) as the protein.

The NMR samples contained 0.2 mM DDX43-KH-89 protein, 0.25 M NaCl, 25 mM HEPES (pH 7.4), 5% (v/v) D2O, and 1 mM sodium 2,2-dimethyl-2-silapentane-5-sulfonate (DSS, chemical shift reference). For titrations, oligonucleotides were added at molar ratios of 0, 0.2, 0.5, 1.0, 2.0, and 5.0 relative to the protein. The 1H-15N HSQC experiments were conducted on a 600 MHz Bruker NMR spectrometer. The data were processed using NMRFx Processor and analyzed using NMRViewJ (Yu et al., 2017). The combined chemical shift change Δδ was calculated as $\Delta\delta = [(\Delta\delta_{NH}^{2} + \Delta\delta_{N}^{2}/25)/2]^{1/2}$, where ΔδNH and ΔδN are the chemical shift changes of the amide proton and nitrogen, respectively. The structure of DDX43 KH-89 was predicted using the Phyre 2 server (Kelley et al., 2015). The peaks in the 1H-15N HSQC spectrum corresponding to the backbone amide groups were assigned to amino acid residues using SPARTA+ (Shen and Bax, 2010). Oligonucleotide dissociation constants were calculated from the combined chemical shift change as a function of oligonucleotide concentration using a fast-exchange regime model as described previously (Williamson, 2013).
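The combined chemical shift change and the fast-exchange fit described above can be sketched as follows. This is an illustration only, not the analysis code used in this work: the single-site 1:1 binding isotherm is the standard fast-exchange form, while the function names, the coarse grid search over Kd, and the synthetic titration values are assumptions chosen for the example.

```python
import math

def combined_shift(d_nh_ppm, d_n_ppm):
    """Combined 1H/15N shift change: sqrt((dNH^2 + dN^2/25) / 2)."""
    return math.sqrt((d_nh_ppm ** 2 + (d_n_ppm ** 2) / 25.0) / 2.0)

def fast_exchange_shift(ligand_total, protein_total, kd, delta_max):
    """Predicted shift for 1:1 binding in fast exchange.

    All concentrations must be in the same unit (e.g. mM).
    """
    s = protein_total + ligand_total + kd
    bound_fraction = (s - math.sqrt(s * s - 4.0 * protein_total * ligand_total)) / (
        2.0 * protein_total
    )
    return delta_max * bound_fraction

def fit_kd(ligand_totals, shifts, protein_total, kd_grid):
    """Hypothetical coarse fit: scan Kd over a grid and, for each Kd,
    solve the best delta_max analytically by linear least squares."""
    best = None
    for kd in kd_grid:
        f = [fast_exchange_shift(l, protein_total, kd, 1.0) for l in ligand_totals]
        denom = sum(x * x for x in f)
        if denom == 0.0:
            continue
        delta_max = sum(x * y for x, y in zip(f, shifts)) / denom
        sse = sum((delta_max * x - y) ** 2 for x, y in zip(f, shifts))
        if best is None or sse < best[0]:
            best = (sse, kd, delta_max)
    return best[1], best[2]

if __name__ == "__main__":
    protein_mm = 0.2                      # protein concentration from the text
    ratios = [0, 0.2, 0.5, 1.0, 2.0, 5.0]  # molar ratios from the text
    ligands = [r * protein_mm for r in ratios]
    # Synthetic shifts generated with an assumed Kd and delta_max (not measured data).
    shifts = [fast_exchange_shift(l, protein_mm, 0.1, 0.3) for l in ligands]
    grid = [0.01 * i for i in range(1, 101)]
    print(fit_kd(ligands, shifts, protein_mm, grid))
```

A grid search is used here only to keep the sketch dependency-free; a nonlinear least-squares routine would normally be preferred for real titration data.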

3.9 Systematic evolution of ligands by exponential enrichment (SELEX)

SELEX was performed as described previously (Tuerk and Gold, 1990; Thisted et al., 2001). A DNA library containing ~1×10^15 oligonucleotides with a 20 nt random sequence in the middle and primers on both sides (O-32001-20, TriLink Biotechnologies) was subjected to selection for binding to 6xHis-tagged DDX43-KH-89 using Ni-NTA affinity chromatography. The 20 nt random region was flanked by PCR primers of 23 nucleotides on each side (forward primer 5′-TAGGGAAGAGAAGGACATATGAT-3′ and reverse primer 5′-TCAAGTGGTCATGTACTAGTCAA-3′). The oligonucleotide library was PCR amplified, heated, and kept on ice, then subjected to selection for binding with purified recombinant DDX43-KH-89 protein. Briefly, 2.4 µM of purified DDX43-KH-89 protein was incubated with 4.8 µM of oligonucleotides at 4 °C for 30 min with rotation, using Ni-NTA affinity beads that were equilibrated with 1 × HRB binding buffer (400 mM Tris HCl (pH 8.0), 5 mM MgCl2, 0.1% NP-40, BSA (10 mg/mL), and 150 mM NaCl). The beads were centrifuged at 500 g for 5 min, and the supernatant was discarded. To remove non-specifically bound oligonucleotides, the beads were washed 5 times with 1 × HRB binding buffer, each time centrifuging at 500 g for 5 min and discarding the supernatant. Bound oligonucleotides were eluted by adding elution buffer (25 mM Tris HCl (pH 8.0), 500 mM NaCl, 10% glycerol and 200 mM imidazole) and incubating for 30 min with rotation at 4 °C; the beads were then centrifuged at 500 g for 5 min and the supernatant was collected. DNA was extracted with phenol:chloroform:isoamyl alcohol, precipitated with an equal volume of 100% isopropanol and 5-50 µg of glycogen, washed with 70% ethanol, and dissolved in sterile MilliQ water. The eluted DNA was PCR amplified and denatured at 92-95 °C to form ssDNA. The ssDNA was then subjected to the next round of selection with DDX43-KH-89 protein, using half as much protein in each round as in the previous round. Six rounds of binding, elution, amplification, and enrichment were performed. The final DNA was purified and used for DNA library construction using a NEBNext ultra II DNA library prep kit (E7645, NEB). The quality of the library was checked with a Bioanalyzer 2100 (Agilent). The multiplexed DNA samples were combined and analyzed in one lane of 125 cycles with paired-end 125 nt reads on an Illumina HiSeq 2500 system (NRC Saskatoon).
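The arithmetic behind the selection scheme above can be made explicit with a short sketch. It takes the library size, random-region length, primer length, starting protein concentration, and number of rounds from the text; treating 2.4 µM as the round-1 concentration and halving it from round 2 onward is an assumption, and the function names are hypothetical.

```python
RANDOM_REGION_NT = 20       # random region length (from the text)
PRIMER_NT = 23              # each flanking primer (from the text)
LIBRARY_MOLECULES = 1e15    # approximate library size (from the text)
START_PROTEIN_UM = 2.4      # protein in the first round, in µM (assumed to be round 1)
ROUNDS = 6                  # number of selection rounds (from the text)

def sequence_space(n_random):
    """Number of distinct sequences for a fully random region of n_random nt."""
    return 4 ** n_random

def protein_schedule(start_um, rounds):
    """Protein concentration per round when it is halved every round."""
    return [start_um / (2 ** i) for i in range(rounds)]

if __name__ == "__main__":
    space = sequence_space(RANDOM_REGION_NT)
    print("full oligo length (nt):", 2 * PRIMER_NT + RANDOM_REGION_NT)
    print("possible 20-mers:", space)
    print("approx. copies of each sequence:", LIBRARY_MOLECULES / space)
    for i, conc in enumerate(protein_schedule(START_PROTEIN_UM, ROUNDS), start=1):
        print(f"round {i}: {conc:g} µM protein")
```

Under these assumptions the random region is covered many times over by the starting library, so each possible 20-mer is expected to be present in the first round.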

3.10 Cell culture

The HEK293T cells (CRL-11268, ATCC) were cultured in 10-cm and 15-cm petri dishes at 37°C under 5% CO2. Unless otherwise stated, all cell lines were maintained in Dulbecco’s modified Eagle’s medium (DMEM) (Sigma-Aldrich) supplemented with 10% fetal bovine serum (F1051, Sigma) and 50 µg/mL penicillin streptomycin (P4333, Sigma). For subculture, cells were detached from the plate by trypsinization. Cells were subcultured 2-3 times per week at a dilution of 1:5 - 1:20. The working area was decontaminated with 70% ethanol before and after use, and cell culture work was performed under sterile conditions.

3.11 Transfection of HEK293T cells and immunoprecipitation

HEK293T cells cultured in 10-cm plates were transfected with plasmid DNA (pcDNA-3xFLAG-DDX43-KH) using 1% polyethyleneimine ‘Max’ (PEI) at a ratio of 1:4. For each plate, 10 μg of DNA was mixed with 500 μL of 0.15 M sterile NaCl by gentle vortexing for 10 sec. Then 40 μL of the transfection reagent, PEI, was added to this mixture, followed by another 10 sec of gentle vortexing. The DNA–PEI complex was formed by incubating the mixture at room temperature for 10 min and was then dispensed dropwise onto the plates. The cells were incubated for 24 h post transfection before harvest for DNA and RNA library construction (described below).


3.12 ChIP-DNA and CLIP-RNA library construction

At 24 h after transfection, the overexpressed protein was cross-linked with nucleic acids using formaldehyde (37%, 40 µL/mL of media) for 15 min, and the reaction was quenched with 1.5 M glycine (60 µL/mL). The cells were washed with cold PBS, suspended in 10 mL of swelling buffer (10 mM Tris-HCl (pH 7.5), 2 mM MgCl2 and 3 mM CaCl2), incubated on ice for 30 min and centrifuged at 500 g for 10 min; the supernatant was discarded and the pellet was collected. The pellet was suspended in 1 mL of sonication buffer (50 mM HEPES (pH 7.9), 140 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% Na-deoxycholate, 0.1% SDS, 0.5 mM PMSF, and protease inhibitor cocktail). The DNA was sheared by 20 cycles of sonication (30 s/cycle) at 50% duty cycle with 1-min intervals between cycles, followed by centrifugation at 14,000 rpm for 15 min. The lysate containing nucleic acid bound to DDX43-KH-89aa protein was added to M2-FLAG beads (A2220, Sigma) equilibrated with sonication buffer and incubated for 2 h at 4 °C. The beads were centrifuged at 3,000 rpm for 8 min at 4 °C and the supernatant was discarded. NTN 500 mM NaCl buffer (50 mM Tris (pH 7.4), 500 mM NaCl and 0.5% NP40) was then added to the slurry of protein M2-FLAG beads, which was rotated for a further 15 min at 4 °C. After removing the supernatant, the beads were washed with NTN 150 mM NaCl buffer (50 mM Tris (pH 7.4), 150 mM NaCl and 0.5% NP40) and rotated at 4 °C for 15 min. The pcDNA3 empty vector served as a control. The beads were suspended in BC 100 buffer (25 mM Tris (pH 7.4), 100 mM NaCl, 10% glycerol, and 0.1% Tween 20), and the DDX43-KH-89aa and DDX53-KH-73aa proteins were eluted by adding 3xFLAG peptide, followed by incubation at 65 °C for 1 h. The DNA was extracted with phenol-chloroform, precipitated with 100% ethanol, washed with 70% ethanol, and dissolved in 50 µL of sterile MilliQ water.
The isolated DNA fragments were end repaired using NEBNext Ultra II End Prep buffer and Enzyme mix, adaptor-ligated, cleaned with 0.9× sparQ PureMag beads, and size selected using 0.4× beads. The size-selected DNA was then amplified using NEBNext Multiplex Oligos for Illumina (Index primers 9, 1, 11, 12, 30 and 37) and universal primers (E7335, NEB). The products were cleaned using 0.9× sparQ PureMag beads.

For RNA extraction and purification, the Trizol RNA extraction reagent and chloroform protocol was used to isolate the sheared RNA. The isolated RNA was eluted in nuclease-free water. The RNA library was prepared using the NEBNext Ultra II directional library prep kit (E7760, NEB) and the DNA library using the NEBNext ultra II DNA library prep kit (E7645, NEB) according to the manufacturer's instructions. Briefly, for RNA library preparation, the extracted RNA was converted into cDNA using random primers provided with the kit. The second strand was synthesized with NEBNext Second Strand Synthesis Reaction Buffer with dUTP and Enzyme Mix. The products were purified using 1.8× volume of sparQ PureMag beads and the cDNA was eluted with nuclease-free water. This cDNA was end repaired, adaptor-ligated, cleaned up using 0.9× sparQ PureMag beads, and size selected using 0.4× beads. The size-selected DNA was then amplified using six NEBNext Multiplex Oligos (Index primers 3, 4, 5, 6, 7 and 8) and universal primer (E7335, NEB). The final products were cleaned using 0.9× sparQ PureMag beads. The quality of the libraries was checked with a Bioanalyzer 2100 (Agilent). The multiplexed DNA samples were combined and analyzed in one lane of 125 cycles with paired-end 125 nt reads on an Illumina HiSeq 2500 system (NRC Saskatoon).

3.13 Bioinformatics analysis of SELEX, ChIP-DNA and CLIP-RNA sequences

The bioinformatics analysis was performed by Daniel Hogan (graduate student in Dr. Anthony Kusalik’s lab, University of Saskatchewan). The paired-end SELEX reads were trimmed using cutadapt (v1.18) (Martin, 2011) to remove primers and extract the DNA insert. The resulting trimmed reads were decomposed into k-mers for k = 5 to 20. The set of k-mers present in the control-read pool was subtracted from the set of k-mers present in the positive-read pool. These k-mers, termed exclusive k-mers, were then sorted by the number of times they appeared in the positive set. P-values were assigned to each k-mer using Fisher’s exact test. The resulting P-values were adjusted for multiple comparisons using the Benjamini-Hochberg (BH) procedure with a false-discovery rate of 5%. The resulting set of significantly enriched k-mers was submitted to MEME (Bailey et al., 2009) for motif discovery.

The adaptors of the paired-end FASTQ reads from ChIP-DNA and CLIP-RNA were trimmed using cutadapt (v1.18) (Martin, 2011), and the reads were then analyzed with FASTQC (v0.11.8) (Andrews, 2010) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to confirm adaptor removal and read quality. Quality trimming was omitted because read quality was sufficient across the entire length of the read (>25 Phred score). The resulting trimmed reads were mapped to the hg38 human reference genome using Bowtie2 (version 2.3.5) (Langmead and Salzberg, 2012). The BWT index for hg38 can be obtained from the NCBI FTP server. From the mapped reads, peaks were called using MACS2 (2.1.1.20160309) (Zhang et al., 2008) with a p-value threshold of 1×10^-5, a threshold used in other studies (Tang, 2017; Terranova et al., 2018) (https://zenodo.org/record/819971#.XcrY_3dFyF5). The control and test reads were submitted jointly to account for background noise. The peaks were written out in BED format, which specifies the location of each peak in the genome among other attributes. The DNA sequences located at each peak were extracted using the getfasta command from BEDTools (v2.28.0) (Quinlan and Hall, 2010), and the collection of sequences was submitted to MEME (Bailey et al., 2009) for motif discovery.
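The SELEX k-mer enrichment steps described above (k-mer decomposition, subtraction of control k-mers, Fisher's exact test, and Benjamini-Hochberg adjustment) can be sketched as below. This is not the script used for the analysis. The layout of the 2×2 contingency table (k-mer count versus all other k-mers, in the positive versus control pool), the one-sided test, and all function names are assumptions made for illustration.

```python
from collections import Counter
from math import comb

def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def count_kmers(reads, k_min, k_max):
    """Count every k-mer for k_min <= k <= k_max across a pool of reads."""
    counts = Counter()
    for read in reads:
        for k in range(k_min, k_max + 1):
            counts.update(kmers(read, k))
    return counts

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test P(X >= a) for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(col1, x) * comb(n - col1, row1 - x) for x in range(a, upper + 1))
    return numerator / comb(n, row1)

def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def exclusive_enriched(positive_reads, control_reads, k_min=5, k_max=20, fdr=0.05):
    """K-mers absent from the control pool that pass the FDR threshold."""
    pos = count_kmers(positive_reads, k_min, k_max)
    ctl = count_kmers(control_reads, k_min, k_max)
    pos_total, ctl_total = sum(pos.values()), sum(ctl.values())
    exclusive = sorted((km for km in pos if km not in ctl), key=lambda km: -pos[km])
    pvals = [fisher_greater(pos[km], pos_total - pos[km], 0, ctl_total) for km in exclusive]
    qvals = benjamini_hochberg(pvals)
    return [(km, pos[km], q) for km, q in zip(exclusive, qvals) if q <= fdr]
```

Exact binomial coefficients keep the sketch free of external dependencies but scale poorly; for sequencing-depth counts a statistics library would be the practical choice.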

4. RESULTS

4.1 Protein overexpression and purification

4.1.1 Purification of DDX43-KH and DDX53-KH domain proteins

To characterize and crystallize the DDX43-KH and DDX53-KH domains, I cloned different KH fragments (Table 5) into the pET28a vector, overexpressed them in E. coli and purified them by two-step chromatography. First, the 6xHis-tagged KH domain proteins were passed through a Ni-NTA affinity chromatography column, and the Ni2+-bound proteins were eluted with imidazole. Eluted fractions were evaluated by Coomassie-stained SDS-PAGE, and pure fractions were pooled and loaded onto a Sephacryl S-100 size exclusion column. Most KH domain proteins were purified to near homogeneity (Figure 12).

Figure 12. Purification of DDX43 KH and DDX53 KH domain proteins. (A-F) SDS-PAGE analysis of the eluted DDX43 KH and DDX53 KH fractions from the Ni-NTA column (left), chromatographic profiles of recombinant DDX43- and DDX53-KH domain proteins eluting from size-exclusion chromatography (center), and SDS-PAGE analysis of the eluted fractions (right). Panels: (A) DDX43-KH-74, (B) DDX43-KH-89, (C) DDX43-KH-126, (D) DDX53-KH-73, (E) DDX53-KH-80, (F) DDX53-KH-125. CL, cell lysate; CLF, cell lysate flow-through; WB, wash buffer; M, marker. (G) SDS-PAGE analysis of expression and purification of DDX43-KH-80 protein (left) and Western blotting analysis using an anti-His antibody (right).

According to the molecular weight standards used to calibrate the Sephacryl S-100 size exclusion column, the apparent molecular weights of the DDX43-KH-74, -89 and -126 proteins were estimated to be 9.6 kDa (predicted 10.4 kDa), 16.9 kDa (predicted 12.0 kDa) and 30.1 kDa (predicted 15.8 kDa), respectively, indicating that the KH-74 and -89 proteins exist as monomers in solution, whereas KH-126 exists as a dimer (Figure 12A-C). DDX43-KH-80 was found in the insoluble fraction (pellet), as revealed by Western blot (Figure 12G). Similarly, the apparent molecular weights of DDX53-KH-80 and -125 were 13.1 kDa (predicted 12.0 kDa) and 16.6 kDa (predicted 15.8 kDa), respectively, indicating that both proteins exist as monomers in solution (Figure 12E and F). After the Ni-NTA purification, DDX53-KH-73 (predicted 9.8 kDa) was purified to near homogeneity using SEC70 size-exclusion chromatography (Figure 12D), and it exists as a monomer in solution.
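The monomer/dimer assignments above follow from comparing the apparent molecular weight on the size-exclusion column with the predicted monomer weight. A minimal sketch of that comparison is given below; rounding the ratio to the nearest integer is an assumed rule of thumb used only for illustration (apparent size on a column also depends on molecular shape), and the function name is hypothetical.

```python
def oligomeric_state(apparent_kda, predicted_kda):
    """Return (apparent/predicted ratio, nearest whole number of subunits)."""
    ratio = apparent_kda / predicted_kda
    return ratio, max(1, round(ratio))

# Apparent and predicted molecular weights (kDa) as reported in the text.
MEASUREMENTS = {
    "DDX43-KH-74": (9.6, 10.4),
    "DDX43-KH-89": (16.9, 12.0),
    "DDX43-KH-126": (30.1, 15.8),
    "DDX53-KH-80": (13.1, 12.0),
    "DDX53-KH-125": (16.6, 15.8),
}

if __name__ == "__main__":
    for name, (apparent, predicted) in MEASUREMENTS.items():
        ratio, n = oligomeric_state(apparent, predicted)
        print(f"{name}: ratio {ratio:.2f}, about {n} subunit(s)")
```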

Table 5. Summary of DDX43 and DDX53 KH domains protein expression and their respective molecular weight and pI.

KH domains | Protein solubility | Predicted molecular weight (kDa) | Protein pI
DDX43-KH-74 aa (69 to 142) | Soluble | 10.4 | 9.67
DDX43-KH-80 aa (54-133) | Insoluble | 11.0 | 9.81
DDX43-KH-89 aa (54-142) | Soluble | 12.0 | 9.21
DDX43-KH-126 aa (47-172) | Soluble | 15.8 | 9.00
DDX53-KH-73 aa (50-122) | Soluble | 10.5 | 9.87
DDX53-KH-80 aa (38-118) | Soluble | 12.0 | 9.9
DDX53-KH-125 aa (28-152) | Soluble | 15.8 | 9.86

4.1.2 DDX43-KH+RecA1 and DDX53-KH+RecA1 proteins are unstable and difficult to purify

Besides the KH domain alone, I also tried to purify proteins that contain the KH domain and the RecA1 domain (Figure 3 and 4). After Ni-NTA column chromatography, I performed SDS-PAGE analysis and observed the expected DDX43-KH+RecA1 protein band of molecular weight 47 kDa in the eluted fraction (Figure 13A). However, the yield of soluble protein was very small. We tried different protein induction conditions: incubation at 25 °C for 5 h after IPTG induction and incubation at 37 °C for 3 h after IPTG induction. However, the results were the same: the majority of the protein was in the pellet fraction, the remaining protein in the cell lysate bound poorly to the Ni-NTA column, and therefore a negligible amount of DDX43-KH+RecA1 protein was in the elution fraction (Figure 13A). Western blotting further confirmed that the DDX43-KH+RecA1 protein was expressed predominantly in an insoluble form and that the soluble fraction in the cell lysate bound poorly to the column (Figure 13B).

Figure 13. Optimization of expression and purification of DDX43-KH+RecA1 and DDX53-KH+RecA1 proteins. (A and B) SDS-PAGE analysis (A) and Western blot analysis (B) of DDX43-KH+RecA1 (inside red square) expressed at 25 and 37 °C, before and after Ni-NTA affinity chromatography purification. M, marker. (C and D) SDS-PAGE analysis (C) and Western blot analysis (D) of DDX53-KH+RecA1 (inside red square) fractions before and after Ni-NTA affinity chromatography purification at 25 and 37 °C. (E) SDS-PAGE analysis of the eluted DDX53-KH+RecA1 fractions from a Ni-NTA column. (F and G) Chromatographic profile of recombinant DDX53-KH+RecA1 protein eluting from a SEC650 size exclusion column (F) and its SDS-PAGE analysis (G).

Similarly, for DDX53-KH+RecA1 protein expression we used 25 °C (incubation for 5 h) and 37 °C (incubation for 3 h) after induction with 0.5 mM IPTG. The protein yield was higher for the cultures at 37 °C than at 25 °C, and the expected protein was soluble and eluted at 47.5 kDa (Figure 13C). Western blotting confirmed that the DDX53-KH+RecA1 protein was expressed and soluble (Figure 13D). The fractions after Ni-NTA purification (Figure 13E) were loaded on a SEC650 size exclusion chromatography column for further purification; however, a broad peak (Figure 13F) and multiple bands on SDS-PAGE (Figure 13G) were seen, indicating that the protein was not stable.

4.2 DDX43-KH-74 and DDX43-KH-126 grow into needle-like small crystals and DDX53-KH proteins do not crystallize

After purifying DDX43-KH and DDX53-KH domain proteins (Table 6), we performed preliminary crystallization screens in a variety of reagents and at different pH values. Index screen, Crystal screen and Classic screen were used to search for crystallization conditions with concentrated proteins. The proteins were incubated at 4, 9, 15 and 22 °C with and without oligonucleotides (dT5, dT8 or dT10). Despite extensive efforts, no crystals were obtained with the oligonucleotides. However, in screens without oligonucleotide, DDX43-KH-74 protein with N- and C-terminal 6xHis-tags gave small needle-like crystals with 1D growth at 22 °C (as observed under the microscope) under two different conditions: magnesium formate and 0.1 M Tris pH 8.5 in one well (Figure 14A), and magnesium formate and polyethylene glycol (PEG) 3350 in another well (Figure 14B). Similar to DDX43-KH-74, needle crystals of the DDX43-KH-126 construct were obtained in 0.2 M MgCl2, 0.1 M Tris pH 8.5 and 3.4 M 1,6-hexanediol without oligonucleotide. Crystal conditions were optimized by varying the Mg2+ salt, such as MgCl2 (Figure 14C) and MgSO4 (Figure 14D). However, these results were not reproducible with the same proteins purified in different batches. We also did not get any hits for the other KH domain constructs of DDX43 and DDX53 proteins, with or without oligonucleotide. Crystal screening was also performed for the DDX43-KH-89 and DDX43-KH-126 proteins after their N-terminal 6xHis-tag was removed by thrombin protease; however, screening did not give any positive hits.

Crystals obtained from the initial hits were tested for the quality of their X-ray diffraction. However, due to their small size, the crystals did not show any diffraction. To get better crystals, we carried out limited proteolysis to remove any unstable ends from the DDX43-KH-126 protein. Proteases such as subtilisin, elastase, trypsin, chymotrypsin and GluC were used (Figure 15A); elastase and GluC gave better cleaved protein bands (Figure 15B). We performed Western blotting using an anti-His antibody and confirmed that the C-terminal amino acids were cleaved (Figure 15B). The elastase- and GluC-treated DDX43-KH-126 protein was purified by SEC70 size exclusion chromatography and analyzed further by DLS (Figure 15C and D). Fractions 24 and 25 from the elastase-treated sample gave a nearly 100% monodisperse reading, whereas GluC-treated protein formed different aggregate species. The purified samples were set up for crystallization screens, but did not give any positive hits.

The elastase-treated sample was analyzed by mass spectrometry to identify the amino acid residues cleaved. It showed that 22 amino acid residues were cleaved from the C-terminus of DDX43-KH-126. Thus, we cloned this new fragment, DDX43-KH-104 (47-150 aa), into the NdeI and XhoI sites of a pET28a vector. However, we did not see any protein expression for this construct and could not proceed to crystal screening.

Table 6. Summary of DDX43 and DDX53 KH domains protein expression, concentration and crystal screening.

KH domains | Protein solubility | Protein concentration (mg/mL) | Crystals screening | Crystals obtained
DDX43-KH-74 NC-His tag (69 to 142) | Soluble | ≤ 5.0 | Yes | Yes (not reproducible)
DDX43-KH-74 (69 to 142) | Soluble | ≤ 5.0 | Yes | No
DDX43-KH-80 (54-133) | Insoluble | NA | NA | NA
DDX43-KH-89 (54-142) | Soluble | 8.0 | Yes | No
DDX43-KH-126 (47-172) | Soluble | 10.0-12.0 | Yes | Yes (not reproducible)
DDX53-KH-73 (50-122) | Soluble | 7.0 | Yes | No
DDX53-KH-80 (38-118) | Soluble | 7.0 | Yes | No
DDX53-KH-125 (28-152) | Soluble | 10.0 | Yes | No


Figure 14. DDX43-KH protein in the form of needle crystals. (A and B) Crystals of DDX43-KH-74 protein in 0.1 M Tris pH 8.5 and 0.1 M magnesium formate dihydrate solution (A) and in 0.1 M magnesium formate dihydrate and 12.5% w/v PEG 3350 solution (B). (C and D) Crystals of DDX43-KH-126 protein in 0.3 M MgCl2, 3 M 1,6-hexanediol, 1 M Tris pH 8.5 (C) and in 0.1 M MgSO4, 3 M 1,6-hexanediol, 1 M Tris pH 8.5 (D).

Figure 15. Treatment of DDX43-KH-126 with different proteases. (A) Time-dependent treatment (1 h, 5 h and overnight) of DDX43-KH-126 protein with the indicated proteases (subtilisin, elastase, trypsin, chymotrypsin and GluC). (B) SDS-PAGE of DDX43-KH-126 protein treated with elastase and GluC (left) and Western blot analysis of the same using an anti-His antibody (right). (C and D) Chromatographic profile and SDS-PAGE analysis of elastase (C) and GluC (D) cleaved fractions of recombinant DDX43-KH-126aa protein eluting from a SEC70 size exclusion column.

4.3 Biochemical characterization of DDX43-KH and DDX53-KH domains using electrophoretic mobility shift assay (EMSA)

4.3.1 DDX43 and DDX53 KH domains bind ssDNA and RNA but not blunt-end dsDNA and dsRNA substrates

It is well documented that KH domains are involved in nucleic acid binding (Beuth et al., 2005); thus, we performed EMSA for the DDX43-KH and DDX53-KH domain proteins. The DDX43-KH-126 protein bound ssDNA (Figure 16A), forked dsDNA (Figure 16C), ssRNA (Figure 16D) and forked dsRNA (Figure 16F) substrates but not blunt-end dsDNA and dsRNA substrates (Figure 16B and E). The same preference for single-stranded substrates was observed for DDX43-KH-74 and DDX43-KH-89 (Figure 16G and H). DDX43-KH-104 (obtained by cleaving DDX43-KH-126 protein with elastase and purified using size exclusion chromatography) gave similar results (Figure 16I).

We also performed EMSA with the DDX53-KH-125 protein and found that it bound ssDNA and ssRNA substrates but not blunt-end dsDNA or dsRNA (Figure 16J). The DDX53-KH-80 protein showed very weak binding to ssDNA and forked dsDNA compared with the longer construct (Figure 16K). Taken together, these results suggest that the DDX43-KH and DDX53-KH domains bind ssDNA and ssRNA, as well as forked duplexes, but not blunt-end dsDNA or dsRNA substrates.

Figure 16. DDX43 and DDX53 KH proteins prefer to bind single-stranded DNA and RNA. (A-F) Representative EMSA images of increasing concentrations (0–9.6 μM) of DDX43 KH domain protein (126 aa) binding to 0.5 nM of ssDNA (A), blunt-end dsDNA (B), forked dsDNA (C), ssRNA (D), blunt-end dsRNA (E), or forked dsRNA (F). (G) DDX43-KH-74 protein binding to different DNA and RNA substrates. (H) DDX43-KH-89 protein binding to different DNA and RNA substrates. (I) DDX43-KH-104 (elastase-treated) protein binding to different DNA substrates. (J) DDX53-KH-125 protein binding to different DNA substrates. (K) DDX53-KH-80 protein binding to ssDNA and forked dsDNA substrates. NE, no enzyme.

4.3.2 DDX43-KH and DDX53-KH domains prefer pyrimidines over purines by EMSA

In KH domain-containing RNA-binding proteins, the role of the KH domain is to bind oligonucleotides in a sequence-specific manner (Beuth et al., 2005; Hollingworth et al., 2012). To identify the sequence preference of the DDX43-KH and DDX53-KH domains, we performed EMSA with dT, dC, dA and rU oligonucleotides. The DDX43-KH-126 domain bound dT30, dC30 and rU30 but not dA30 (Figure 17A-D), suggesting that the DDX43-KH domain prefers pyrimidines over purines. We also tested dT oligonucleotides of different lengths (dT5, dT6, dT7, dT10, dT15, dT20 and dT25) and found no binding for dT5, dT6 and dT7, and weak binding for dT10 (Figure 17E). Binding gradually increased for dT20 and dT25 (Figure 17F and G), suggesting that longer oligonucleotides are preferred.

We performed similar experiments with DDX53-KH-125 and obtained similar results (Figure 17H-J). These results suggest that the DDX43-KH and DDX53-KH domains prefer single-stranded pyrimidine oligonucleotides over purine oligonucleotides.


Figure 17. DDX43 and DDX53 KH domain proteins prefer binding to pyrimidines over purines. (A-D) Representative EMSA images obtained by incubating 0.5 nM of substrate dT30 (A), dC30 (B), dA30 (C) or rU30 (D) with increasing concentrations of DDX43-KH-126 protein (0-9.6 µM) at room temperature for 30 min. (E-G) EMSA of DDX43-KH-126 protein with dT oligonucleotides of different lengths: dT10 (E), dT20 (F) and dT25 (G). (H-J) EMSA performed by incubating 0.5 nM of substrate dT30 (H), dC30 (I) or dA30 (J) with increasing concentrations of DDX53-KH-125 protein (0-9.6 µM) at room temperature for 30 min. NE, no enzyme.


4.4 DDX43-KH and DDX53-KH domains alone have no helicase unwinding activity

Full-length DDX43 protein is able to unwind both DNA and RNA substrates (Talwar et al., 2017). It has also been shown that single-stranded DNA-binding proteins such as RPA and POT1 can separate double-stranded DNA or RNA (Georgaki et al., 1992; Treuner et al., 1996; Lao et al., 1999; Torigoe and Furukawa, 2006; Torigoe and Kaneda, 2007; Torigoe, 2007); thus, we asked whether the KH domain has unwinding activity of its own. Using a 12-bp dsRNA substrate, we found that full-length DDX43 unwound the duplex, whereas the KH domains of DDX43 and DDX53 alone did not (Figure 18A), although DDX43-KH-126 could bind the substrate (Figure 18B). These results suggest that the KH domain of DDX43 might bind substrates specifically and cooperate with the helicase domain to unwind them, but that the KH domain alone has no unwinding activity.

Figure 18. Unwinding assays of DDX43 and DDX53 KH proteins. (A) Full-length DDX43 unwound the 12-bp dsRNA substrate (0.5 nM), whereas DDX43-KH-126 (0.3-1.2 µM) and DDX53-KH-125 (0.3-1.2 µM), each assayed with and without ATP, did not. NE, no enzyme; WT, wild-type; filled triangle, heat-denatured substrate. (B) EMSA of DDX43-KH-126 protein (0.3-9.6 µM) binding to the 12-bp dsRNA substrate (0.5 nM).

4.5 Nuclear Magnetic Resonance (NMR) spectroscopy analysis of DDX43-KH domain

4.5.1 Expression and purification of 15N-labelled DDX43-KH-89 protein

Since we were unable to crystallize the DDX43-KH and DDX53-KH proteins, we selected NMR spectroscopy to study the protein structure and function. The DDX43-KH-126 and DDX43-KH-89 proteins were expressed in 15N-labeled M9 medium. The purification protocol was the same as for the unlabeled proteins. After purification, each protein was concentrated to 5-7 mg/mL, and 1H-15N HSQC spectra were acquired at 25 °C on a 600 MHz NMR spectrometer. The spectrum of the KH-126 protein revealed partial aggregation and misfolding. By comparison, the KH-89 protein was well folded and sufficiently stable at higher concentrations to conduct oligonucleotide binding studies by NMR. The spectral quality of the KH-89 protein improved further upon removal of the hexa-histidine tag after Ni-NTA purification (Figure 19A and B). The protein was further purified by S100 size exclusion column chromatography (Figure 19C and D). The homogeneous fractions were pooled and concentrated for NMR spectroscopy analysis.


Figure 19. Purification of 15N-labelled DDX43-KH-89 protein for NMR analysis. (A) SDS-PAGE analysis of the eluted KH-89 fractions from a Ni-NTA affinity column by imidazole. M, marker. (B) SDS-PAGE analysis of 6xHis tag-uncleaved and cleaved KH-89 proteins. (C) Chromatographic profile of 6xHis-tag cleaved KH-89 protein eluting from a Sephacryl S-100 column. (D) SDS-PAGE analysis of the KH-89 protein (peak 2) eluted from the size exclusion chromatography column.

4.5.2 Sequential amino acid assignment of DDX43-KH-89 protein and identifying the amino acids interacting with oligonucleotides

The 1H-15N HSQC NMR spectrum of the DDX43-KH-89 protein (His tag cleaved) was well dispersed, indicating that the protein is well folded. A model of DDX43-KH-89 was generated using the Phyre 2 server (http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index), with 98% of residues modelled at >90% confidence, and was used to tentatively assign the peaks in the HSQC spectrum to specific amino acids (Figure 20A and B). Eighty of the 89 amino acids were uniquely assigned. Shifts in the positions of signals from specific DDX43-KH-89 residues upon titration with dT10 and dT5 oligonucleotides (Figure 21A and B) identified the residues involved in oligonucleotide binding.

Figure 20. Amino acid peak assignments of the DDX43-KH-89 protein’s NMR spectrum. (A) The model structure of DDX43-KH-89 (N-terminus, T54; C-terminus, E142) generated by the Phyre 2 server. (B) The sequential amino acid assignments in the 1H-15N HSQC spectrum of DDX43-KH-89.

Figure 21. NMR characterization of the interaction of DDX43-KH-89 with dT10 and dT5 oligonucleotides. (A and B) Overlaid 1H-15N HSQC spectra of DDX43-KH-89 protein titrated with dT10 (A) and dT5 (B). (C) Combined chemical shift changes caused by dT10 and dT5 binding to DDX43-KH-89 as a function of amino acid residue number. (D) The model structure of DDX43-KH-89 with the amino acids potentially involved in dT5 binding highlighted.


Using the binding data, we plotted the combined chemical shift change against residue number (Figure 21C). Chemical shift changes upon binding to nucleic acids were used to calculate the dissociation constant (Kd) values. The average Kd values for DDX43-KH-89 binding to dT10 and dT5 were 0.17 ± 0.09 mM and 1.5 ± 0.29 mM, respectively. These results suggest that longer oligonucleotides bind with higher affinity than shorter ones, as there are more binding sites for the DDX43-KH-89 protein on dT10 than on dT5. We also identified the regions of the protein involved in nucleic acid binding (Figure 21D). The largest chemical shift changes were observed for residues in the GRGG motif, indicating that this motif is most directly involved in binding.
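The Kd values above are derived from chemical shift changes measured at increasing oligonucleotide concentrations. The fitting procedure is not spelled out in this section, so the following Python sketch only illustrates one common approach: a 1:1 binding model that accounts for ligand depletion, fitted by least squares. The function names, the log-spaced search grid and the synthetic data are illustrative assumptions, not the analysis actually used in this thesis.

```python
import math


def fraction_bound(p_total, l_total, kd):
    """Fraction of protein in complex for 1:1 binding (all values in mM)."""
    s = p_total + l_total + kd
    return (s - math.sqrt(s * s - 4.0 * p_total * l_total)) / (2.0 * p_total)


def fit_kd(p_total, ligand_concs, shifts, kd_min=1e-3, kd_max=100.0, steps=2000):
    """Grid-search least-squares fit; returns (Kd in mM, saturating shift in ppm).

    For each trial Kd the best saturating shift has a closed form, so only
    Kd needs to be scanned (here on a log-spaced grid, an arbitrary choice).
    """
    best = None
    for i in range(steps + 1):
        kd = kd_min * (kd_max / kd_min) ** (i / steps)
        f = [fraction_bound(p_total, l, kd) for l in ligand_concs]
        d_max = sum(fi * yi for fi, yi in zip(f, shifts)) / sum(fi * fi for fi in f)
        sse = sum((d_max * fi - yi) ** 2 for fi, yi in zip(f, shifts))
        if best is None or sse < best[0]:
            best = (sse, kd, d_max)
    return best[1], best[2]


# Synthetic, noise-free example (hypothetical numbers, not thesis data):
# 0.2 mM protein titrated with 0.04-1.0 mM oligonucleotide.
protein = 0.2
ligand = [0.04, 0.1, 0.2, 0.4, 0.6, 1.0]
observed = [0.05 * fraction_bound(protein, l, 1.5) for l in ligand]
kd_fit, d_max_fit = fit_kd(protein, ligand, observed)
```

When the Kd is well above the highest oligonucleotide concentration, as for the 5-mers here, the titration curve is far from saturation, which is one reason fitted Kd values can carry large uncertainties.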

4.5.3 DDX43-KH domain prefers binding pyrimidines over purines by NMR spectroscopy

We performed titration studies with a constant concentration of KH-89 protein and increasing concentrations of oligonucleotide. With successive additions of 5-mer oligonucleotides, many residues showed prominent chemical shift changes for dT5 (Figure 21B), dC5 (Figure 22A), and rU5 (Figure 22B), but negligible changes for dA5 (Figure 22C) or dG5 (Figure 22D), indicating that the DDX43 KH domain favors binding to pyrimidines rather than purines.

Figure 22. 1H-15N HSQC NMR spectra showing that the DDX43 KH domain prefers pyrimidine nucleotides. (A-D) Overlaid NMR spectra of DDX43-KH-89 protein (0.2 mM) titrated with dC5 (A), rU5 (B), dA5 (C) and dG5 (D) (0.04-1.0 mM). Insets show enlarged views of the chemical shift changes for residues G84 and G87 upon pyrimidine binding. (E) Combined chemical shift change (Δδ) of residue G87 plotted as a function of dT5, dC5 and rU5 oligonucleotide concentration at 200 μM DDX43 KH-WT protein. (F) Combined chemical shift change (Δδ) as a function of KH-89 (200 µM) residue number observed upon binding of dT5 (left), dC5 (center) and rU5 (right) oligonucleotides (400 µM).

We then calculated the Kd values for dT5, dC5 and rU5 as ~1.5 ± 0.3 mM, ~2.1 ± 0.7 mM, and ~1.8 ± 0.2 mM, respectively. These results suggest that, among pyrimidines, the DDX43-KH domain slightly prefers thymine- and uracil-rich oligonucleotides over cytosine-rich ones (Figure 22E), although the differences are comparable to the reported uncertainties.

Moreover, the same residues showed chemical shift changes for the dT5, dC5 and rU5 substrates, suggesting that all three oligonucleotides bind the KH domain in a similar manner (Figure 22F).
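The combined chemical shift change used throughout this section merges the 1H and 15N shift differences of each backbone amide peak into a single number. The weighting applied in this thesis is not stated in this section; a widely used convention is a weighted Euclidean distance, sketched below in Python with a 15N scaling factor of 0.14 as an assumed, illustrative value. The function names and peak positions are hypothetical.

```python
import math


def combined_shift(d_h_ppm, d_n_ppm, n_weight=0.14):
    """Weighted Euclidean combination of 1H and 15N shift changes (ppm).

    n_weight rescales the 15N axis to the 1H axis; 0.14 is an assumed
    convention, not a value taken from this thesis.
    """
    return math.sqrt(d_h_ppm ** 2 + (n_weight * d_n_ppm) ** 2)


def perturbed_residues(free_peaks, bound_peaks, threshold_ppm):
    """Return residues whose combined shift change exceeds the threshold."""
    hits = {}
    for residue, (h_free, n_free) in free_peaks.items():
        if residue not in bound_peaks:
            continue  # peak not assigned in the bound spectrum
        h_bound, n_bound = bound_peaks[residue]
        delta = combined_shift(h_bound - h_free, n_bound - n_free)
        if delta > threshold_ppm:
            hits[residue] = delta
    return hits


# Hypothetical peak positions (1H ppm, 15N ppm), for illustration only.
free = {"G84": (8.40, 109.0), "G87": (8.10, 108.0), "I101": (7.90, 121.0)}
bound = {"G84": (8.45, 109.3), "G87": (8.16, 108.4), "I101": (7.90, 121.0)}
shifted = perturbed_residues(free, bound, threshold_ppm=0.02)
```

Running the same comparison for each oligonucleotide and intersecting the resulting residue sets is one simple way to check whether different substrates perturb the same residues.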

4.6 Role of Ala81 and Gly87 amino acid(s) in DDX43 KH protein’s stability and function

4.6.1 DDX43-KH G87D and A81I mutants are unstable and insoluble for purification

Our NMR peak assignment data revealed that residues Ala81 and Gly87 are majorly involved in oligonucleotide interactions (Figure 21C). Sequence alignment of DDX43 proteins showed that these two amino acids are highly conserved in different organisms (Figure 23A). However, on aligning the DDX43-KH domain with other KH domains, we found that the glycine (G87 in DDX43) of the GXXG loop is highly conserved throughout all eukaryotic KH domains, whereas the alanine (A81 in DDX43) is not present in other KH domains except hnRNP K-1 and NOVA-3 (Figure 23B). To determine the importance of these two amino acids in nucleotide binding, we mutated Ala81 to Ile and Gly87 to Asp. Previously, we mutated G84 to aspartic acid and found that the binding of the DDX43-KH protein was abolished (Tanu et al. 2017). Therefore, to analyze the effect of G87 on binding affinity to oligonucleotides, we chose to mutate G87 to aspartic acid in the DDX43-KH-89 and DDX43-KH-126 constructs. For DDX43-KH-126-G87D, the target protein was observed in the cell pellet, cell lysate, and flow-through of the Ni-NTA column, but not in the elution (Figure 24A). This was confirmed by Western blot using an anti-His antibody (Figure 24B).

A
Species                       Sequence (GXXG region)
Homo sapiens                  VGAVIGRGGSKIK
Macaca mulatta                VGAVIGRGGSKIK
Pan troglodytes               VGAVIGRGGSKIK
Rattus norvegicus             VGAVIGRGGSKIR
Mus musculus                  VGAVIGRGGSKIR
Felis catus                   VGAVIGRGGSKIK
Bos taurus                    VGAVIGRGGSNIK
Ovis aries                    VGAVIGRGGSNIK
Anolis carolinensis           VGALIGRGGSRIK
Piliocolobus tephrosceles     VGAVIGRGGSKIK
Microtus ochrogaster          VGAVIGRGGSKIR
Urocitellus parryii           VGAVIGRGGSKIK
Bos indicus x Bos taurus      VGAVIGRGGSNIK
Lagenorhynchus obliquidens    VGAVIGRGGSKIK
Acinonyx jubatus              VGAVIGRGGSKIK
Zalophus californianus        VGAVIGRGGSKIK
Ursus arctos horribilis       VGAVIGRGGSKIK

B
KH domain       Sequence (GXXG region)
hnRNP K (1)     AGAVIGKGGKNIKAL
hnRNP K (2)     AGGIIGVKGAKIKEL
hnRNP K (3)     AGSIIGKGGQRIKQI
FMR-1 (1)       MGLAIGTHGANIQQA
FMR-1 (2)       VGKVIGKNGKLIQEI
NOVA (1)        AGSIIGKGGQTIVQL
NOVA (2)        AGLIIGKGGATVKAV
NOVA (3)        VGAILGKGGKTLVEY
SF1             VGLLIGPRGNTLKNI
DDX43           VGAVIGRGGSKIKNI
KHDR1           VGKILGPQGNTIKRL
DDX53           VGVVIGYSGSKIKDL
TDRKH           VGRIIGRGGETIRSI
KSRP            VGVVIGRSGEMIKKI
PCBP2           CGSLIGKGGCKIKEI
AKAP149         VGRLIGKQGRYVSFL
ANKHD1          VSRIMGRGGCNITAI
RS3             TQNVLGEKGRRIREL
FUBP1           VGLIIGRGGEQINKI
IF2B3           AGRVIGKGGKTVNEL
M3XC            VGLVVGPKGATIKRI
VIGILIN         HKFLIGKGGGKIRKV
ANKRD17         ISRVIGRGGCNINAI

Figure 23. Multiple sequence alignment of DDX43-KH domain for determining the conservation of A81 and G87 amino acids. (A) Amino acids A81 and G87 are conserved in DDX43 proteins from different eukaryotes. (B) Alignment of the GXXG motifs from DDX43 and other known human KH domains.

Figure 24. Purification of DDX43-KH-126-G87D and DDX43-KH-89-A81I mutant proteins. (A and B) SDS-PAGE analysis of DDX43-KH-126-G87D (16 kDa) protein after Ni-NTA affinity chromatography (A) and Western blot for the same using an anti-His antibody (B). M, protein ladder (10 to 250 kDa, Precision Plus Protein Prestained Standards dual color, Bio-Rad). (C and D) Chromatographic profile of KH-126-G87D protein eluting from a Sephacryl S-100 column (C) and its SDS-PAGE analysis (D). (E and F) SDS-PAGE analysis of DDX43-KH-89-A81I (12 kDa) protein after Ni-NTA affinity chromatography (E) and Western blot using an anti-His antibody (F).

Because the DDX43-KH-126-G87D protein did not bind to the Ni-NTA beads in the column, we collected the flow-through containing the unbound protein and loaded it onto an S100 size exclusion chromatography column. However, the protein eluted in the void volume of the column together with multiple bands/impurities, as seen in the chromatographic profile (Figure 24C) and on SDS-PAGE (Figure 24D), suggesting that the protein is prone to aggregation.


For DDX43-KH-89-A81I, the target protein was present in pellet but not in the soluble form (Figure 24E). This was confirmed by Western blot using an anti-His antibody (Figure 24F). Thus, we were unable to purify DDX43-KH-126-G87D and DDX43-KH-89-A81I proteins.

4.6.2. 1H-15N HSQC NMR spectroscopy of DDX43-KH-89-A81S and DDX43-KH-89-A81G

Unlike A81I and G87D, the recombinant DDX43-KH-89-A81S and DDX43-KH-89-A81G proteins could be purified under the same conditions as the wild-type (WT) protein. The 15N-labeled DDX43-KH-89-A81S (Figure 25A) and DDX43-KH-89-A81G (Figure 25B) proteins eluted as monomers from the size exclusion column, and we proceeded to obtain their 1H-15N HSQC NMR spectra. Most of the peaks in the spectra of the mutants overlapped with those in the spectrum of the native DDX43-KH-89, indicating that the overall conformation was not affected by the single amino acid change. Therefore, we transferred the WT peak assignments onto the overlapping peaks in the spectra of the DDX43-KH-89-A81G (Figure 26A) and DDX43-KH-89-A81S (Figure 26B) mutants.


Figure 25. Purification of DDX43-KH-89 A81S and A81G mutant proteins. (A and B) SDS-PAGE analysis of expression and purification of A81S (A) and A81G (B) proteins for NMR analysis. The protein bands after Ni-NTA purification (left), 6xHis cleaved (second from left), chromatographic profile (second from right) and its SDS-PAGE analysis (right) are shown for both mutants.


Figure 26. NMR spectral analysis of DDX43-KH-89 A81S and A81G mutant proteins. (A and B) Overlay of the WT spectrum (black) with the A81G (red; A) and A81S (blue; B) spectra. The transfer of amino acid peak assignments from WT to the mutants is shown in the uncrowded regions of the spectra.


4.6.3. DDX43-KH-89-A81S and DDX43-KH-89-A81G have reduced protein stability and binding activity

To identify the effect of the A81S and A81G mutations on the binding activity of the DDX43-KH domain, we titrated DDX43-KH-89-A81S (Figure 27A) and DDX43-KH-89-A81G (Figure 27B) with dT5 oligonucleotide and found that very few amino acids showed chemical shift changes, in contrast to the WT protein (Figure 27C). The chemical shift changes of residues G84 (Figure 27D) and G87 (Figure 27E) in the GRGG loop indicated decreasing binding in the order WT > A81G > A81S, demonstrating an important role of A81 in DNA binding. The spectra of the mutant proteins showed signs of aggregation in the 8.0-8.5 ppm region of the 1H axis, suggesting that the mutant proteins are not folded properly, which affects the stability of the DDX43-KH domain.

Interestingly, most residues involved in dT5 binding showed smaller chemical shift changes in A81S and larger chemical shift changes in A81G, compared to the WT (Figure 27C). Steric hindrance introduced by the hydroxyl group (-OH) in serine might explain the lower binding affinity and smaller DNA-induced chemical shift change as compared to glycine, which is more similar to alanine in size. The stronger DNA-induced chemical shifts in A81G mutant may be due to some repositioning of the oligonucleotide in the binding site, compared to the wild type. In addition, EMSA showed that DDX43-KH-89-A81S and DDX43-KH-89-A81G proteins had reduced binding to ssDNA and ssRNA substrates compared to the WT (Figure 28A and B). However, the binding was not completely abolished in these mutants. These results suggest that mutation of Ala81 to glycine or serine affects the binding activity of DDX43-KH domain. Therefore, Ala81 is not only important for the stability of the protein but is necessary for its proper function.


Figure 27. Analysis of the interaction of DDX43-KH-89 A81S and A81G with oligonucleotides using NMR spectroscopy. (A and B) Titration of A81S (A) and A81G (B) proteins with dT5 oligonucleotide monitored by NMR spectroscopy. (C) Overlay of the combined chemical shift changes (Δδ) caused by dT5 binding to A81S, A81G and WT DDX43-KH-89 as a function of amino acid residue number. (D and E) Combined chemical shift change (Δδ) of residues G84 (D) and G87 (E) plotted against dT5 oligonucleotide concentration for DDX43-KH WT, A81S and A81G proteins.

Figure 28. Analysis of the interaction of DDX43-KH-89 A81S and A81G proteins with oligonucleotides using EMSA. (A and B) Representative EMSA images of increasing concentrations (0–9.6 μM) of DDX43-KH-89 protein (WT, A81G, and A81S) binding to 0.5 nM of ssDNA (A) or ssRNA (B). NE, no enzyme.


4.7 Identification of nucleic acids bound by KH domain using SELEX and ChIP/CLIP-Seq methods

4.7.1 Sequences identified by SELEX

Our EMSA and NMR results suggested that the DDX43-KH domain has a sequence preference in binding. To identify the preferred sequence, we first used a high-throughput SELEX (systematic evolution of ligands by exponential enrichment) approach. We selected a synthetic DNA library containing ~1 × 10^15 molecules, each with 20 random nucleotides (O-32001-20, TriLink Biotechnologies). We incubated this random pool of DNA sequences with purified DDX43-KH-89 protein, and the bound oligonucleotides were eluted and enriched by PCR (Figure 29).

Figure 29. Identification of specific sequences by SELEX method. Flow chart showing the isolation and identification of nucleic acids bound by the DDX43 KH domain by SELEX.

After six rounds of binding, washing, elution, and enrichment, we recovered the final oligonucleotides from the protein-DNA complexes and constructed a DNA library using a NEBNext Ultra II DNA library prep kit. A selection carried out with no protein added served as the negative control. The quality of the DNA libraries was analyzed with a Bioanalyzer, which showed the desired band of ~200 bp for both samples, DDX43-KH-89 and the negative control (no protein) (Figure 30A and B). Because the starting library has a 20-nt random sequence flanked by 23 nt of primer sequence on each side, and 120 bp of adaptor-index primers were added during library construction, a total length of about 200 bp (186 bp) was expected (Figure 30C). After the quality check, we had the libraries deep sequenced.
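The expected library size can be checked with simple arithmetic from the component lengths quoted above (20 random nucleotides, a 23-nt primer site on each side, and 120 bp of adaptor-index sequence in total). The short Python sketch below does only that; the function and parameter names are illustrative.

```python
def expected_library_length(random_nt, primer_nt_per_side, adaptor_bp_total):
    """Total amplicon length: random core + two primer sites + adaptors."""
    return random_nt + 2 * primer_nt_per_side + adaptor_bp_total


library_bp = expected_library_length(random_nt=20, primer_nt_per_side=23,
                                     adaptor_bp_total=120)
```

With these values the sum is 186 bp, which is consistent with a Bioanalyzer band near 200 bp.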

After subtracting the reads from the no-protein control, 8877 20-nt sequences specific for the DDX43-KH-89 protein were obtained. These sequences were analyzed with the MEME server to identify recognition patterns. With a motif width of 6-8 nt, the three top consensus sequences were found in 234 sequences, and all contain 3-4 C/T-rich elements with an insertion of one or two guanine nucleotides between them (Figure 31A, B and C). With a motif width of 5-20 nt (up to the full length of the random region), the two top consensus sequences were found in 291 sequences, and all contain 3-4 C/T-rich elements and an insertion of one or two guanine nucleotides (Figure 31D and E). These results suggest that C/T repeats with an inserted guanine residue might be preferred by the DDX43-KH domain.
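The two steps described above, removing sequences that also appear in the no-protein control and then looking for a recurring pattern, can be illustrated with a very small Python sketch. It is not a substitute for MEME, which performs statistical motif discovery; the regular expression below is only a hypothetical, hand-written stand-in for "pyrimidine run, one or two guanines, pyrimidine run", and the sequences are invented for illustration.

```python
import re

# Hypothetical pattern: a C/T run, one or two G, then another C/T run.
CT_G_CT = re.compile(r"[CT]{2,4}G{1,2}[CT]{2,4}")


def enriched_sequences(selected, control):
    """Return selected sequences that never appear in the no-protein control."""
    control_set = set(control)
    return [seq for seq in selected if seq not in control_set]


def count_motif_hits(sequences, pattern=CT_G_CT):
    """Count sequences containing at least one match to the pattern."""
    return sum(1 for seq in sequences if pattern.search(seq))


# Invented toy reads, for illustration only.
selected_reads = ["AACTTGCTAA", "AAAAAAAAAA", "GGCTGGTCAG", "TTTTTTTTTT"]
control_reads = ["TTTTTTTTTT"]
specific = enriched_sequences(selected_reads, control_reads)
hits = count_motif_hits(specific)
```

A count of this kind only indicates how often a chosen pattern occurs; it says nothing about statistical enrichment, which is what a motif-discovery tool reports.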

Figure 30. Quality check by Bioanalyzer of the DNA library constructed using the SELEX method. (A and B) Bioanalyzer results for the DNA library constructed for DDX43-KH-89 protein (A) and the negative control (no protein, B) by the SELEX method. (C) Cartoon depicting the total length of the DNA library constructed from the SELEX procedure (186 bases: 60 nt, 23 nt, 20 nt, 23 nt and 60 nt segments).


Figure 31. Motif sequences identified for DDX43-KH-89 protein by SELEX method. (A- C) Top three motif sequences obtained from SELEX method using 6-8 nt width in MEME server. (D and E) Two top DNA sequence motifs identified by SELEX method using 20-nt width in MEME server.

4.7.2 Sequences identified by ChIP/CLIP-seq


To identify the nucleic acids bound by the DDX43-KH-89 and DDX53-KH-73 proteins in vivo, we cloned both fragments into a pcDNA3 vector with a 3×FLAG tag. The target proteins were overexpressed in HEK293T cells (Figure 32A), protein-nucleic acid complexes were crosslinked with formaldehyde, and DNA and RNA were sheared by sonication to obtain fragment sizes between 200 and 600 bp (Figure 32B and C). Proteins complexed with sheared DNA and RNA fragments were captured by anti-FLAG affinity chromatography and eluted with 3×FLAG peptide. DNA and RNA (converted to cDNA) were used for library construction (Figure 32D). The quality of the libraries was checked with a Bioanalyzer. The DNA libraries showed an average peak of ~400 bp for the DDX43-KH-89, DDX53-KH-73 and negative control samples, suggesting that the DNA libraries were of a suitable size (Figure 33A-C). We obtained a small and broad peak for the RNA libraries of DDX43-KH-89, DDX53-KH-73 and the negative control samples, which might be due to the low amount of RNA library product (Figure 33D-F). All libraries were subjected to deep sequencing.

Figure 32. Identification of specific sequences bound by the KH domain using ChIP/CLIP-Seq methods. (A) Western blot analysis of DDX53-KH-73 (12.5 kDa) and DDX43-KH-89 (13 kDa) proteins expressed in HEK293T cells, using an anti-FLAG antibody (Sigma). M, protein marker; NT, non-transfected cells; EV, empty vector. (B and C) Agarose gels showing the size of the bands for the ChIP-DNA (B) and CLIP-RNA (C) libraries for DDX43-KH-89, DDX53-KH-73 and empty vector. M, 1 kb DNA marker. (D) Flow chart of the modified ChIP/CLIP-Seq method used.


Figure 33. Quality check by Bioanalyzer of the DNA/RNA libraries from ChIP/CLIP-Seq methods. (A-C) Bioanalyzer results for the DNA libraries of DDX43-KH-89 (A), DDX53-KH-73 (B), and the negative control (C). (D-F) Bioanalyzer results for the RNA libraries of DDX43-KH-89 (D), DDX53-KH-73 (E), and the negative control (F).

Figure 34. Motif sequences identified for DDX43-KH-89 protein by ChIP/CLIP-seq method. (A and B) Top two motif sequences obtained by ChIP-Seq for DDX43-KH-89. (C and D) Top two motif sequences obtained by CLIP-Seq for DDX43-KH-89.

For the DDX43-KH-89 DNA library, sequences were aligned with the human reference genome (hg38), and the sequences obtained from the empty vector control were subtracted. Using sequences 295-529 bp in length, we found two top consensus sequences with a significant CT-rich motif (Figure 34A and B). For the RNA library of DDX43-KH-89, using sequences 296-712 bp in length, we found two top consensus sequences with significant CT- or GT-rich motifs (Figure 34C and D), which is consistent with the sequences obtained by the SELEX method.

4.7.3 Validation of specific nucleic acid sequences bound by DDX43-KH domain using 1H-15N HSQC NMR spectroscopy

Since we did not find any sequences common to SELEX and ChIP/CLIP-seq, we took the overall occurrence of CT-rich and TG-rich sequences as the preferred sequence feature across all three techniques. To validate the sequence motifs identified by SELEX and ChIP/CLIP-seq, we tested the binding of several different sequences, including CT/TG repeats, to DDX43-KH-89 by 1H-15N HSQC NMR spectroscopy (Figure 35). First, we compared the chemical shift changes caused by the homopurine sequence AGAGA and the homopyrimidine sequence CTCTC, and found that the DDX43 KH domain has a higher affinity for the pyrimidine sequence (Table 7).

Table 7. Affinities of DDX43-KH-89 domain with indicated oligonucleotide sequences determined by NMR spectroscopy.

Oligonucleotide sequence    Kd value (mM)
GTTGT                       0.0435 ± 0.0065
TGTGT                       0.0776 ± 0.011
GTTTG                       0.165 ± 0.025
GGTTG                       0.249 ± 0.076
CGCGC                       0.266 ± 0.170
ACCAC                       0.348 ± 0.140
ATATA                       0.400 ± 0.250
ATTAT                       0.497 ± 0.180
CTCTC                       0.995 ± 0.601
CACAC                       1.232 ± 0.600
AGAGA                       2.232 ± 1.236
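To make the spread of affinities in Table 7 easier to compare, the Kd values can be ranked and expressed as fold differences or as approximate binding free energies. The Python sketch below uses only the mean Kd values from Table 7; the function names are illustrative, and the temperature of 298.15 K (25 °C) and the 1 M standard state are assumptions made for this example.

```python
import math

# Mean Kd values (mM) copied from Table 7; uncertainties are ignored here.
KD_MM = {
    "GTTGT": 0.0435, "TGTGT": 0.0776, "GTTTG": 0.165, "GGTTG": 0.249,
    "CGCGC": 0.266, "ACCAC": 0.348, "ATATA": 0.400, "ATTAT": 0.497,
    "CTCTC": 0.995, "CACAC": 1.232, "AGAGA": 2.232,
}

GAS_CONSTANT = 8.314462618e-3  # kJ/(mol*K)


def fold_tighter(seq, reference, kd=KD_MM):
    """How many times lower the Kd of seq is than that of reference."""
    return kd[reference] / kd[seq]


def binding_free_energy(kd_mm, temperature_k=298.15):
    """Approximate standard binding free energy (kJ/mol), 1 M standard state."""
    return GAS_CONSTANT * temperature_k * math.log(kd_mm * 1e-3)


ranked = sorted(KD_MM, key=KD_MM.get)  # tightest binder first
gttgt_vs_agaga = fold_tighter("GTTGT", "AGAGA")
```

Because several of the reported uncertainties are large relative to the means (for example, CTCTC and AGAGA), fold differences computed this way should be read as rough indications only.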


Figure 35. Overlay of 1H-15N HSQC titration spectra of DDX43-KH-89 protein (200 μM) with the indicated oligonucleotides (40-1000 μM): GTTGT, CTCTC, GTTTG, GGTTG, TGTGT, AGAGA, CGCGC, CACAC, ATATA, ACCAC and ATTAT. Insets show enlarged views of the chemical shift changes for residues G84 and G87.


Analyzing mixed purine-pyrimidine sequences, we found that (1) TGTGT has the lowest Kd (0.078 mM) among the alternating dinucleotide repeats; (2) adenine is not a preferred nucleotide, as evidenced by the higher Kd values for ATATA, ACCAC, CACAC, and ATTAT (Kd ranging from 0.348 to 1.232 mM); and (3) guanine increases binding affinity, especially when adjacent to a pyrimidine, as in CGCGC and TGTGT. Among the pyrimidines, thymine is preferred over cytosine, as demonstrated by comparing the binding affinities of TGTGT (Kd = 0.078 mM) and CGCGC (Kd = 0.266 mM). This is consistent with the slightly higher binding affinity of dT5 compared to dC5 (shown above).

Finally, we tested the nucleotide position effect by comparing GGTTG, GTTTG, and GTTGT, and found that GTTGT has the highest binding affinity (Kd=0.044 mM), suggesting TTGT as a preferred motif for the DDX43-KH domain. The combined chemical shift change (Δδ), plotted as a function of oligonucleotide concentration at 200 μM DDX43 KH-WT protein for residues G84 (Figure 36A) and G87 (Figure 36B), showed the highest binding for the GTTGT oligonucleotide.
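The comparisons drawn from the Kd table can be checked with simple arithmetic. The Python sketch below is an illustration only: it ranks the eleven pentamers by Kd, expresses each Kd as a fold-change relative to the tightest binder, and compares the mean Kd of sequences with and without adenine. The names `KD_MM`, `fold_over_best` and `mean_kd` are hypothetical helpers introduced here, and the fitting uncertainties listed in the table are ignored.

```python
# Kd values (mM) copied from the Kd table; the +/- uncertainties are omitted.
KD_MM = {
    "GTTGT": 0.0435, "TGTGT": 0.0776, "GTTTG": 0.165, "GGTTG": 0.249,
    "CGCGC": 0.266, "ACCAC": 0.348, "ATATA": 0.400, "ATTAT": 0.497,
    "CTCTC": 0.995, "CACAC": 1.232, "AGAGA": 2.232,
}


def fold_over_best(kd):
    """Express every Kd as a multiple of the smallest (tightest) Kd."""
    best = min(kd.values())
    return {seq: value / best for seq, value in kd.items()}


def mean_kd(kd, predicate):
    """Arithmetic mean of the Kd values whose sequence satisfies predicate."""
    selected = [value for seq, value in kd.items() if predicate(seq)]
    return sum(selected) / len(selected)


ranked = sorted(KD_MM, key=KD_MM.get)          # tightest binder first
folds = fold_over_best(KD_MM)
with_a = mean_kd(KD_MM, lambda s: "A" in s)    # sequences containing adenine
without_a = mean_kd(KD_MM, lambda s: "A" not in s)

for seq in ranked:
    print(f"{seq}  Kd = {KD_MM[seq]:.4f} mM  ({folds[seq]:.1f}x the lowest Kd)")
print(f"mean Kd with A: {with_a:.3f} mM, without A: {without_a:.3f} mM")
```

A plain mean over so few sequences is a crude summary, so it should be read only as a quick consistency check on the qualitative trends stated in the text.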

[Figure 36 panels: (A) G84, (B) G87. Axes: oligonucleotide concentration (mM) vs. Δδ (ppm).]

Figure 36. Combined chemical shift (Δδ) vs. oligonucleotide concentration (mM) for residues G84 and G87. (A and B) Combined chemical shift change (Δδ) plotted as a function of oligonucleotide concentration (0-1000 μM) at 200 μM DDX43 KH-WT protein for residues G84 (A) and G87 (B).
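To illustrate how a Kd can be extracted from titration curves of this kind, the Python sketch below fits a standard 1:1 binding model with ligand depletion to combined chemical shift changes. It is a minimal sketch under stated assumptions and not necessarily the fitting procedure used in this work: the single-site model, the function names `fraction_bound` and `fit_kd`, the grid-search strategy, and the synthetic data points are all introduced here for illustration.

```python
import math


def fraction_bound(ligand_mM, protein_mM, kd_mM):
    """Fraction of protein bound in a 1:1 complex, allowing for ligand depletion."""
    total = protein_mM + ligand_mM + kd_mM
    root = math.sqrt(total * total - 4.0 * protein_mM * ligand_mM)
    return (total - root) / (2.0 * protein_mM)


def fit_kd(ligand_mM, delta_ppm, protein_mM, kd_min=1e-4, kd_max=100.0, steps=2000):
    """Scan Kd on a logarithmic grid and return (Kd, maximum shift change).

    For a fixed Kd the model is linear in the maximum shift change, so the
    best amplitude follows directly from linear least squares.
    """
    best = None
    for i in range(steps + 1):
        kd = kd_min * (kd_max / kd_min) ** (i / steps)
        f = [fraction_bound(l, protein_mM, kd) for l in ligand_mM]
        ff = sum(x * x for x in f)
        if ff == 0.0:
            continue
        dmax = sum(x * y for x, y in zip(f, delta_ppm)) / ff
        sse = sum((y - dmax * x) ** 2 for x, y in zip(f, delta_ppm))
        if best is None or sse < best[0]:
            best = (sse, kd, dmax)
    return best[1], best[2]


# Synthetic, noise-free data (hypothetical): 0.2 mM protein, Kd = 0.05 mM.
protein = 0.2
ligand = [0.0, 0.04, 0.1, 0.2, 0.4, 0.6, 1.0]
true_kd, true_dmax = 0.05, 0.30
observed = [true_dmax * fraction_bound(l, protein, true_kd) for l in ligand]

kd_fit, dmax_fit = fit_kd(ligand, observed, protein)
print(f"fitted Kd = {kd_fit:.4f} mM, fitted maximum shift change = {dmax_fit:.3f} ppm")
```

The grid search is chosen only to keep the sketch dependency-free; with real, noisy data a nonlinear least-squares routine that also reports parameter uncertainties would be the more usual choice.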

5. DISCUSSION

In this study, using EMSA, SELEX, ChIP/CLIP-Seq, and NMR, we have found that the KH domain in DDX43 helicase preferentially binds ssDNA and ssRNA, with a preference for pyrimidine-rich sequences, particularly the tetranucleotide TTGT. Similarly, using EMSA, we have found that the DDX53-KH domain preferentially binds ssDNA or ssRNA, with a preference for pyrimidines.

5.1 DDX43-KH and DDX53-KH domains bind single-stranded DNA and RNA substrates

The KH domain was first characterized in the hnRNP K protein and has since been known to bind single-stranded DNA or RNA substrates (Valverde et al., 2008). Using EMSA and NMR, we found that the DDX43-KH and DDX53-KH domains preferentially bind single-stranded DNA and RNA substrates but show no binding to blunt-end double-stranded substrates. In all KH domains, the phosphate backbone of the nucleic acid interacts with the conserved GXXG region, and the single-strand preference arises from hydrogen bonds formed between the second β-sheet of the type I KH domain (third β-sheet for the type II KH domain) and the Watson-Crick edges of the bases in single-stranded oligonucleotides (Hollingworth et al., 2012; Nicastro et al., 2015). This binding is not possible if the bases are paired with a complementary strand. KH domains not only bind single-stranded nucleic acids but also recognize their substrates in a sequence-specific manner. This base specificity is determined by the five amino acid residues preceding the GXXG motif (Braddock et al., 2002). KH domain-containing proteins such as FBP bind ssDNA in the upstream region of the c-myc promoter in vivo and in vitro (Braddock et al., 2002). The three KH domains of poly(C)-binding protein 2 (PCBP2) bind poly(C) sequences with high affinity and specificity and regulate multiple biological processes, including mRNA stability and post-transcriptional regulation (Fenn et al., 2007; Nazarov et al., 2019). PCBP2 binds the C-rich sequence in the internal ribosomal entry site (IRES) element for cap-independent translation of viral RNA (Fenn et al., 2007). These examples illustrate the important role played by KH domains in proteins: loss of KH domain function results in impaired ssDNA/RNA binding and in various diseases, such as fragile X mental retardation syndrome (Valverde et al., 2008; Hollingworth et al., 2012). Here, we found that both DDX43-KH and DDX53-KH domains prefer pyrimidines.

5.2 DDX43-KH domain prefers pyrimidines


The biological function of the pyrimidine-rich sequences bound by the DDX43-KH and DDX53-KH domains remains to be determined. The preference of KH domains for pyrimidines may be related to a narrow binding cleft that accommodates only the smaller bases. In the FBP KH3 and KH4 domains in complex with M29 ssDNA, the center of the groove formed by helices α1 and α2 is hydrophobic, whereas the edges are hydrophilic and charged; the resulting narrow binding site favors pyrimidines over purines (Braddock et al., 2002). The KH domain-containing protein P element somatic inhibitor (PSI) has been shown to typically bind poly-pyrimidine RNA (Amarasinghe et al., 2001). Specific binding of hnRNP K to the single-stranded pyrimidine-rich sequence in the promoter of the human c-myc gene activates transcription (Tomonaga and Levens, 1996). Binding of hnRNP K and PCBP1 to the C-rich strand of human telomeric DNA was established in vitro (Lacroix et al., 2000) and in the K562 human cell line (Bandiera et al., 2003). The polypyrimidine tract-binding protein (PTB) is an important factor in the regulation of alternative splicing and provides an interesting example of RNA recognition by multiple RRM domains (Keppetipola et al., 2012). Likewise, αCP1-KH3 possesses a narrow hydrophobic cleft that would be expected to accommodate pyrimidine-rich ssRNA or ssDNA (Sidiqi et al., 2005). Rather than homothymine, TTGT is preferred by the DDX43-KH domain. Consistent with our findings, another group recently reported that the DDX43 protein has the highest affinity for TG-repeat sequences (Wu et al., 2019). Interestingly, TTGT is also the optimal binding sequence for the KH2+KH3 domains of FUSE binding protein (FBP) (Benjamin et al., 2008). These findings suggest that the selection of specific nucleic acid sequences by such DNA/RNA-recognizing domains is determined by the specific amino acids that contact the nucleic acid. These amino acid codes could therefore be used to model the affinity and specificity of KH domains for nucleic acid sequences de novo.

5.3 Role of the GXXG loop in DDX43 KH domain

The conserved GXXG loop is a hallmark of KH domains. This loop forms one wall of the bent oligonucleotide binding site and interacts with the phosphate backbone of the central nucleotides, forcing their bases into specific recognition pockets that determine base specificity. Both glycines contact the oligonucleotide, and a side chain at either position would introduce steric hindrance. Moreover, the C-terminal glycine adopts a backbone conformation that is attainable only for a glycine (positive φ/ψ torsion angles). Mutation of the first glycine in any two KH domains of coding region determinant-binding protein (CRD-BP) completely abolishes RNA binding in vitro (Barnes et al., 2014). Similarly, substituting the first glycine of the GXXG motif in GLD-1, Gly-227, with Asp or Ser results in a loss-of-function phenotype (Lewis et al., 2000). We previously showed that changing G84 to aspartic acid in DDX43 helicase abolishes binding to ssDNA/RNA substrates (Talwar et al., 2017). In addition to the two conserved glycines, we found that the alanine adjacent to the GRGG loop is crucial for nucleic acid binding by the DDX43-KH domain (Figure 22), which is consistent with the notion that the five amino acid residues located N-terminal to the GXXG motif discriminate the first two bases of the ssDNA or ssRNA tetrad (Braddock et al., 2002). KSRP KH1 and KH4 have the same conserved “GRGG” residues in the loop as DDX43 KH and bind AU-rich regions (Garcia-Mayoral et al., 2007). However, KH domains in different proteins show different sequence preferences despite having high secondary structural similarity and conserved amino acid sequences. In all KH domain-nucleic acid structures, the nucleic acid backbone interacts with the conserved GXXG loop; this segment links the two helices of the KH core and orients the four nucleic acid bases towards the hydrophobic groove of the protein, where a network of main-chain and side-chain hydrogen bonds mediates nucleobase recognition (Nicastro et al., 2015).

5.4 Oligomeric status of DDX43-KH and DDX53-KH domains

Isolated KH domains have been seen to crystallize as monomers, dimers and tetramers (Valverde et al., 2008), but no published data support the formation of noncovalent higher-order oligomers by KH domains in solution. We found that the DDX43-KH domain forms both monomers and dimers, with the dimer having a higher affinity for nucleic acids. However, the full-length DDX43 protein exists as a monomer (Talwar et al., 2017; Wu et al., 2019), which excludes the possibility that DDX43 forms a dimer through its KH domain. Thus, although the isolated KH domain forms monomers and dimers, the dimer may exist only when the domain is isolated. We also purified the DDX53-KH domain, which eluted as a monomer. The first KH domain of PCBP2 in complex with ssDNA forms two identical dimer complexes per asymmetric unit (Du et al., 2005); however, no dimer or higher-order oligomer has been shown in vivo. We also co-purified full-length DDX43 protein with nucleic acid (dT8); however, we did not obtain a dimeric form, indicating that nucleic acid does not affect the oligomeric state of the DDX43 protein.

5.5 Role of KH domain in DDX43 and DDX53 helicases

Although KH domains have been reported in many RNA-binding proteins (Nicastro et al., 2015), the role of the KH domain in helicases has not been reported. In fact, DDX43 and DDX53 are the only two human helicases that contain a KH domain at their N-termini (Figure 6). We have found that the KH domain in DDX43 is not only required for full unwinding activity but also partially determines substrate specificity. Our previous work showed that only the full-length DDX43 protein could unwind RNA and DNA duplex substrates, whereas the C-terminal helicase domain has no unwinding activity on RNA and very weak unwinding activity on DNA (Talwar et al., 2017). This suggests that the KH domain is crucial for DDX43 to unwind its substrates, which is supported by recent work demonstrating that the KH domain is more important than the helicase core domain for enzyme-substrate interaction (Wu et al., 2019). Our EMSA and NMR spectroscopy results revealed that the KH domain displays substrate preference; however, it can still bind diverse sequences, except those that are purine-rich. Therefore, the single KH domain of 75-85 aa in DDX43 acts like a clamp for the DDX43 helicase domain (~555 aa), enhancing substrate binding and partially regulating substrate specificity (Figure 37). Generally, in other DEAD-box helicases, the motifs in the RecA2 domain are responsible for binding nucleic acid substrates (Fairman-Williams et al., 2010). The RecA1 and RecA2 motifs act together to couple NTP hydrolysis to nucleic acid binding and unwinding (Cordin et al., 2006; Fairman-Williams et al., 2010). For example, in the eIF4A helicase, motifs Ia and Ib associate with motifs IV and V while binding RNA substrates (Cordin et al., 2006). In the solved crystal structure of the DDX53 helicase domain (PDB ID: 3IUY) without its KH domain, motifs V and VI contribute to nucleotide interaction (Schutz et al., 2010). Thus, the C-terminal helicase domain itself exhibits nucleotide binding, and the KH domain in DDX43 and DDX53 enhances nucleic acid binding. We propose that the KH domain is responsible for stronger binding to a single-stranded tail, which is 5′ for RNA and 3′ for DNA, while the two RecA domains are responsible for unwinding duplex nucleic acids and display only weak binding without the KH domain (Figure 37).


[Figure 37 diagram labels: KH domain acting like a clamp on nucleic acid substrates; helicase domain (RecA1 and RecA2) binds and unwinds nucleic acids by coupling ATP hydrolysis (ATP to ADP); DNA/RNA substrates with 3′ and 5′ ends.]

Figure 37. A working model showing the role of KH domain in helicase. The KH domain acts like a clamp for the helicase domain and enhances substrate binding for the full length DDX43 and DDX53 helicases, where the C-terminal helicase domain couples ATP hydrolysis and nucleic acid binding and unwinding of the ssDNA/RNA substrate.

Moreover, by binding the separated single strand, the KH domain could prevent re-annealing and thereby increase the processivity of the helicase. Therefore, both the KH domain and the helicase domain are necessary for the full function of DDX43 helicase (Wu et al., 2019). It will be interesting to determine the structure of the KH domain with its DNA/RNA ligand, or of the full-length DDX43 and DDX53 proteins.

5.6 Potential biological role of KH domain containing helicases DDX43 and DDX53 in cancer

DDX43 and DDX53 are overexpressed in a variety of cancers and thus can serve as biomarkers and immunotherapy targets. The expression of DDX43 was shown to be a potential prognostic marker and a predictor of response to anthracycline treatment in breast cancer (Abdel-Fatah et al., 2014; Abdel-Fatah et al., 2016). Expression of DDX43 is required for melanoma tumor growth and progression (Linley et al., 2012). Frequent expression of DDX43 was also found in chronic myeloid leukemia (CML) and acute myeloid leukemia (AML) (Adams et al., 2002; Chen et al., 2011; Roman-Gomez et al., 2007), and its expression is associated with advanced disease and poor prognosis in CML (Roman-Gomez et al., 2007). DDX43 is also upregulated in decitabine-resistant K562 cells (Wen et al., 2019). DDX43 is highly expressed in lung adenocarcinoma, and its expression level is related to the stage and metastasis of lung adenocarcinoma (Ma et al., 2017). Recently, it was found that arresting of miR-186 and releasing of lncRNA H19 by DDX43 facilitate tumorigenesis in CML (Lin et al., 2018). Similarly, DDX53 is expressed in testis and various types of tumors but has negligible expression in normal tissues (Kim et al., 2010; Kim et al., 2017). Hypomethylation of the DDX53 promoter region results in overexpression of the protein, and DDX53 might play roles in cancer cell proliferation (Cho et al., 2002; Kim and Jeoung, 2008). Therefore, DDX43 and DDX53 are potential targets for anti-tumor immunotherapy. Because our current biochemical evidence suggests that the KH domain is partially responsible for substrate specificity and affinity, the KH domain alone could be a candidate epitope for the development of peptide vaccines for tumor immunotherapy. Although it remains unknown where TTGT/UUGU motifs are present in cells, the preferred sequence of the DDX43 KH domain could potentially be used as an antagonist to inhibit the oncogenic function of DDX43.
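As a first step towards locating TTGT/UUGU motifs in candidate sequences, a plain string scan is sufficient. The Python sketch below is an illustrative assumption and not an analysis performed in this work: the function name `find_motif` and the example RNA sequence are hypothetical, and RNA input is handled simply by treating U as T.

```python
def find_motif(sequence, motif="TTGT"):
    """Return the 0-based start of every (possibly overlapping) motif occurrence.

    Both inputs are upper-cased and U is mapped to T, so that DNA (TTGT)
    and RNA (UUGU) spellings are treated alike.
    """
    seq = sequence.upper().replace("U", "T")
    target = motif.upper().replace("U", "T")
    positions = []
    start = seq.find(target)
    while start != -1:
        positions.append(start)
        start = seq.find(target, start + 1)  # step by one to allow overlaps
    return positions


# Pentamers taken from the NMR titrations, plus one hypothetical RNA sequence.
pentamers = ["GTTGT", "TGTGT", "GTTTG", "GGTTG", "AGAGA"]
hits = {p: find_motif(p) for p in pentamers}
example_rna = "AUUGUUGUCC"  # hypothetical sequence, for illustration only

print(hits)
print(find_motif(example_rna))
```

Of the pentamers scanned here, only GTTGT contains the complete TTGT tetranucleotide. An exact-match scan of this kind ignores binding context and affinity, so any hits in real sequences would only be candidates for experimental validation.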

6. CONCLUSIONS AND FUTURE WORK

6.1 Conclusions

In conclusion, we have found that the KH domains of DDX43 and DDX53 bind ssDNA and ssRNA but not blunt-end dsDNA or dsRNA substrates. EMSA results suggest that the DDX43-KH domain prefers rU30 > dT30 > dC30, whereas it shows negligible binding to the adenine oligonucleotide (dA30). Also, the DDX43-KH and DDX53-KH domains do not have any unwinding activity of their own. We were unable to crystallize the KH domains of DDX43 and DDX53 helicases. Using NMR spectroscopy, we found that the conserved GXXG region in the DDX43-KH domain interacts with single-stranded oligonucleotides. Binding by the DDX43-KH domain increases with the length of the oligonucleotide, suggesting that longer nucleic acids are preferred. NMR data show that the DDX43-KH protein has a binding preference of dT5 > rU5 > dC5 and negligible interaction with dA5 or dG5.

In addition to the conserved GXXG loop, we found that the alanine (A81) adjacent to the GRGG loop is crucial for nucleic acid binding of the DDX43 KH domain. The A81S and A81G mutants show reduced oligonucleotide binding and reduced protein stability, suggesting that A81 is important for the stability and function of the DDX43-KH domain.

Our results from SELEX and ChIP/CLIP-seq suggest that the DDX43 KH domain prefers TC-repeats and TG-repeats. HSQC NMR spectroscopy suggests that the KH domain has the highest affinity for GTTGT and the TG-repeat TGTGT (Kd = 0.0435 ± 0.0065 mM and 0.0776 ± 0.011 mM, respectively) and does not prefer purine-rich sequences such as AGAGA (Kd = 2.232 ± 1.236 mM).

Together, these results suggest that a single KH domain confers only limited ssDNA/RNA sequence specificity unless multiple KH domains are present in a given protein. The KH domains in DDX43 and DDX53 might coordinate with their neighboring helicase domains to gain full binding specificity and unwinding activity.

6.2 Future work

1) To generate A81G, A81S and G87D mutations in the full-length DDX43 and determine their protein stability and enzymatic activities.

2) To analyze the ChIP/CLIP-seq data of DDX53-KH-73 protein and identify its specific sequences.

3) To validate the sequences identified by ChIP/CLIP-seq for DDX53-KH-73 using NMR spectroscopy and other methods.

4) Our NMR spectroscopy peak assignments of DDX43-KH-89aa protein revealed that 5-6 amino acids in the starting N-terminal region remain unassigned with their designated peaks (Figure 20), suggesting this region is unstable. It is possible that DDX43-KH protein without those 5-6 N-terminal residues might form crystals.

7. REFERENCES

Abdelhaleem, M. (2004). Do human RNA helicases have a role in cancer? Biochim. Biophys. Acta. 1704(1), 37-46.

Abdel-Fatah, T.M., McArdle, S.E., Johnson, C., Moseley, P.M., Ball, G.R., Pockley, A.G., Ellis, I.O., Rees, R.C. and Chan, S.Y. (2014). HAGE (DDX43) is a biomarker for poor prognosis and a predictor of chemotherapy response in breast cancer. Br. J. Cancer 110, 2450-2461.


Abdel-Fatah, T.M., McArdle, S.E., Agarwal, D., Moseley, P.M., Green, A.R., Ball, G.R., Pockley, A.G., Ellis, I.O., Rees, R.C. and Chan, S.Y. (2016). HAGE in triple-negative breast cancer is a novel prognostic, predictive, and actionable biomarker: A transcriptomic and protein expression analysis. Clin. Cancer Res. 22, 905-914.

Adams, S.P., Sahota, S.S., Mijovic, A., Czepulkowski, B., Padua, R.A., Mufti, G.J. and Guinn, B.A. (2002). Frequent expression of HAGE in presentation chronic myeloid leukaemias. Leukemia, 16, 2238-2242.

Amarasinghe, A.K., MacDiarmid, R., Adams, M.D. and Rio, D.C. (2001). An in vitro-selected RNA-binding site for the KH domain protein PSI acts as a splicing inhibitor element. RNA, 7, 1239-1253.

Ambrosini, G., Khanin, R., Carvajal, R. and Schwartz, G. (2014). Overexpression of DDX43 mediates MEK inhibitor resistance through RAS upregulation in uveal melanoma cells. Mol. Cancer Ther. 13(8), pp.2073-2080.

Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Backe, P.H., Messias, A.C., Ravelli, R.B., Sattler, M., and Cusack, S. (2005). X-ray crystallographic and NMR studies of the third KH domain of hnRNP K in complex with single-stranded nucleic acids. Structure, 13(7), 1055-1067.

Bailey, T., Boden, M., Buske, F., Frith, M., Grant, C., Clementi, L., Ren, J., Li, W. and Noble, W. (2009). MEME SUITE: tools for motif discovery and searching. Nucl. Acids Res. 37(Web Server), pp.W202-W208.

Bandiera, A., Tell, G., Marsich, E., Scaloni, A., Pocsfalvi, G., Akintunde Akindahunsi, A., Cesaratto, L. and Manzini, G. (2003). Cytosine-block telomeric type DNA-binding activity of hnRNP proteins from human cell lines. Arch. Biochem. Biophys. 409, 305-314.

Benjamin, L.R., Chung, H.J., Sanford, S., Kouzine, F., Liu, J. and Levens, D. (2008). Hierarchical mechanisms build the DNA-binding specificity of FUSE binding protein. Proc. Natl. Acad. Sci. USA, 105, 18296-18301.

Barnes, M., van Rensburg, G., Li, W., Mehmood, K., Mackedenski, S., Chan, C., King, D., Miller, A. and Lee, C. (2014). Molecular Insights into the Coding Region Determinant-binding Protein-RNA Interaction through Site-directed Mutagenesis in the Heterogeneous Nuclear Ribonucleoprotein-K-homology Domains. J. Biol. Chem. 290(1), pp.625-639.

Bernstein, K.A., Gangloff, S., and Rothstein, R. (2010). The RecQ DNA helicases in DNA repair. Annu. Rev. Genet. 44, 393-417.

Beuth, B., Pennell, S., Arnvig, K.B., Martin, S.R., and Taylor, I.A. (2005). Structure of a Mycobacterium tuberculosis NusA-RNA complex. EMBO J. 24(20), 3576-3587.


Braddock, D., Baber, J., Levens, D. and Clore, G. (2002). Molecular basis of sequence-specific single-stranded DNA recognition by KH domains: solution structure of a complex between hnRNP K KH3 and single-stranded DNA. EMBO J. 21(13), pp.3476-3485.

Braddock, D.T., Louis, J.M., Baber, J.L., Levens, D., and Clore, G.M. (2002). Structure and dynamics of KH domains from FBP bound to single-stranded DNA. Nature. 415(6875), 1051-1056.

Brosh, R.M., Jr., and Bohr, V.A. (2007). Human premature aging, DNA repair and RecQ helicases. Nucl. Acids Res. 35(22), 7527-7544.

Byrd, A.K., and Raney, K.D. (2012). Superfamily 2 helicases. Front. Biosci. (Landmark. Ed). 17:2070-2088.

Chénard, C. and Richard, S. (2008). New implications for the QUAKING RNA binding protein in human disease. J. Neurosci. Res. 86(2), pp.233-242.

Chen, T., Damaj, B.B., Herrera, C., Lasko, P., and Richard, S. (1997). Self-association of the single-KH-domain family members Sam68, GRP33, GLD-1, and Qk1: role of the KH domain. Mol. Cell Biol. 17(10), 5707-5718.

Chen, Q., Lin, J., Qian, J., Yao, D.M., Qian, W., Li, Y., Chai, H.Y., Yang, J., Wang, C.Z., Zhang, M. et al. (2011). Gene expression of helicase antigen in patients with acute and chronic myeloid leukemia. Zhongguo Shi Yan. Xue. Ye. Xue. Za Zhi, 19, 1171-1175.

Cho, B., Lim, Y., Lee, D., Park, S., Lee, H., Kim, W., Yang, H., Bang, Y. and Jeoung, D. (2002). Identification and characterization of a novel cancer/testis antigen gene CAGE. Biochem. Biophys. Res. Commun. 292(3), pp.715-726.

Cordin, O., Banroques, J., Tanner, N.K., and Linder, P. (2006). The DEAD-box protein family of RNA helicases. Gene. 367:17-37.

Dickey, T.H., Altschuler, S.E., and Wuttke, D.S. (2013). Single-stranded DNA-binding proteins: multiple domains for multiple functions. Structure. 21(7), 1074-1084.

Dillingham, M.S. (2011). Superfamily I helicases as modular components of DNA-processing machines. Biochem Soc. Trans. 39(2), 413-423.

Dmitriev, O., Tsivkovskii, R., Abildgaard, F., Morgan, C. T., Markley, J. L., and Lutsenko, S. (2006). Solution structure of the N-domain of Wilson disease protein: distinct nucleotide-binding environment and effects of disease mutations. Proc. Natl. Acad. Sci. USA. 103, 5302-5307.

Dodson, R. and Shapiro, D. (1997). Vigilin, a ubiquitous protein with 14 K homology domains, is the estrogen-inducible vitellogenin mRNA 3′-untranslated region-binding protein. J. Biol. Chem. 272(19), pp.12249-12252.


Dolgova, N. V., Nokhrin, S., Yu, C. H., George, G. N., and Dmitriev, O. Y. (2013). Copper chaperone Atox1 interacts with the metal-binding domain of Wilson's disease protein in cisplatin detoxification. Biochem. J. 454, 147-156.

Du, Z., Lee, J.K., Tjhen, R., Li, S., Pan, H., Stroud, R.M., and James, T.L. (2005). Crystal structure of the first KH domain of human poly(C)-binding protein-2 in complex with a C-rich strand of human telomeric DNA at 1.7 A. J. Biol. Chem. 280(46), 38823-38830.

Fairman-Williams, M.E., Guenther, U.P., and Jankowsky, E. (2010). SF1 and SF2 helicases: family matters. Curr. Opin. Struct. Biol. 20(3), 313-324.

Fenn, S., Du, Z., Lee, J., Tjhen, R., Stroud, R. and James, T. (2007). Crystal structure of the third KH domain of human poly(C)-binding protein-2 in complex with a C-rich strand of human telomeric DNA at 1.6 A resolution. Nucl. Acids Res. 35(8), pp.2651-2660.

Fuller-Pace, F.V. (2013). DEAD box RNA helicase functions in cancer. RNA. Biol. 10(1), 121-132.

Garcia-Mayoral, M.F., Hollingworth, D., Masino, L., Diaz-Moreno, I., Kelly, G., Gherzi, R., Chou, C.F., Chen, C.Y. and Ramos, A. (2007). The structure of the C-terminal KH domains of KSRP reveals a noncanonical motif important for mRNA degradation. Structure, 15, 485-498.

Golden, C., Breen, M., Koro, L., Sonar, S., Niblo, K., Browne, A., Burlant, N., Di Marino, D., De Rubeis, S., Baxter, M., Buxbaum, J. and Harony-Nicolas, H. (2019). Deletion of the KH1 Domain of Fmr1 Leads to Transcriptional Alterations and Attentional Deficits in Rats. Cereb. Cortex, 29(5), pp.2228-2244.

Guo, M., Vidhyasagar, V., Talwar, T., Kariem, A. and Wu, Y. (2016). Mutational analysis of FANCJ helicase. Methods, 108, pp.118-129.

Grishin, N.V. (2001). KH domain: one motif, two folds. Nucl. Acids Res. 29(3), 638-643.

Jackson, R.N., Lavin, M., Carter, J., and Wiedenheft, B. (2014). Fitting CRISPR-associated Cas3 into the helicase family tree. Curr. Opin. Struct. Biol. 24, 106-114.

Kelley, L.A., Mezulis, S., Yates, C.M., Wass, M.N. and Sternberg, M.J. (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 10, 845-858.

Hollingworth, D., Candel, A., Nicastro, G., Martin, S., Briata, P., Gherzi, R. and Ramos, A. (2012). KH domains with impaired nucleic acid binding as a tool for functional analysis. Nucl. Acids Res. 40(14), pp.6873-6886.

Jarmoskaite, I., and Russell, R. (2014). RNA helicase proteins as chaperones and remodelers. Annu. Rev. Biochem. 83, 697-725.

Kim, H., Kim, Y., and Jeoung, D. (2017). DDX53 Promotes Cancer stem cell-like properties and autophagy. Mol. Cells. 40(1), 54-65.

Kim, Y., and Jeoung, D. (2008). Role of CAGE, a novel cancer/testis antigen, in various cellular processes, including tumorigenesis, cytolytic T lymphocyte induction, and cell motility. J. Microbiol. Biotechnol. 18(3), 600-610.

Kim, Y., Park, H., Park, D., Lee, Y.S., Choe, J., Hahn, J.H., Lee, H., Kim, Y.M., and Jeoung, D. (2010). Cancer/testis antigen CAGE exerts negative regulation on p53 expression through HDAC2 and confers resistance to anti-cancer drugs. J. Biol. Chem. 285(34), 25957-25968.

Lacroix, L., Lienard, H., Labourier, E., Djavaheri-Mergny, M., Lacoste, J., Leffers, H., Tazi, J., Helene, C. and Mergny, J.L. (2000). Identification of two human nuclear proteins that recognise the cytosine-rich strand of human telomeres in vitro. Nucl. Acids Res. 28, 1564-1575.

Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9(4), pp.357-359.

Leitao, A.L., Costa, M.C., and Enguita, F.J. (2015). Unzippers, resolvers and sensors: a structural and functional biochemistry tale of RNA helicases. Int. J. Mol. Sci. 16(2), 2269-2293.

Lewis, H., Musunuru, K., Jensen, K., Edo, C., Chen, H., Darnell, R. and Burley, S. (2000). Sequence-Specific RNA Binding by a Nova KH Domain. Cell, 100(3), pp.323-332.

Lin, J., Chen, Q., Yang, J., Qian, J., Deng, Z., Qian, W., Chen, X., Ma, J., Xiong, D., Ma, Y., An, C. and Tang, C. (2014). DDX43 promoter is frequently hypomethylated and may predict a favorable outcome in acute myeloid leukemia. Leukemia Res. 38(5), pp.601-607.

Lin, J., Ma, J.C., Yang, J., Yin, J.Y., Chen, X.X., Guo, H., Wen, X.M., Zhang, T.J., Qian, W., Qian, J. et al. (2018). Arresting of miR-186 and releasing of H19 by DDX43 facilitate tumorigenesis and CML progression. Oncogene, 37, 2432-2443.

Linder, P. (2006). Dead-box proteins: a family affair--active and passive players in RNP-remodeling. Nucl. Acids Res. 34(15), 4168-4180.

Linder, P., and Jankowsky, E. (2011). From unwinding to clamping - the DEAD box RNA helicase family. Nat. Rev. Mol. Cell Biol. 12(8), 505-516.

Linley, A., Mathieu, M., Miles, A., Rees, R., McArdle, S. and Regad, T. (2012). The Helicase HAGE Expressed by Malignant Melanoma-Initiating Cells Is Required for Tumor Cell Proliferation in Vivo. J. Biol. Chem. 287(17), pp.13633-13643.

Lohman, T.M., and Bjornson, K.P. (1996). Mechanisms of helicase-catalyzed DNA unwinding. Annu. Rev. Biochem. 65, 169-214.

Liu, Z. (2001). Structural Basis for Recognition of the Intron Branch Site RNA by Splicing Factor 1. Science, 294(5544), pp.1098-1102.


Martelange, V., De, S.C., De, P.E., Lurquin, C., and Boon, T. (2000). Identification on a human sarcoma of two new genes with tumor-specific expression. Cancer Res. 60(14), 3848-3855.

Mathieu, M.G., Linley, A.J., Reeder, S.P., Badoual, C., Tartour, E., Rees, R.C., and McArdle, S.E. (2010). HAGE, a cancer/testis antigen expressed at the protein level in a variety of cancers. Cancer Immunol. 10, 2.

Ma, N., Xu, H.E., Luo, Z., Zhou, J., Zhou, Y. and Liu, M. (2017). Expression and significance of DDX43 in lung adenocarcinoma. Pak. J. Pharm. Sci. 30, 1491-1496.

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), p.10.

Nakel, K., Hartung, S.A., Bonneau, F., Eckmann, C.R. and Conti, E. (2010). Four KH domains of the C.elegans Bicaudal-C ortholog GLD-3 form a globular structural platform. RNA, 16, 2058-2067.

Nazarov, I., Bakhmet, E. and Tomilin, A. (2019). KH-Domain Poly(C)-Binding Proteins as Versatile Regulators of Multiple Biological Processes. Biochemistry (Moscow), 84(3), pp.205-219.

Nicastro, G., García-Mayoral, M., Hollingworth, D., Kelly, G., Martin, S., Briata, P., Gherzi, R. and Ramos, A. (2012). Noncanonical G recognition mediates KSRP regulation of let-7 biogenesis. Nat. Struct. Mol. Biol. 19(12), pp.1282-1286.

Nicastro, G., Taylor, I. and Ramos, A. (2015). KH–RNA interactions: back in the groove. Curr. Opin. Struct. Biol. 30, pp.63-70.

Oddone, A., Lorentzen, E., Basquin, J., Gasch, A., Rybin, V., Conti, E. and Sattler, M. (2007). Structural and biochemical characterization of the yeast exosome component Rrp40. EMBO Rep. 8, 63-69.

Parsyan, A., Svitkin, Y., Shahbazian, D., Gkogkas, C., Lasko, P., Merrick, W.C., and Sonenberg, N. (2011). mRNA helicases: the tacticians of translational control. Nat. Rev. Mol. Cell Biol. 12(4), 235-245.

Patel, S.S., and Picha, K.M. (2000). Structure and function of hexameric helicases. Annu. Rev. Biochem. 69, 651-697.

Pause, A., Methot, N., Svitkin, Y., Merrick, W.C., and Sonenberg, N. (1994). Dominant negative mutants of mammalian translation initiation factor eIF-4A define a critical role for eIF-4F in cap-dependent and cap-independent initiation of translation. EMBO J. 13(5), 1205-1215.

Pause, A., and Sonenberg, N. (1992). Mutational analysis of a DEAD box RNA helicase: the mammalian translation initiation factor eIF-4A. EMBO J. 11(7), 2643-2654.


Pyle, A.M. (2008). Translocation and unwinding mechanisms of RNA and DNA helicases. Annu. Rev. Biophys. 37, 317-336.

Quinlan, A. and Hall, I. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), pp.841-842.

Roman-Gomez, J., Jimenez-Velasco, A., Agirre, X., Castillejo, J.A., Navarro, G., San Jose-Eneriz, E., Garate, L., Cordeu, L., Cervantes, F., Prosper, F. et al. (2007). Epigenetic regulation of human cancer/testis antigen gene, HAGE, in chronic myeloid leukemia. Haematologica, 92, 153-162.

Schutz, P., Karlberg, T., van den Berg, S., Collins, R., Lehtio, L., Hogbom, M., Holmberg-Schiavone, L., Tempel, W., Park, H.W., Hammarstrom, M., Moche, M., Thorsell, A.G., and Schuler, H. (2010). Comparative structural analysis of human DEAD-box RNA helicases. PLoS. One. 5(9), e12791.

Shen, Y. and Bax, A. (2010) SPARTA+: a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. J. Biomol. NMR, 48, 13-22.

Sidiqi, M., Wilce, J.A., Vivian, J.P., Porter, C.J., Barker, A., Leedman, P.J., and Wilce, M.C. (2005). Structure and RNA binding of the third KH domain of poly(C)-binding protein 1. Nucl. Acids Res. 33(4), 1213-1221.

Singleton, M.R., Dillingham, M.S., and Wigley, D.B. (2007). Structure and mechanism of helicases and nucleic acid translocases. Annu. Rev. Biochem. 76, 23-50.

Siomi, H., Matunis, M.J., Michael, W.M., and Dreyfuss, G. (1993). The pre-mRNA binding K protein contains a novel evolutionarily conserved motif. Nucl. Acids Res. 21(5), 1193-1198.

Siomi, H., Choi, M., Siomi, M., Nussbaum, R. and Dreyfuss, G. (1994). Essential role for KH domains in RNA binding: Impaired RNA binding by a mutation in the KH domain of FMR1 that causes fragile X syndrome. Cell, 77(1), pp.33-39.

Tanner, N.K., Cordin, O., Banroques, J., Doere, M., and Linder, P. (2003). The Q motif: a newly identified motif in DEAD box helicases may regulate ATP binding and hydrolysis. Mol. Cell, 11(1), 127-138.

Talwar, T., Vidhyasagar, V., Qing, J., Guo, M., Kariem, A., Lu, Y., Singh, R., Lukong, K. and Wu, Y. (2017). The DEAD-box protein DDX43 (HAGE) is a dual RNA-DNA helicase and has a K-homology domain required for full nucleic acid unwinding activity. J. Biol. Chem. 292(25), pp.10429-10443.

Terranova, C., Tang, M., Orouji, E., Maitituoheti, M., Raman, A., Amin, S., Liu, Z. and Rai, K. (2018). An integrated platform for genome-wide mapping of chromatin states using high-throughput ChIP-sequencing in tumor tissues. J. Vis. Exp. (134).

Thisted, T., Lyakhov, D. and Liebhaber, S. (2001). Optimized RNA targets of two closely related triple KH domain proteins, heterogeneous nuclear ribonucleoprotein K and αCP-2KL, suggest distinct modes of RNA recognition. J. Biol. Chem. 276(20), 17484-17496.

Tomonaga, T. and Levens, D. (1996). Activating transcription from single stranded DNA. Proc. Natl. Acad. Sci. USA, 93, 5830-5835.

Tuerk, C. and Gold, L. (1990). Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science, 249(4968), 505-510.

Tuteja, N., Tarique, M., Banu, M.S., Ahmad, M., and Tuteja, R. (2014). Pisum sativum p68 DEAD-box protein is ATP-dependent RNA helicase and unique bipolar DNA helicase. Plant Mol. Biol. 85, 639-651.

Umate, P., Tuteja, N., and Tuteja, R. (2011). Genome-wide comprehensive analysis of human helicases. Commun. Integr. Biol. 4(1), 118-137.

Valverde, R., Edwards, L., and Regan, L. (2008). Structure and function of KH domains. FEBS J. 275(11), 2712-2726.

Valverde, R., Pozdnyakova, I., Kajander, T., Venkatraman, J., and Regan, L. (2007). Fragile X mental retardation syndrome: structure of the KH1-KH2 domains of fragile X mental retardation protein. Structure. 15(9), 1090-1098.

Wen, X.M., Zhang, T.J., Ma, J.C., Zhou, J.D., Xu, Z.J., Zhu, X.W., Yuan, Q., Ji, R.B., Chen, Q., Deng, Z.Q. et al. (2019). Establishment and molecular characterization of decitabine-resistant K562 cells. J. Cell. Mol. Med. 23, 3317-3324.

Williamson, M.P. (2013). Using chemical shift perturbation to characterise ligand binding. Prog. Nucl. Mag. Res. Sp. 73, 1-16.

Wu, H., Zhai, L., Chen, P. and Xi, X. (2019). DDX43 prefers single strand substrate and its full binding activity requires physical connection of all domains. Biochem. Biophys. Res. Commun. 520(3), 594-599.

Yoga, Y.M., Traore, D.A., Sidiqi, M., Szeto, C., Pendini, N.R., Barker, A., Leedman, P.J., Wilce, J.A., and Wilce, M.C. (2012). Contribution of the first K-homology domain of poly(C)-binding protein 1 to its affinity and specificity for C-rich oligonucleotides. Nucl. Acids Res. 40(11), 5101-5114.

Yu, C.H., Yang, N., Bothe, J., Tonelli, M., Nokhrin, S., Dolgova, N.V., Braiterman, L., Lutsenko, S. and Dmitriev, O.Y. (2017). The metal chaperone Atox1 regulates the activity of the human copper transporter ATP7B by modulating domain dynamics. J. Biol. Chem. 292, 18169-18177.

Zhang, Y., Liu, T., Meyer, C., Eeckhoute, J., Johnson, D., Bernstein, B., Nussbaum, C., Myers, R., Brown, M., Li, W. and Liu, X. (2008). Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9(9), R137.

Case #00960521 - Permission to use figure from an article [ ref:_00D30oeGz._5000c1x4bLP:ref ]

[email protected]

on behalf of

[email protected]

Wed 12/25/2019 3:20 AM

To:

 Yadav, Manisha

Dear Manisha Yadav,

Thank you for contacting the Copyright Clearance Center Rightslink Customer Support. We provide permissions to reproduce copyrighted materials on behalf of copyright holders who list their titles with us. My name is Nathan and I will be happy to assist you today.

I hope this email finds you well. Upon review of your email requesting permission to reuse a figure from the article "Fitting CRISPR-associated Cas3 into the Helicase Family Tree", I checked our database and can confirm that this article is available under the Creative Commons CC-BY-NC-ND license and permits non-commercial use of the work as published, without adaptation or alteration provided the work is fully attributed.

This license lets others reuse the work for any purpose, including commercially; however, it cannot be shared with others in adapted form, and credit must be provided to you.

I wish you all the best moving forward!

If you have any further questions please don't hesitate to contact a Customer Account Specialist at 855-239-3415 Monday-Friday, 24 hours/day.

Kind regards Nathan

Nathan Tomlinson Customer Account Specialist Copyright Clearance Center 222 Rosewood Drive Danvers, MA 01923 www.copyright.com

Toll Free US +1.855.239.3415 International +1.978-646-2600

ref:_00D30oeGz._5000c1x4bLP:ref

------ Original Message ------

From: Yadav, Manisha [[email protected]]

Sent: 12/20/2019 2:37 PM

To: [email protected]

Subject: Permission to use figure from an article

Dear Sir/Madam,

My name is Manisha Yadav, I am a student in University of Saskatchewan (Canada) working on my MSc. thesis. I would like to ask for your permission to use “Figure 1a. A schematic of the core helicase domain.” from Jackson, R.N., Lavin, M., Carter, J., and Wiedenheft, B. (2014). Fitting CRISPR-associated Cas3 into the helicase family tree. Curr. Opin. Struct. Biol. 24, 106-114.

Thank you!

Manisha Yadav

MSc student

Department of Biochemistry, Microbiology and Immunology

2B32 Health Science Building University of Saskatchewan Saskatoon, SK Canada S7N 5E5

Telephone (639) 317-4634

This message (including attachments) is confidential, unless marked otherwise. It is intended for the addressee(s) only. If you are not an intended recipient, please delete it without further distribution and reply to the sender that you have received the message in error. ref:_00D30oeGz._5000c1x4bLP:ref
