Proteome-Level Assessment of Origin, Prevalence and Function of Leucine- Aspartic Acid (LD) Motifs

bioRxiv preprint doi: https://doi.org/10.1101/278903; this version posted March 10, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Proteome-level assessment of origin, prevalence and function of Leucine- Aspartic Acid (LD) motifs Tanvir Alam1,2,#, Meshari Alazmi1,2,#, Rayan Naser1,3,#, Franceline Huser1,3,#, Afaque A. Momin1,3,#, Katarzyna W. Walkiewicz1,3,5, Christian G. Canlas4, Raphaël G. Huser2, Amal J. Ali3, Jasmeen Merzaban3, Vladimir B. Bajic1,2,*, Xin Gao1,2,*, Stefan T. Arold1,3,* King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia 1. Computational Bioscience Research Center (CBRC); 2. Division of Computer, Electrical and Mathematical Sciences & Engineering (CEMSE); 3. Division of Biological and Environmental Sciences and Engineering (BESE); 4. Imaging and Characterization Core Lab 5. Current address: NanoTemper Technologies GmbH, Floessergasse 4, Munich, Germany * Correspondence should be sent to STA: [email protected] or XG: [email protected] or VBB: [email protected] # Contributed equally Short Title: Proteome-wide prediction and function of LD motifs KEYWORDS focal adhesion; protein-ligand interaction; nuclear export signal; machine-learning; evolution ABSTRACT: Short Linear Motifs (SLiMs) contribute to almost every cellular function by connecting appropriate protein partners. Accurate prediction of SLiMs is difficult due to their shortness and sequence degeneracy. Leucine-aspartic acid (LD) motifs are SLiMs that link paxillin family proteins to factors controlling (cancer) cell adhesion, motility and survival. The existence and importance of LD motifs beyond the paxillin family is poorly understood. To enable a proteome-wide assessment of these motifs, we developed an active-learning based framework that iteratively integrates computational predictions with experimental validation. Our analysis of the human proteome identified a dozen proteins that contain LD motifs, all being involved in cell adhesion and migration, and revealed a new type of inverse LD motif consensus. Our evolutionary analysis suggested that LD motif signalling originated in the common unicellular ancestor of opisthokonts and amoebozoa by co-opting nuclear export sequences. Inter-species comparison revealed a conserved LD signalling core, and reveals the emergence of species-specific adaptive connections, while maintaining a strong functional focus of the LD motif interactome. Collectively, our data elucidate the mechanisms underlying the origin and adaptation of an ancestral SLiM. 1 bioRxiv preprint doi: https://doi.org/10.1101/278903; this version posted March 10, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. INTRODUCTION To a large extent, cellular signal transduction networks rely on the recognition of short linear motifs (SLiMs) by their cognate ligand binding domains 1. These motifs are contained on a single contiguous amino acid stretch of typically less than 15 residues, and do not require to be embedded in a three- dimensional protein framework to be functional. The binding energy of many SLiMs is dominated by only a few residues, resulting in moderate to low binding affinities (with dissociation constants Kd’s of 1 – 150 µM) that are ideal for mediating transient signalling interactions 2. On the one hand, this characteristic facilitates emergence and diversification of SLiMs, and hence the restructuration and evolutionary adaptation of an organism’s interactome 3. On the other hand, the resulting sequence motif degeneration and binding promiscuity hamper our capacity to computationally identify SLiMs and their biologically relevant binding partners 4,5. LD motifs were first described in 1996 by Brown et al. as novel SLiMs that associate paxillin with the cell adhesion proteins vinculin and the focal adhesion kinase (FAK) 6,7. Paxillin and its family members Hic-5 (also called TGFB1I1 or ARA55) and leupaxin 8-10 contain in their N-terminal region four or five of these motifs, named after the first two amino acids of their sequence consensus LDXLLXXL 7 (Fig. 1A,B). The N-terminal LD motifs, together with other protein-protein interaction sites located in this region, orchestrate the dynamic assembly of different signalling complexes 9. The C-terminal region of paxillin family proteins, which contains four double zinc-finger lin-11, isl-1, mec-3 (LIM) domains, mediates recruitment to integrin clusters at sites of cellular adhesion. The LIM domains also mediate nuclear localisation and nuclear receptor interactions (reviewed in 10). Using these diverse protein interaction motifs, paxillin family proteins establish a communication between cell adhesions and the nucleus, functionally linking gene expression with cell attachment 11-13. Thus, paxillin family members play important roles in embryonic development, epithelial morphogenesis and the immune response 14. As mediators of cellular motility and survival, paxillin family proteins are also key factors governing associated pathological conditions, such as inflammation, cardiovascular disease, and the development, spread and metastasis of tumours 9,10,12,14-16. Moreover, paxillin LD motifs are targets of a subset of human papilloma viruses, which cause cervical cancers 17,18. More than a dozen proteins were shown to interact with paxillin family LD motifs, using LD motif binding domains (LDBDs) of at least six different domain architectures (reviewed in 10). Most of these proteins are important players in membrane-proximal intracellular structures that connect cell-surface receptors with signalling and/or the cytoskeleton, mostly in focal adhesions or similar structures (FAK, PYK2, vinculin, talin, GIT, parvin, Bcl-2) 6,16,19-24. Other proteins are involved in nuclear export of proteins (XPO1, also called CRM1) or the transport of mRNA (PABP1) for mRNA delivery to sites of cellular adhesion 11,25,26. In all cases where experimental 3D structures are available, LD motifs form amphipathic helices that use the hydrophobic side chains (LDXLLXXL) of the motif to dock onto an elongated hydrophobic patch on LDBDs, while the negative charge (LDXLLXXL) forms ionic interactions with basic charges next to the hydrophobic patch (Fig. 1C) 10,19,22,24. The biological importance of LD motifs, and the discovery of more than a dozen LD motif– interacting proteins (many of which have evolved different LDBD folds) 10, motivated efforts to discover LD motifs outside the paxillin family. However, to date, motifs of the LDXLLXXL consensus were only experimentally confirmed in two cellular proteins: the deleted in liver cancer 1 (DLC1) tumour suppressor gene 24,27, and the rotavirus ‘X’-associated non-structural protein (RoXaN) 25. In addition to 2 bioRxiv preprint doi: https://doi.org/10.1101/278903; this version posted March 10, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. these, gelsolin (an actin binding, severing, and capping protein mediating osteoclastic actin cytoskeletal organisation) was found to bind the LDBD of PYK2 through a similar motif (LDXALXXL) 28. Some authors further include motifs from the E6 associated protein (E6AP) and the E6 binding proteins (E6BP), two proteins that interact with E6 proteins of a subset of papillomaviruses. However, these motifs deviate from the LD motif consensus sequence (E6AP: LQELLGEE; E6BP: LEEFLGDYR) and/or structure (E6BP; discussed below) 17,18,29-31. The fastest and most comprehensive way to search for LD motifs on a proteome-wide level would be through bioinformatic approaches. However, current algorithms produce too many false positives to be useful. We, therefore, developed an algorithm specific for LD motifs, which faced two challenges. Firstly, identification of LD motifs requires the (normally unknown) 3D structural context of candidate sequences. Secondly, the number of bona fide LD motifs is too low for classical machine- learning approaches. We overcame these challenges through the development of a support vector machine (SVM)-based model to estimate the structural content of a candidate sequence and the use of experimental feedback. We show that the resulting LD motif finder (LDMF) tool enables a proteome- scale prediction of LD motifs without a high rate of false positives, while still being able to distinguish LD motifs from highly similar short helical interaction motifs. Combined with experimental validation we used LDMF to assess the prevalence and function of LD motifs in the human proteome, to investigate the motif’s origin in unicellular eukaryotes, and probe its use in bacterial and viral pathogens. Our integrative method can serve as a guide for the identification of other low-abundance SLiMs. RESULTS Features that characterise bona fide LD motifs in silico. To determine features that characterise LD motifs in silico, we analysed known LD motif–containing proteins using algorithms to predict protein disorder (MetaPrDos32), as well as secondary (PSIPRED33) and tertiary protein structures (SwissModel 34 and

Load more