Contiguously-Hydrophobic Sequences Are Functionally Significant

bioRxiv preprint doi: https://doi.org/10.1101/2021.09.02.458776; this version posted September 4, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Contiguously-hydrophobic sequences are functionally significant throughout the human exome

Ruchi Lohia1§, Matthew E.B. Hansen2†, Grace Brannigan1,3†*

1Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA; 2Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA; 3Department of Physics, Rutgers University, Camden, NJ, USA

*For correspondence: [email protected] (GB)
†These authors contributed equally to this work
§Present address: Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, USA

Abstract

Hydrophobic interactions have long been established as essential to stabilizing structured proteins as well as drivers of aggregation, but the impact of hydrophobicity on the functional significance of sequence variants has rarely been considered in a genome-wide context. Here we test the role of hydrophobicity on functional impact using a set of 70,000 disease and non-disease associated single nucleotide polymorphisms (SNPs), using enrichment of disease-association as an indicator of functionality. We find that functional impact is uncorrelated with hydrophobicity of the SNP itself, and only weakly correlated with the average local hydrophobicity, but is strongly correlated with both the size and minimum hydrophobicity of the contiguous hydrophobic domain that contains the SNP. Disease-association is found to vary by more than 6-fold as a function of contiguous hydrophobicity parameters, suggesting utility as a prior for identifying causal variation.
We further find signatures of differential selective constraint on domain hydrophobicity, and that SNPs splitting a long hydrophobic region or joining two short regions of contiguous hydrophobicity are particularly likely to be disease-associated. Trends are preserved for both aggregating and non-aggregating proteins, indicating that the role of contiguous hydrophobicity extends well beyond aggregation risk.

Statement of Significance

Proteins rely on the hydrophobic effect to maintain structure and interactions with the environment. Surprisingly, no signs that amino acid hydrophobicity influences natural selection have been detected using modern genetic data. This may be because analyses that treat each amino acid separately do not reveal significant results, which we confirm here. However, because the hydrophobic effect becomes more powerful as more hydrophobic molecules are introduced, we tested whether unbroken stretches of hydrophobic amino acids are under selection. Using genetic variant data from across the human genome, we found evidence that selection pressure increases continually with the length of the unbroken hydrophobic sequence. These results could lead to improvements in a wide range of genomic tools as well as insights into disease and protein evolutionary history.

Introduction

An ambitious task for human genetics is discovering the genetic basis for heritable traits and disease risks. In principle, genome-wide association studies (GWAS) can identify variants associated with an effect, independent of the genomic or protein context of the variant (Visscher et al., 2017).
In practice, associated variants are often not directly causal but instead passively “tag” haplotypes containing the true causal variants that are either unknown or observed but not directly tested for association (Cannon and Mohlke, 2018). The pattern of linkage disequilibrium that defines “tagging” is population dependent. Any information on the protein-level function of the local sequence surrounding an associated variant can help fine-map and rank putative causal variants from a list of associations (Gallagher and Chen-Plotkin, 2018).

All protein level methods for variant-to-function prediction rely on some form of residue characterization. In addition to physicochemical properties (Stone, 2005; Niroula et al., 2015; López-Ferrando et al., 2017; Hecht et al., 2015; Popov et al., 2019) these may include evolutionary conservation (Ng and Henikoff, 2001; Thomas and Kejariwal, 2004; Stone, 2005; Capriotti et al., 2006; Choi et al., 2012; Hecht et al., 2015; Niroula et al., 2015) and structural propensities (Capriotti et al., 2004, 2005; Parthiban et al., 2006; Wainreb et al., 2010; Popov et al., 2019; Ittisoponpisan et al., 2019; Iqbal et al., 2020). Such methods may also rely on known protein structures in order to incorporate properties such as local secondary structure and solvent accessibility (Iqbal et al., 2020). In the absence of structural information, however, computational prediction accuracy is low, and full three-dimensional structures have been experimentally solved for a tiny fraction of known proteins (Rose et al., 2016; Mir et al., 2017). Fewer than 35% of human protein coding genes have structures deposited in the protein data bank (Prlić et al., 2016). With few exceptions, therefore, these methods have not been applied on a genome-wide scale.

Most functional proteins are intrinsically modular.
For example, a given protein may include multiple secondary structure elements, ordered and disordered regions, transmembrane and soluble portions, or stretches of highly charged regions followed by stretches of highly hydrophobic regions. In the absence of structural information, bioinformatics “variant-to-function” prediction methods that use sequence context typically neglect modularity, by defining the local sequence using a window of constant length, centered around the mutation (Teng et al., 2010; Capriotti et al., 2006; Capriotti and Fariselli, 2017; Hecht et al., 2015). This definition of sequence context can weaken predictive power, particularly for SNPs near the boundary between adjacent protein modules with distinct functional roles.

Despite the common framing that “sequence determines structure which determines function”, single nucleotide polymorphisms (SNPs) can alter function while leaving the protein structure essentially unchanged. For example, intrinsically disordered proteins (IDPs) lack unique structure yet are both essential for many critical biological pathways (Ward et al., 2004; Dyson and Wright, 2005; Uversky, 2013, 2019; Panchenko and Babu, 2015) and sensitive to sequence (Weinreb et al., 1996; Uversky et al., 2012; Cuanalo-Contreras et al., 2013; Patel et al., 2015; Uversky, 2015; Tovo-Rodrigues et al., 2016).

Although not all functional proteins have well-defined structures, we showed in a previous study (Lohia et al., 2019) that a long intrinsically disordered protein can retain modularity of interactions. Fully-atomistic molecular dynamics simulations of a 92-residue disordered protein, run using enhanced sampling, revealed a soft network of tertiary contacts between contiguous stretches of hydrophobic residues.
In order to distinguish these stretches from any other more traditional segment or domain definition, we follow terminology more common to polymer physics and call these stretches “blobs”. Blobs may contain secondary structure elements, but are not required to do so.

Here we demonstrate that hydrophobic blobs can be used as generic predictors of functional modules, by testing for enrichment of functional effects throughout the proteome. Our hypothesis was motivated by the cooperativity of the hydrophobic effect (Jiang et al., 2017) and the recognition that hydrophobic residues are likely to be buried (Lins et al., 2003) and contain a high density of interactions. Any curve drawn through such a cluster, including a curve representing the peptide chain, will contain contiguous hydrophobic residues (Figure 1).

In the present study, we used a large database of protein sequences to determine the extent to which hydrophobicity-based sequence segmentation captures functional modules. Our test involves three steps: 1) segmentation of the protein into the blobs (“blobulation”), 2) characterization of the blobs based on various physicochemical properties and 3) testing whether SNPs with “known” functional impact are enriched or depleted genome-wide in blobs with particular properties. We test for enrichment of hydrophobic blobs among disease-associated SNPs, repeating the calculation with a range of hydrophobicity thresholds and stratifying according to hydrophobic blob length.
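The segmentation and enrichment steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration, since the text so far does not specify them: the Kyte-Doolittle hydropathy scale, the default threshold of 0.0, the minimum blob length of 4, and the definition of enrichment as the disease fraction among SNPs in long hydrophobic blobs divided by the disease fraction among all SNPs. The names blobulate, Blob, and disease_enrichment are hypothetical, and blob length stands in for the fuller characterization in step 2.

```python
from dataclasses import dataclass
from itertools import groupby

# Kyte-Doolittle hydropathy scale. Using this particular scale is an
# assumption made for illustration; the text above does not name one.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


@dataclass(frozen=True)
class Blob:
    start: int         # index of the first residue (0-based, inclusive)
    end: int           # index one past the last residue (exclusive)
    hydrophobic: bool  # True for a contiguous hydrophobic stretch

    @property
    def length(self) -> int:
        return self.end - self.start


def _runs(flags):
    """Return (start, end, flag) for each maximal run of equal flags."""
    runs, pos = [], 0
    for flag, group in groupby(flags):
        n = len(list(group))
        runs.append((pos, pos + n, flag))
        pos += n
    return runs


def blobulate(sequence, threshold=0.0, min_length=4):
    """Segment a sequence into hydrophobic and non-hydrophobic blobs.

    Residues above `threshold` are flagged hydrophobic. Hydrophobic runs
    shorter than `min_length` (an assumed cutoff) are relabeled as
    non-hydrophobic and therefore merge with their neighbors.
    """
    flags = [KYTE_DOOLITTLE[aa] > threshold for aa in sequence]
    for start, end, flag in _runs(flags):
        if flag and end - start < min_length:
            flags[start:end] = [False] * (end - start)
    return [Blob(s, e, f) for s, e, f in _runs(flags)]


def disease_enrichment(snps, blobs, min_blob_length):
    """Disease fraction in long hydrophobic blobs relative to all SNPs.

    `snps` is a list of (position, is_disease) pairs. Returns None when
    the ratio is undefined (no SNPs in the class, or no disease SNPs).
    """
    def in_class(position):
        blob = next(b for b in blobs if b.start <= position < b.end)
        return blob.hydrophobic and blob.length >= min_blob_length

    selected = [disease for position, disease in snps if in_class(position)]
    overall = [disease for _, disease in snps]
    if not selected or not any(overall):
        return None
    return (sum(selected) / len(selected)) / (sum(overall) / len(overall))


if __name__ == "__main__":
    toy_sequence = "DKLLVIAGSEEKRAVDE"  # made-up peptide, not real data
    toy_snps = [(3, True), (5, True), (4, False),
                (9, False), (14, True), (16, False)]  # made-up labels
    blobs = blobulate(toy_sequence)
    print(blobs)
    print(disease_enrichment(toy_snps, blobs, min_blob_length=4))
```

Repeating the call to blobulate over a range of threshold values, and the call to disease_enrichment over a range of min_blob_length values, mirrors the stratification by hydrophobicity threshold and blob length described above. Because segment boundaries follow the sequence itself, a SNP near a module boundary is assigned to one blob rather than to a fixed-length window that straddles two modules.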
This process reveals a striking dependence that is muted when a standard moving window approach is used instead, and is completely lost when the sequence context is neglected. We test for consistency against population frequency data,
