Computational Studies of Charge in G Protein-Coupled Receptors

A thesis submitted to the University of Manchester for the degree of MPhil Bioinformatics in the Faculty of Life Sciences

2013

Spyros Charonis

Contents

Abstract
Declaration
Copyright
Acknowledgements
Abbreviations
1 Introduction
  1.1 Biology in the Silicon Era
  1.2 Structural Biology and Bioinformatics
  1.3 G Protein-coupled Receptors
    1.3.1 GPCR Classification and Nomenclature
    1.3.2 Structural modularity of GPCRs
    1.3.3 GPCR Functional Mechanisms
    1.3.4 GPCRs as Drug Targets
  1.4 Electrostatics
    1.4.1 pH and pKa
    1.4.2 pH dependence of charge state for amino acids
    1.4.3 Electrostatics in Protein Interactions
    1.4.4 Modeling Electrostatics
      1.4.4.1 Finite Difference Poisson Boltzmann
      1.4.4.2 Debye-Hückel Theory
  1.5 Bioinformatics Tools and Methodologies
    1.5.1 Sequence Analysis Methods
      1.5.1.1 BLAST and PSI-BLAST
    1.5.2 Structure Prediction
      1.5.2.1 Modeling
    1.5.3 GPCR Information Repositories
  1.6 Aims and Objectives
2 Methods
  2.1 Sequence Analysis Methodologies
    2.1.1 Detecting Low-Complexity Regions
    2.1.2 PSI-BLAST
  2.2 Structural Analysis Methodologies
  2.3 GPCR Dataset Generation
  2.4 PDB File Processing
  2.5 pKa Calculations
  2.6 Molecular Visualization
3 Results
  3.1 Empirically Defined GPCR Topology
  3.2 GPCR Sequence Dataset
    3.2.1 The distribution of ionizable groups
    3.2.2 Characterizing major peaks
  3.3 Shortlisting charged residues for pKa calculations
  3.4 pKa predictions
    3.4.1 Database and Literature Searches
    3.4.2 Predicted pKa Values and Charge States for Residues
4 Discussion
  4.1 7TM Charge Distribution
  4.2 Assessment of the Hypothesis
  4.3 Conclusions
References
Appendix A GPCR Dataset Sequence Names
Appendix B Unix text filtering commands

Final Word Count: 19,454

Figures and Tables

Figure 1.1 Layout of a GPCR
Figure 1.2 Hierarchy of the GRAFS Classification Scheme
Figure 1.3 Generic Architecture of GPCRs
Figure 1.4 GPCR Family Structural Coverage
Figure 1.5 Overview of GPCR Functional Mechanism
Figure 1.6 Electric field created by two oppositely charged bodies
Figure 1.7 Deprotonation of carboxylic side-chain group
Figure 1.8 Protonation of amino side-chain group
Figure 1.9 Dielectric environment of protein-solvent systems
Figure 1.10 Debye Length vs. Counterionic Concentration
Figure 1.11 The steps involved in homology modeling
Figure 2.1 Dotplots of template sequence and control sequence
Figure 2.2 Cylinder of pseudo-atoms enclosing GPCR model
Figure 2.3 Methods for Charge Studies on GPCRs
Figure 2.4 Structure of human β2 adrenergic
Figure 3.1 GPCR Dataset Composition
Figure 3.2 Frequency of Ionizable Groups vs. Cartesian Location
Figure 3.3 Sample output of histogram analysis script
Figure 3.4 Filtering residues for pKa calculations
Figures 3.5 – 3.12 Residue Locations

Table 1.1 Comparing the GRAFS and A-F Classification Systems
Table 1.2 Solved GPCR Structures
Table 1.3 pKa values for charged side-chains
Table 1.4 pH-dependent variation of charge states
Table 2.1 PSI-BLAST search parameters
Table 2.2 Charged atom indicators for ionizable side-chains
Table 3.1 Delineation of GPCR Domains in Cartesian Coordinates
Table 3.2 GPCR Dataset Sequence Composition
Table 3.3 Peak groups selected for charge frequency distribution
Table 3.4 Matching Histogram Peaks to Conserved Residues
Table 3.5 Minor Peak Shortlisted Residues
Table 3.6 pKa Predictions
Table 3.7 Mapping model residues onto wild-type GPCRs
Table 3.8 pH-dependent variation of charge states
Table 3.9 Predicted charge states of residues

Abstract

Electrostatic interactions play significant roles in the functioning of almost all proteins, and thus have a significant impact on biological phenomena at the molecular level. At the epicenter of these interactions are amino acids that are electrically charged and thus participate in interactions involving proteins. The G protein-coupled receptors (GPCRs) comprise the largest known class of transmembrane receptors and are important in mediating an extremely diverse array of signal transduction pathways. The distribution of charged residues carrying ionizable groups was studied by creating a dataset of GPCRs using homology modeling. This dataset was used to study the distribution of charged residues, with particular emphasis on the transmembrane elements of GPCRs, in an attempt to correlate locations with significant amounts of charge to functional sites. Calculations of pKa were used to assess the functionality of such residues; there was no conclusive evidence that these amino acids are functional.

Ionizable groups localized to the transmembrane region of GPCRs were divided into two groups – high frequency residues and low frequency residues. After verifying that high frequency residues represented well known charged residues conserved throughout GPCRs, the residues having low frequencies were selected for pKa calculations. The Finite Difference Poisson-Boltzmann (FDPB) and Finite Difference Debye-Hückel (FD/DH) methods were used to calculate shifts in pKa values for a series of residues, so that their charge state could be predicted, and each residue location was queried in the literature for functional annotations. Few of the selected residues had annotations, suggesting perhaps that the scale of the dataset should be increased.

Declaration

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright Statement

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://www.campus.manchester.ac.uk/medialibrary/policies/intellectual-property.pdf), in any relevant Thesis restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/

Acknowledgements

I would firstly like to thank my supervisor, Dr. Jim Warwicker, for his continuous and expert advice, support and guidance throughout my time working on this project. Many special thanks go to my parents for their moral and financial support throughout all my years of education.

Abbreviations

BLAST       Basic Local Alignment Search Tool
DH          Debye-Hückel
ECL         Extracellular Loop
FD          Finite Difference
FD/DH       Finite Difference/Debye-Hückel
FDPB        Finite Difference Poisson-Boltzmann
GPCR        G protein-coupled receptor
ICL         Intracellular Loop
MSA         Multiple Sequence Alignment
PDB         Protein Data Bank
pH          potential of hydrogen
pKa         negative base-10 logarithm of the acid dissociation constant (Ka)
PSI-BLAST   Position Specific Iterated Basic Local Alignment Search Tool
TM          Transmembrane

Chapter 1. Introduction

The past fifty years have hosted an unprecedented transformation in the field of biology from a purely experimental science to one in which computational, informatics-based methodologies and principles have become an indispensable asset. The main vehicles of this transformation have been the introduction of low-cost computing technology, as well as the development of the paradigm-shaping World Wide Web and Internet. Virtually all areas of biomedical research currently make use of computational methodologies such as sequence alignment, protein structure prediction, database querying and text processing in some capacity, and these often rely on web-based technologies. Therefore, the recent shift in biology from the traditional experiment-centric paradigm to a computation-driven one reflects the necessity to systematically disseminate and interpret an overwhelming amount of data that is now routinely generated and made publicly available.

1.1 Biology in the Silicon Era

Bioinformatics is the scientific field devoted to the consistent management and extraction of useful information from biological data. Its inception came about from a necessity to efficiently organize, analyze, and interpret a hitherto unanticipated surge of biological data. Biology has, over the past decade, established itself as a data-intensive field; one of the consequences of this transformation is the need to adjust the methodologies used to uncover its operational principles accordingly. Indeed, a simple yet elegant example of how computational techniques can drive knowledge forward is found in the discovery of the APOA5 gene (Pennacchio et al., 2001). The study contributed the discovery of a gene encoding an important apolipoprotein involved in cholesterol and triglyceride homeostasis. Although this gene was completely unknown prior to the study, the discovery was made without employing any experimental methods, simply by aligning human and mouse genomic sequences and observing regions with unusually high similarity. As such, it serves to illustrate how purely computational techniques, e.g. sequence alignment, can provide the basis for revealing functional elements within a genome.
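The idea of scanning an alignment for regions of unusually high similarity can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used by Pennacchio et al.: it assumes the two sequences have already been aligned to equal length, and the function names, the window size, the 90% threshold and the toy sequences are all hypothetical choices made for the example.

```python
def window_identity(seq_a, seq_b, window=8):
    """Percent identity for each sliding window over two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if window < 1 or window > len(seq_a):
        raise ValueError("window must be between 1 and the alignment length")
    scores = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        # Gap characters ("-") never count as matches.
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        scores.append(100.0 * matches / window)
    return scores


def conserved_windows(seq_a, seq_b, window=8, threshold=90.0):
    """Start positions of windows whose identity meets the threshold."""
    scores = window_identity(seq_a, seq_b, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Toy, made-up sequences: only the central block (positions 4-11) is identical.
toy_human = "AAAACCCCGGGGTTTT"
toy_mouse = "TTTTCCCCGGGGAAAA"
print(conserved_windows(toy_human, toy_mouse))
```

In a real comparison the alignment itself would come from a dedicated tool, and the window size and threshold would need to be chosen with the expected background similarity of the two genomes in mind.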

The ultimate purpose of biomedical science - of which bioinformatics is an established subfield - is to understand the inner workings of organisms with the goal of improving health and disease treatment. Bioinformatics has not yet reached this potential, as its benefits are not yet tangible to the medical sciences, which operate on a macroscopic scale (i.e. organs, humans, populations). Systems biology aims to address this gap by providing a genomic-level and proteomic-level understanding of diseases, employing concepts such as gene regulatory networks. Nonetheless, systems biology is a relatively new field, and efforts to characterize pathophysiological states in terms of quantifiable changes in genomic and proteomic effects are still in their infancy. The bioinformatics boom has had its greatest impact on basic research, and has been driven by the emergence of experimental techniques that generate data in a high-throughput fashion: DNA sequencing, microarray expression, mass spectrometry, and more recently next-generation sequencing. Effectively dealing with the complexity and size of datasets that emerge from such technologies is a central theme in bioinformatics, and hence as a field it operates almost exclusively on the nanoscopic to microscopic scale (i.e. atomic, subcellular, and cellular).

1.2 Structural Biology and Bioinformatics

The field of structural biology began in the 1950s with the discovery of the double helical structure of DNA (solved by Watson and Crick in 1953) and with the determination of the structures of hemoglobin and myoglobin (the first protein structures that were solved by Perutz and Kendrew in 1958). The gradual improvement of experimental techniques in tandem with the exponential increase in computing power over the past decades has helped to shape it into one of the most important fields of biomedical science. Structural biology has developed into one of the most actively researched areas of modern biology, and has in turn spawned the field of structural bioinformatics: the branch of bioinformatics that deals with all aspects related to the analysis and prediction of the three-dimensional structure of biological macromolecules. The most functionally prominent biological molecules are proteins, and hence structural bioinformatics most often operates within the field of proteomics. The great promise of this field lies in the potential for high-resolution structural information about biological systems to unravel the functional mechanisms of these systems. The establishment of correlations and patterns between the sequence, structure, and function of macromolecules lies at the heart of structural bioinformatics. Computational methods are essential in this context because structural data is largely complex, which makes information processing techniques well suited to the development of efficient and systematic ways of transforming data into useful knowledge.

1.3 G Protein-coupled Receptors

One of the most fundamental functions of cells is receiving and transmitting chemical signals, and more generally interacting with their external environment. The interface where such interactions take place is known as the plasma membrane, which consists of a phospholipid bilayer that separates a cell’s interior from its exterior. The surface of the cell membrane is densely populated by protein molecules that function in all forms of signal transduction, such as antibodies, receptors, and ion channels, to name a few. G protein-coupled receptors (GPCRs) comprise the largest superfamily of plasma membrane-spanning (termed transmembrane) receptors (Lander et al., 2001; Venter et al., 2001), are responsible for the transduction of a vast range of signals, and account for circa 5% of genes in the human genome (Zhang et al., 2006), encoding 800 to 1000 distinct human proteins (Marchese et al., 1999; Venter et al., 2001). On account of their main function as versatile molecular sensors, GPCRs control a broad spectrum of intracellular biochemical changes which in turn modulate numerous aspects of cellular physiology, growth, and development (Pierce et al., 2002). Figure 1.1 depicts a simplistic rendering of a GPCR protein within a bilayer membrane.

Figure 1.1. Layout of a GPCR spanning the plasma membrane

Image adapted from (www.scientificimages.co.uk)

1.3.1 GPCR Classification and Nomenclature

The evolutionary origins of GPCRs have been traced to the first eukaryotes, after which they diverged in vertebrates into five major classes, summarized in the GRAFS classification system (Glutamate, Rhodopsin, Adhesion, Frizzled/Taste2 and Secretin), each with numerous subfamilies (Fredriksson et al., 2003). Rhodopsin is the largest of these families and currently the only one with structural representation in the PDB database. A brief overview of each family in the GRAFS scheme follows.

Glutamate family

This receptor family is named after the metabotropic (i.e. membrane receptors that act through a second messenger) glutamate receptors, which mediate glutamate responses in a variety of CNS (central nervous system) functions. Glutamate, the anionic derivative of glutamic acid, is the principal excitatory neurotransmitter in the CNS (Meldrum, 2000) and is hence a key component in neural pathways. The family also includes metabotropic GABA receptors, which respond to the chief inhibitory neurotransmitter (gamma-aminobutyric acid, or GABA) in the CNS. Calcium-sensing receptors and a number of orphan GPCRs are also included in the glutamate family, which contains 30 members.

Rhodopsin family

This family is made up of the largest number of receptors, with circa 280 non-olfactory receptors and 350 olfactory receptors (Zozulya et al., 2001), amounting to around 630 members. Olfactory receptors cover those GPCRs that are present on the cell membranes of neurons responsible for the detection of odor molecules (olfactory neurons). This is the only GPCR family at present for which structures have been solved. Four main groups of the rhodopsin family can be recognized phylogenetically, and these are designated as α, β, γ, and δ (Fredriksson et al., 2003).

Rhodopsin α-Group

This is the largest of the four groups of the rhodopsin family and consists of five main branches – prostaglandin, amine, opsin, melatonin, and MECA (Melanocortin, Endothelial differentiation sphingolipid, Cannabinoid and Adenosine) receptors, comprising 125 members (Fredriksson and Schiöth, 2006).

α-Group prostaglandin branch

This cluster contains the prostaglandin and closely related thromboxane receptors together with several orphan receptors (Fredriksson and Schiöth, 2006).

α-Group amine branch

The amine receptor cluster contains the serotonin (5HT), dopamine, histamine, muscarinic, and adrenergic families. The term amine refers to the fact that all known ligands for these receptors are structurally related amine molecules consisting of a single aromatic ring. It is this branch of the α-group for which most GPCR structures have been solved (two adrenergic receptors, one dopamine receptor and one histamine receptor).

α-Group opsin branch

This cluster includes the rhodopsin receptor, the opsins for short, medium and long wavelength photons, peropsin, and encephalopsin, among others. Three PDB structures from this branch are currently available (bovine and squid rhodopsin, bovine opsin).

α-Group melatonin branch

There are two receptors in this branch which bind melatonin, a hormone involved in regulating the biological clock of organisms.

α-Group MECA branch

As the abbreviation suggests, this branch includes four types of related receptors whose endogenous ligands are, interestingly, structurally diverse (Fredriksson and Schiöth, 2006). Specifically, melanocortin receptors bind peptides (melanocortins), endothelial differentiation sphingolipid receptors bind lipids, cannabinoid receptors bind anandamide, while adenosine receptors bind purines (adenosine). A single receptor from this branch has been structurally solved (the human A2A adenosine receptor).

Rhodopsin β-Group

This group consists of several receptors, all of which bind peptides, including the neuropeptide FF, orexin, tachykinin, and cholecystokinin families, amounting to slightly above 50 members.

Rhodopsin γ-Group

This group comprises three main clusters, namely SOG (somatostatin, opioid, and galanin receptors), MCH (melanin-concentrating hormone receptors), and chemokine receptors. The chemokine cluster is by far the largest of these, containing C-C receptors (β-chemokine) and C-X-C receptors (α-chemokine). Together, the three clusters comprise a total of 72 characterized members (Fredriksson and Schiöth, 2006). A human α-chemokine receptor crystal structure (chemokine type 4) has recently been solved.

Rhodopsin δ-Group

The final group of the rhodopsin family is the MAS oncogene receptor cluster, which consists of 63 known receptors that bind small peptides.

Adhesion family

The adhesion family is the second largest GPCR family in humans, consisting of 67 members (Fredriksson and Schiöth, 2006). Members of the Adhesion family have structural features that separate them from the other GRAFS families, the most notable of which are their long N-termini enriched with serine and threonine residues that create glycosylation sites. These termini are believed to bind various proteins that promote cell-to-cell and cell-to-matrix interactions, hence the term “adhesion”.

Frizzled/Taste 2 family

This family includes two distinct clusters (frizzled and taste 2 receptors), which despite extensive sequence dissimilarity share several consensus regions and are grouped together (Fredriksson et al., 2003). Perhaps the most salient structural feature of the frizzled receptors is a cysteine-rich binding domain, which binds several ligands, notably proteins that activate the Wnt signaling pathway, a pathway important in embryonic development and cell-cell communication (Logan and Nusse, 2004; Lie et al., 2005). Genes encoding Frizzled receptors are hence essential for development, tissue and cell polarity, formation of neural synapses, and regulation of proliferation. Mutations in this family have been linked to familial exudative vitreoretinopathy (Huang and Klein, 2004), a genetic disorder affecting angiogenesis in the retina that commonly leads to visual impairment. The taste receptors type 2 (T2R) are encoded by intronless genes and have very short extracellular N-termini. They are known for recognizing bitter substances, and the majority of T2Rs remain uncharacterized orphan receptors. Together, the Frizzled/Taste2 family (often abbreviated to Frizzled) consists of 62 receptors, making it the third largest GPCR family.

Secretin family

The secretin family is the smallest and consists of 15 receptors that bind large hormones involved in paracrine signaling. The transmembrane regions of these receptors share putative sequence similarity with the Adhesion family (and under a previous classification scheme were in the same cluster), although the most conserved regions are cysteine-enriched; the N-terminal regions are unique in this family.

A truncated overview of the hierarchical structure of the GRAFS scheme is illustrated in figure 1.2, showing that throughout the entire GPCR superfamily, structures have been solved only for the α group of the rhodopsin family. Phylogenetic branches with experimentally solved structures are denoted in bold and italicized print.

Figure 1.2. Hierarchy of the GRAFS Classification Scheme

G  Glutamate: MG-R, GABA-R, Calcium-R
R  Rhodopsin
     α: Prostaglandin, AMINE*, OPSIN, Melatonin, MECA
     β: Neuropeptide D/FF, Orexin, Tachykinin, Cholecystokinin
     γ: SOG, MCH, CHEMOKINE
     δ: MAS oncogene
A  Adhesion
F  Frizzled
S  Secretin

*bold capitals indicate clusters for which experimental structure(s) exist (see table 1.2).

Despite the current prominence of the GRAFS classification system, an earlier scheme that is still in use is the A-F nomenclature (classes A, B, C, D, E and F), which uses roman numerals to denote subclasses (Attwood and Findlay, 1994; Kolakowski, 1994). The ordering of the GPCR superfamily in this system is summarized as follows:

A-F GPCR Classification Scheme

Class A: Rhodopsin-like family
Class B: Secretin-like family
Class C: Metabotropic glutamate family
Class D: Fungal pheromone family
Class E: cyclic AMP receptor family
Class F: Frizzled family

Class A, corresponding to the Rhodopsin class in the GRAFS system, is the largest and best characterized. Certain classes of the A-F system do not exist in humans; most notably, classes D and E are not found in humans, nor are certain subfamilies of class A (such as the invertebrate opsin receptors) (Fredriksson et al., 2003). Table 1.1 summarizes the correspondence between the two classification systems.

Table 1.1 Comparing the GRAFS and A-F Classification Systems

A/B/C/D/E/F                                GRAFS
Class A (Rhodopsin-like family)            R (Rhodopsin family)
Class B (Secretin-like family)             S (Secretin family)
Class C (Metabotropic Glutamate family)    G (Glutamate family)
Class D (Fungal Pheromone family)          –
Class E (cyclic AMP family)                –
Class F (Frizzled family)                  F (Frizzled/Taste 2* family)
Subfamily BII                              A (Adhesion family)

*usually abbreviated as Frizzled

As shown in table 1.1, the adhesion family was considered as belonging to the secretin-like family under the A-F system, although the two have been distinguished as separate phylogenetic clusters in the GRAFS classification. Furthermore, the fungal pheromone and cyclic AMP families have not been clustered into the GRAFS scheme, and hence are classified as other 7TM receptors. The GRAFS system was preferred as a reference for the work carried out in this study, as it is more recent and more commonly used in the literature.
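For scripted work that must reconcile annotations written under either scheme, the correspondence in table 1.1 can be held in a small lookup structure. The Python sketch below is illustrative only: the dictionary name, the key spelling "BII" and the helper function are hypothetical, and the entries simply restate table 1.1.

```python
# Correspondence between A-F classes and GRAFS families, restating table 1.1.
# None marks classes that have no counterpart in the GRAFS scheme.
AF_TO_GRAFS = {
    "A": "Rhodopsin",
    "B": "Secretin",
    "BII": "Adhesion",        # subfamily BII of class B
    "C": "Glutamate",
    "D": None,                # fungal pheromone family
    "E": None,                # cyclic AMP family
    "F": "Frizzled/Taste 2",
}


def grafs_family(af_class):
    """Return the GRAFS family for an A-F class label (case-insensitive)."""
    key = af_class.strip().upper()
    if key not in AF_TO_GRAFS:
        raise KeyError("unknown A-F class: " + af_class)
    # Classes outside GRAFS are described in the text as other 7TM receptors.
    return AF_TO_GRAFS[key] or "other 7TM receptors"


print(grafs_family("bii"))
```

Keeping subfamily BII as its own key mirrors the table, where it is the only subclass that maps to a different GRAFS family than its parent class.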

1.3.2 Structural modularity of GPCRs

GPCRs comprise a scaffold of seven transmembrane α-helices (commonly referred to as the 7TM scaffold), connected by three extracellular loops (ECL1 – 3) and three intracellular loops (ICL1 – 3). An external N-terminus is found flanking the extracellular loops, while an interfacial helix (commonly referred to as helix VIII) as well as a C-terminus is present in the intracellular region adjacent to ICL1 – 3. The general architecture of GPCRs and their modules is illustrated below in figure 1.3.

Figure 1.3. Generic Architecture of GPCRs

Generic architecture of GPCRs, separated into ligand-binding module and downstream signaling module. The 7TM scaffold is divided into TM-EC (facing extracellular matrix) and TM-IC (facing cytosol). Major structural features are displayed on human dopamine type 3 (PDB ID 3PBL) – blue patches pinpoint conserved motifs while orange patches indicate kink-inducing proline residues. Figure adapted from (Katritch et al., 2011).

The 7TM scaffold has long been recognized as the most highly conserved structural domain of GPCRs. Given the immense diversity often found in all other structural modules, it serves as the defining signature of GPCRs. The transmembrane domain possesses characteristic hydrophobic patterns as well as conserved functional motifs, with each membrane-spanning helix having an average length of 20-25 residues. This region is represented in figure 1.3 as having two subdomains (the extracellular- and intracellular-facing regions, TM-EC and TM-IC respectively) with varying levels of sequence and structural diversity. The overlap between the 7TM and ECL/ICL regions when attempting to compartmentalize GPCRs using functional criteria, such as ligand binding or downstream signaling, hints that the transmembrane domain is more than a scaffold. Indeed, despite being far more conserved than the ICLs and ECLs, which have evolved to recognize specific downstream effectors and ligands, the 7TM has retained functional significance throughout evolution. Examples of superfamily-wide conserved motifs (blue segments in figure 1.3) include (i) the DRY motif (helix III), (ii) the WxP motif (helix VI), and (iii) the NPxxY motif (helix VII) (Nygaard et al., 2009). The ligand binding pocket itself is usually formed by changes in the helical conformation of the 7TM, meaning that residues involved in the ligand binding process are usually found within helices, facing the ECLs. This is one of the most interesting aspects of the transmembrane domain, as it exemplifies that despite being the molecular fingerprint of GPCRs, the helices have evolved to be flexible enough to form binding pockets for very different molecules.

Furthermore, proline residues (indicated in figure 1.3 as orange spots), which act as “kink inducers” by introducing a local weakening of the helix, since their cyclic side-chains impose a backbone dihedral angle of -75° (Nugent and Jones, 2011), are well conserved throughout GPCRs. Solved crystal structures demonstrate an overall Cα RMSD (root mean squared deviation) of <3 Å in TM helices between any pair of GPCRs, despite local variations such as bulges, kinks and other deviations from canonical α-helices usually associated with proline residues (Katritch et al., 2011). The profile of the loop regions is quite different, as there is remarkable sequence and structural diversity. Structural variation across GPCR families is especially pronounced in the ligand binding modules, comprised of the ECLs and the extracellular-facing 7TM (TM-EC), reflecting the diversity in the geometry of ligands such as photons, ions, lipids, and hormones (Lagerström and Schiöth, 2006). In published structures, the ECL region shows variations of up to 7 Å; this diversity is reflected at the sequence level as well, where only 6% of residues are exactly conserved (Katritch et al., 2011). Furthermore, high length variation and secondary structure element diversity (particularly for ECL2) are evident among existing crystal structures; perhaps the only well conserved structural feature of the ECL region is the disulfide bridge connecting ECL2 with the extracellular tip of helix III. The extracellular N-terminal sequence is also highly variable in length, ranging from fewer than ten amino acids up to several hundred. The downstream signaling module, comprised of the ICLs and the intracellular-facing 7TM, has less structural diversity than its ligand binding counterpart. This domain has not been under evolutionary pressure to adapt to such a large diversity of binding partners, since downstream effectors (G proteins and arrestins) and regulatory proteins are the main cytosolic elements that GPCRs are known to interact with. Although this module is more conserved than the ECL region, dramatic sequence and structural variations are present. The most salient of these is ICL3, whose length ranges from as few as five residues to several hundred and which is associated with receptor selectivity for G proteins.
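The Cα RMSD figure quoted above for TM helices is the square root of the mean squared distance between equivalent Cα atoms in two structures. The Python sketch below shows only that final arithmetic; it assumes the two coordinate lists are already superposed and paired residue by residue, which in practice requires a prior structural alignment. The function name and the example coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD in Å between two equally long lists of (x, y, z) Cα coordinates.

    Assumes the structures are already superposed and that position i in
    one list is the structural equivalent of position i in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the second atom is displaced by 2 Å along z.
helix_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
helix_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 2.0)]
print(round(ca_rmsd(helix_a, helix_b), 3))
```

Because the squared distances are averaged before the square root is taken, a few large local deviations such as proline-induced kinks can coexist with a low overall value.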

Details of structural and mechanistic aspects of GPCR biology are limited on account of the scarcity of successfully solved structures. At present, only 13 GPCRs have been structurally characterized, with the first breakthrough occurring in 2000. Within the past decade, the pace of structure characterization has lagged far behind the sequence-based methods used to discover previously unknown members of the GPCR superfamily. Table 1.2 summarizes the GPCRs for which published crystal structures currently exist.

Table 1.2. Solved GPCR Structures

Receptor                       PDB ID*   Year   Publication
bovine rhodopsin               1F88      2000   Palczewski et al.
squid rhodopsin                2ZIY      2008   Shimamura et al.
human β2-adrenergic            2RH1      2007   Cherezov et al.
turkey β1-adrenergic           2VT4      2008   Warne et al.
human A2A adenosine            3EML      2008   Jaakola et al.
human chemokine type 4         3OE0      2010   Wu et al.
human dopamine type 3          3PBL      2010   Chien et al.
human histamine 1              3RZE      2011   Shimamura et al.
human M3 muscarinic ACh        4DAJ      2012   Kruse et al.
human kappa opioid             4DJH      2012   Wu et al.
human nociceptin/orphanin      4EA3      2012   Thompson et al.
rat neurotensin type 1         4GRV      2012   White et al.
human protease-activated       3VW7      2012   Zhang et al.

*NB. Most published structures have been solved at multiple resolutions (measured in Å) and in complexes with various inverse agonists. Receptors that have been solved recently and for which there is only one published structure are indicated in italics. The reported PDB code in all cases refers to the first chronologically published structure of each receptor.

The current status of GPCR structural coverage is illustrated in the form of a tree in figure 1.4. High resolution structures of GPCRs that have been published are indicated with circles, and highlighted areas show close homologs of the solved crystal structures (>35% sequence identity).


Figure 1.4. GPCR Family Structural Coverage

This figure places currently published crystal structures in the context of the GRAFS classification scheme, covering those published up to 2011. Figure adapted from (Katritch et al., 2011).

As displayed, all currently solved GPCR structures are members of the rhodopsin family and belong to either the α-rhodopsins or γ-rhodopsins. Structural coverage remains scarce on account of the technical challenges associated with crystallizing insoluble membrane proteins. In spite of the current limitations of X-ray crystallography, solved GPCR structures have yielded important insights into structural patterns and functional mechanisms.

1.3.3 GPCR Functional Mechanisms

Although highly complex and involved, the general mechanism of GPCR signaling has been understood as a result of successful crystallization studies. The process involves the following steps: (i) ligand binding, (ii) signal propagation and (iii) G protein activation. Before considering this process further, a brief consideration of G proteins is required.


G (guanine nucleotide-binding) proteins are a family of molecular switches that transmit extracellular chemical signals to the cell’s interior. Their activity is regulated by their ability to bind GTP (guanosine triphosphate) and hydrolyze it to GDP (guanosine diphosphate); hence they are members of the larger group of enzymes known as GTPases. G proteins can be divided into two classes, monomeric and heterotrimeric, sometimes termed small and large G proteins, respectively. Although they share common functional mechanisms, only heterotrimeric G proteins are activated by GPCRs. They are made up of three subunits, termed alpha (Gα), beta (Gβ), and gamma (Gγ), with the latter two being bound to each other in what is known as the beta-gamma complex (Gβγ). Although only a limited number of Gβγ dimers are known to exist, a multitude of Gα monomeric subunits are recognized. These can be grouped into four main classes (Gαs, Gαq, Gαi/o, and Gα12), with each class having numerous subtypes (Kaziro et al., 1994; Watson and Arkinstall, 1994). All three subunits are associated in the inactive state of a G protein. The Gα subunit, which possesses intrinsic GTPase activity, carries a bound molecule of GDP in its inactive state; this is exchanged for GTP upon GPCR-mediated activation. GPCRs are intracellularly coupled to G proteins and transmit extracellular signals through them.

The GPCR functional cycle progresses as follows: (i) A receptor binds its endogenous ligand and subsequently undergoes a conformational change that is transmitted to the coupled G protein in its resting state (GDP-associated Gα subunit). (ii) The G protein is in turn activated by the exchange of GDP for GTP on the Gα subunit, followed by its dissociation from the Gβγ dimer. (iii) The now free monomeric Gα finds its target effector protein, whose activity it modulates via binding. This initiates an intracellular signal cascade, a pathway in which numerous proteins are activated to achieve some biochemical change within the cell. Among the several signal transduction pathways modulated by GPCRs, notable examples include the opening of potassium channels and the production of cyclic AMP (mediated by the coupling of G proteins to the enzyme adenylyl cyclase). Cyclic AMP (cyclic adenosine monophosphate) is a second messenger involved in regulating several important biological processes. The effects of these highly complex signal transduction pathways on cellular function are diverse and include protein phosphorylation, gene transcription, and membrane depolarization. (iv) The final step involves Gα unbinding from its effector via the hydrolysis of GTP to GDP (enabled by the innate GTPase activity of α subunits) and re-association with the Gβγ dimer, effectively reinstating the inactive form of the G protein and switching off the signal. Figure 1.5 illustrates this process, highlighting the changes experienced by the membrane-spanning GPCR molecule (1.5A) and by its intracellular G protein partner (1.5B).

Figure 1.5. Overview of the GPCR Functional Mechanism

A. GPCR-mediated G Protein Activation


B. G Protein-mediated Signal Propagation

(A) Upon ligand binding, the receptor undergoes a conformational change that is transmitted to its G protein partner (a). This causes the α subunit to exchange GDP for GTP and then dissociate from the βγ dimer. The α subunit then searches for its target protein and binds to it, propagating the extracellular signal to an intracellular cascade. Following hydrolysis of GTP to GDP, Gα leaves the effector and re-associates with Gβγ, terminating the signal. The G protein then regains its inactive state, re-associating with a GPCR (b). Figure adapted from (http://watcut.uwaterloo.ca/). (B) Block diagram of the G protein functional cycle. Activation of the receptor causes the α subunit of the coupled G protein to exchange GDP for GTP. This leads to G protein activation and dissociation into the Gα monomer and the Gβγ dimer. Once Gα has bound its effector protein, the signal is switched off by the innate GTPase activity of Gα, which hydrolyzes GTP to GDP and allows Gα and Gβγ to re-associate, returning the G protein to its inactive state.

1.3.4 GPCRs as Drug Targets

The prominence of GPCRs in regulating signaling pathways that control diverse physiological processes renders them highly significant from a pharmacological perspective. Indeed, they are the most popular therapeutic target class, with an estimated 40 to 50% of clinically approved and marketed drugs acting directly on GPCRs or through GPCR-associated mechanisms (Flower, 1999; Robas et al., 2003). Ligands that bind to receptors can be divided into three types on the basis of their binding effects: agonists, inverse agonists, and antagonists. The term agonist describes molecules that mimic the behavior of the endogenous ligand; their binding stabilizes the receptor in an active conformation. The term inverse agonist applies to substances that bind to and stabilize the inactive form of a GPCR, while antagonists (also called inhibitors) compete with the endogenous ligand for the ligand binding pocket. Antagonists hence inhibit GPCR-mediated signaling by preventing the agonist-induced conformational change that activates the coupled G protein.

Efforts to use GPCRs as drug targets have yielded beneficial effects on a wide array of health disorders. The discovery of propranolol, an antagonist of β adrenergic receptors, led to the 1988 Nobel Prize in Physiology or Medicine (awarded to Sir James Black). Propranolol and its derivatives are widely used inhibitors known as β-blockers, and are employed in the treatment of cardiovascular conditions. Pharmaceutical compounds with agonist properties have also been employed, activating dopamine and serotonin receptors to manage Parkinson’s disease, migraine, and other neurological disorders. Despite the success of developing drugs to target GPCRs, such drugs often suffer from a lack of specificity, leading to undesired side effects. Deeper structural and functional understanding of GPCRs is therefore crucial to improving the efficacy and selectivity of pharmacologic agents.

1.4 Electrostatics

Electrostatics is the branch of physics concerned with the phenomena and properties of stationary electric charges. Electric charge is one of the two main sources of potential energy (the other being gravity), which in turn is one of several forms of energy and is associated with the force acting on a body. Charge is a fundamental physical concept and as such cannot be readily defined in terms of other concepts; instead it is assigned an operational definition, i.e. it is the property of matter that causes a body to experience a force when near another electrically charged body. Two bodies carrying charges will therefore exert a force on each other; this force is repulsive for like charges and attractive for opposite charges. The force experienced by each body can be described using the notion of an electric field, i.e. the region of space surrounding a charged body in which another charge experiences an electrical force. These ideas are illustrated in the figure below.


Figure 1.6. Electric field created by two oppositely charged bodies

An electric field permeates the space around a positive and a negative point charge. Opposite charges attract each other. Figure adapted from (http://www.phy.ntnu.edu.tw).

A particle can have positive, negative or zero charge (termed neutral); protons are subatomic particles with positive charge, electrons possess negative charge, and neutrons are uncharged. All atoms are composed of protons, neutrons and electrons (with the exception of hydrogen, whose atom contains a single proton and a single electron). Atoms are neutral, balancing their overall charge by possessing an equal number of protons and electrons, and most of an atom's properties, including its interactions with other atoms, derive from its constituent electrons. When an atom loses or gains an electron, it acquires a positive or negative charge, respectively; an atom or molecule with an unequal number of protons and electrons, and hence a net positive or negative charge, is termed an ion. Proteins, being essentially very large polymeric molecules composed of thousands of atoms, can be ionic species if they possess an overall nonzero charge. The ionic nature of macromolecules such as proteins means that they possess electrostatic properties, which play key roles in their propensity to interact with substrates (enzymes), ligands (receptors), and other proteins to form functional complexes. The electrostatic properties of proteins are conferred by the ionic properties of their constituent amino acids, which have extensive roles in mediating interactions. Given the accepted biochemical dogma that primary sequence dictates tertiary structure, which in turn determines the function of a protein molecule, it follows that aspects of sequence and structure are key parts of computational studies aimed at better understanding protein function.


All naturally occurring proteins are composed of the same set of 20 amino acids, linked together by covalent bonds to form polymers known as polypeptides. Each of these amino acids consists of a main backbone made up of an amino and a carboxyl functional group, and a side-chain (R group) unique to each residue. The 20 different side-chains give each amino acid different properties, which are commonly classified according to their capacity to interact with water, the universal solvent. On the basis of this criterion, four classes of amino acids can be distinguished: (i) polar, (ii) nonpolar, (iii) acidic, and (iv) basic. Certain amino acids also have the ability to become ionized in solution and can therefore possess electrically charged side-chains (termed ionizable or titratable groups). Amino acids with ionizable groups were the focus of the current work, as they are the mediators of electrostatic interactions within and between proteins. Of the 20 amino acids, seven are known to have ionizable side-chains, with five of those being in ionized states at neutral pH: lysine, arginine, histidine, aspartic acid, and glutamic acid (K, R, H, D, and E respectively). Lysine and arginine are basic amino acids with positively charged groups (histidine is generally only partially charged at neutral pH); aspartic and glutamic acid are acidic amino acids with negatively charged groups and thus are often referred to as aspartate and glutamate, respectively. Cysteine and tyrosine also have ionizable side-chains but are usually uncharged at neutral pH. In order to computationally determine the functionality of ionizable groups, some physical quantity must be used to detect and measure changes in their charge state. The quantities most commonly used to measure the electrostatic properties of amino acids (i.e. to characterize their charge states and how they vary) and of their polymeric forms (polypeptides and proteins) are pH and pKa, which are discussed in the following section.

1.4.1 pH and pKa

pH and pKa are concepts that describe the propensities of molecules in aqueous solution to donate or accept protons. Species that tend to lose protons (deprotonation) are termed acids while those that accept protons (protonation) are called bases. Cellular and physiological environments are almost exclusively aqueous, and consequently these concepts are very important in describing electrostatic interactions. pH (potential of hydrogen) is one of the most important quantities in biochemistry and physiology, and measures the concentration and activity of solvated hydronium ions (H3O+, commonly referred to as hydrogen ions and abbreviated as H+) in a solution. Solvation refers to any process in which a solvent (usually water) interacts with a solute (small molecules, proteins, ionic species, etc.). Thermodynamic interactions between solvent and solute will usually have certain effects on the solute, such as the dissociation of a salt into its component ions. In other words, pH is a measure of the acidity or basicity of a solution and is mathematically expressed as the negative logarithm of hydronium ion concentration:

\[ \mathrm{pH} = -\log_{10}\left([\mathrm{H_3O^+}]\right) \]

This relationship means that pH is a function of two variables: (i) the concentration of the solution (given two solutions of the same acid, the more concentrated one will be more acidic on account of having more free H3O+ ions and therefore a lower pH) and (ii) the identity of the acid (given two equally concentrated acids, the solution of the stronger acid will have a lower pH because it is more dissociated than the weaker acid). The pH scale expresses acidity on a logarithmic scale that conventionally ranges from 0 (extremely acidic) to 14 (extremely alkaline), with 7 being neutral (neither acidic nor basic). The majority of bodily fluids have a pH near neutral, ranging from 6.5 to 8.0, with that of blood being regulated to remain around a value of 7.4. The process of dissociation can be expressed by the chemical equation

\[ \mathrm{AH} + \mathrm{H_2O} \rightleftharpoons \mathrm{A^-} + \mathrm{H_3O^+} \]

In this reaction equation, AH is a generic acid and A- is its conjugate base, formed by the deprotonation of AH. At equilibrium, the forward and reverse reactions proceed at equal rates, so that there is no net change in the concentrations of reactants and products; the ratio of these concentrations defines an equilibrium constant, the acidity constant:

\[ K_a = \frac{[\mathrm{H_3O^+}]\,[\mathrm{A^-}]}{[\mathrm{AH}]} \]

The equilibrium constant formally includes the concentration of water, but in solutions where acidic compounds are sufficiently dilute, such as the intracellular environment, the concentration of water can be treated as constant. Like hydronium ion concentration, the acidity constant can also be expressed in logarithmic form as follows:

\[ \mathrm{p}K_a = -\log_{10}(K_a) \]

pKa, the negative logarithm of the acid dissociation constant, provides a quantitative measure of the strength of an acid in solution. Because of the negative sign, a low pKa indicates a high Ka value and therefore a strong acid, and vice versa. pH and pKa are therefore related concepts; the former measures the acidity of a solution while the latter measures the tendency of an acid to dissociate. There is an important mathematical relationship between the two quantities: pKa is the pH value at which an acid is exactly half-dissociated. This is demonstrated by rearranging the equilibrium equation:

\[ [\mathrm{H_3O^+}] = K_a \times \frac{[\mathrm{AH}]}{[\mathrm{A^-}]} \]

\[ \mathrm{pH} = \mathrm{p}K_a - \log_{10}\frac{[\mathrm{AH}]}{[\mathrm{A^-}]} \]

so that when [AH] = [A-], it holds that pH = pKa. This relationship is important because it dictates the solubility properties and charge states of amino acids. In general, at a pH above the pKa an acid exists mostly as A- in water and will therefore tend to be soluble, whereas at a pH below the pKa the acid exists mostly as AH and will tend to be less soluble; the opposite is true for basic compounds. These acid-base dynamics apply to proteins in cellular environments, as they are polyionic macromolecules with amphoteric units. Amino acids are amphoteric substances (and exist as zwitterions) because they react with both acids and bases on account of their α-amino (-NH2) and α-carboxyl (-COOH) groups. All 20 amino acids therefore have two dissociable protons, but only amino acids possessing acidic or basic side-chains have more than two. When an amino acid is incorporated into a polypeptide, the charges on the α-amino and α-carboxyl groups are lost on peptide bond formation (except at the chain termini); hence the complex ionic properties of a protein are conferred by the five amino acids whose side-chains can possess a charged state under certain pH conditions. The charge state dynamics and their pH dependence for each ionizable side-chain are reviewed in the following section.
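The relationships above can be made concrete with a short numerical sketch. The Python functions below are illustrative only and are not part of the methods used in this work: `weak_acid_ph` and `fraction_deprotonated` are hypothetical helper names, the acid is assumed to be monoprotic, activities are approximated by concentrations, and the autoionization of water is ignored. The pKa of 4.76 is the textbook value for acetic acid, used purely as an example.

```python
import math

def weak_acid_ph(pka, concentration):
    """pH of a monoprotic weak acid solution (concentration in mol/L).

    Assumption: water autoionization is ignored, so the estimate is
    only reasonable for solutions that are clearly acidic.
    """
    ka = 10.0 ** (-pka)
    # Ka = x^2 / (C - x), with x = [H3O+]; take the positive root.
    h3o = (-ka + math.sqrt(ka * ka + 4.0 * ka * concentration)) / 2.0
    return -math.log10(h3o)

def fraction_deprotonated(ph, pka):
    """Fraction of the acid present as A-, from pH = pKa - log10([AH]/[A-])."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

if __name__ == "__main__":
    pka = 4.76  # acetic acid, used as an example only
    for conc in (0.1, 0.01):
        print(f"C = {conc} M -> pH = {weak_acid_ph(pka, conc):.2f}")
    for ph in (2.76, 4.76, 6.76):
        print(f"pH {ph}: fraction as A- = {fraction_deprotonated(ph, pka):.3f}")
```

The first function shows the dependence of pH on both concentration and acid strength; the second shows that the acid is half-dissociated when pH equals pKa and shifts by roughly a factor of ten in the [A-]/[AH] ratio for each pH unit.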

1.4.2 pH dependence of charge state for ionizable amino acids

Among the five amino acids with side-chains that are generally charged at neutral pH, two are negatively charged (aspartic acid and glutamic acid) and three are positively charged (lysine, arginine, and histidine). Despite having slightly different R groups, aspartic and glutamic acid both have a carboxylic group (-COOH) on their respective side-chains. At a pH above their pKa, the carboxylic group is deprotonated (loses a proton) and becomes negatively charged (-COO-); because their side-chains donate a proton in this way, these residues are classed as acidic. At a pH below their pKa, the side-chains of both aspartic and glutamic acid are predominantly uncharged. The deprotonation occurs as shown in figure 1.7:

Figure 1.7. Deprotonation of carboxylic side-chain group (Asp, Glu)*

*Diagram is illustrational; no structural aspects are included

Lysine, arginine, and histidine have markedly different side-chains, but each carries a nitrogen-containing basic group (an amino group, -NH2, in the case of lysine). At a pH below their pKa, this group is protonated (gains a proton) and becomes positively charged; because their side-chains accept a proton in this way, these residues are classed as basic. At a pH above their pKa, all three side-chains are predominantly uncharged. The protonation occurs as shown in figure 1.8:

Figure 1.8. Protonation of amino side-chain group (Lys)*

*Diagram is illustrational; no structural aspects are included. Protonation of the lysine side-chain is shown; arginine and histidine have different ionizable groups on their side-chains.


In the current context, the pKa is the pH value at which half of the population of a given carboxylic or amino side-chain group is charged. The pKa values of the five ionizable side-chains, as well as an overview of the pH-dependent charge state dynamics, are summarized in tables 1.3 and 1.4 below:

Table 1.3. pKa values for charged side-chains

Amino Acid             pKa of side-chain group*
Aspartic Acid (Asp)    3.9
Glutamic Acid (Glu)    4.2
Lysine (Lys)           10.5
Arginine (Arg)         12.5
Histidine (His)        6.0

*pKa values adapted from (www.imgt.org)

Table 1.4. pH-dependent variation of charge states

Amino Acid             Ionizable side-chain group     pH / pKa Relationship   Charge State
Aspartic Acid (Asp)    Carboxyl (-COOH)               pH > pKa                Negative (-)
                                                      pH < pKa                Neutral*
Glutamic Acid (Glu)    Carboxyl (-COOH)               pH > pKa                Negative (-)
                                                      pH < pKa                Neutral
Lysine (Lys)           Amino (-NH2)                   pH > pKa                Neutral
                                                      pH < pKa                Positive (+)
Arginine (Arg)         Guanidino -HNC(NH2)2           pH > pKa                Neutral
                                                      pH < pKa                Positive (+)
Histidine (His)        Imidazole -(CH)2N(NH)CH        pH > pKa                Neutral
                                                      pH < pKa                Positive (+)

* The pH at which a molecule carries no net charge is termed the isoelectric point (pI)
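As a rough illustration of how tables 1.3 and 1.4 translate into numbers, the following Python sketch estimates the average charge on each ionizable side-chain at a given pH. It is a simplified model and not the pKa calculation method used in this work: the function names are hypothetical, the pKa values are the reference values from table 1.3 (pKa values inside a folded protein can shift substantially), and chain termini, cysteine and tyrosine are deliberately ignored.

```python
# Reference side-chain pKa values from table 1.3 (assumed unshifted by
# the protein environment, which is a strong simplification).
ACIDIC_PKA = {"D": 3.9, "E": 4.2}
BASIC_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}

def side_chain_charge(residue, ph):
    """Average side-chain charge of one residue (one-letter code) at a given pH."""
    if residue in ACIDIC_PKA:
        # Acidic groups are negative when deprotonated (pH > pKa).
        return -1.0 / (1.0 + 10.0 ** (ACIDIC_PKA[residue] - ph))
    if residue in BASIC_PKA:
        # Basic groups are positive when protonated (pH < pKa).
        return 1.0 / (1.0 + 10.0 ** (ph - BASIC_PKA[residue]))
    return 0.0  # all other residues are treated as uncharged

def net_side_chain_charge(sequence, ph):
    """Sum of average side-chain charges over a sequence (termini ignored)."""
    return sum(side_chain_charge(res, ph) for res in sequence.upper())

if __name__ == "__main__":
    for res in "DEKRH":
        print(res, round(side_chain_charge(res, 7.0), 3))
    # Toy sequence chosen for illustration only.
    print(round(net_side_chain_charge("DEKRH", 7.0), 3))
```

At pH 7 the sketch reproduces the qualitative picture in table 1.4: Asp and Glu are almost fully negative, Lys and Arg almost fully positive, and His carries only a small fractional positive charge.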

Electrostatic interactions between proteins and other molecules are based on ionized states of amino acid side-chains; hence changes in their pKa values can be used to probe potential functional relevance of a particular residue or group of residues. In the context of the current work, pKa calculations were used to identify potentially functional charged residues located in the transmembrane domain of GPCRs.

1.4.3 Electrostatics in Protein Interactions

The links between the structure and function of a protein and its electrostatic properties have been explored in numerous studies, some of which have been reviewed by Sinha and Smith-Gill (2002). The complexity of cellular physiology means that very large numbers of macromolecules interact with specific binding partners; these interactions are most commonly mediated through non-covalent forces. Amino acids with ionizable side-chains make up, on average, 29% of the amino acids in proteins (Pace et al., 2009). Although significantly weaker than the covalent bonds commonly found in small molecules, non-covalent interactions such as hydrophobic interactions and van der Waals forces are able to form stable interaction networks and are prominent in ternary protein complexes. Such interactions can be both short-range and long-range. At short distances, electrostatic interactions act in conjunction with other factors such as binding interface orientation, changes in desolvation energy and specific pairwise interactions (Nielsen, 2007). Long-range electrostatic interactions are observed in the binding specificity of proteins that can recognize and discriminate their binding partners in the crowded cellular environment (Schreiber and Keating, 2011).

The distribution of charged residues is significant when considering protein interactions. Charge clusters, which have long been noted to exist within proteins (Karlin, 1995), may consist of either oppositely charged residues or like charged residues. Clusters that consist of opposite charges may mediate the formation of multi-protein complexes, whilst those of like charge are thought to keep certain assemblies apart (Karlin, 1995).

In the context of membrane proteins such as GPCRs, the interactions most appropriately described by electrostatics are receptor-ligand binding and receptor-G protein coupling. It has been observed that human GPCRs contain disordered segments in their N-terminal, C-terminal, and ICL3 sequences (Jaakola et al., 2005). IDPs (intrinsically disordered proteins) and IDP-like regions often play roles in controlling cellular processes such as signaling, transcription regulation and phosphorylation (Iakoucheva et al., 2002; Iakoucheva et al., 2004), implying that such regions often carry functional significance. Low hydrophobicity and high net charge are signatures of intrinsically disordered regions (Dyson and Wright, 2005). Although the focus of this study was on the transmembrane domain of GPCRs, in which disordered regions have not been observed and hydrophobic conditions dominate, the presence of charged residues there is of interest because the 7TM scaffold has roles in both receptor-ligand binding and receptor-G protein coupling.

In terms of environmental factors, pH has large effects on the properties of proteins and hence on aspects of cellular function; it is well known that proteins can become unstable or denature completely under extreme acidic or alkaline pH conditions. Ionizable groups of amino acid side-chains are usually localized on the solvent accessible area (surface) of a protein, on account of the unfavorable solvation energy associated with charged residues that are buried in the protein core. Receptor-ligand interactions depend heavily on the ionization state of side-chains near the binding interface. Interactions based on electrostatic properties have been used to study structural and functional aspects of several protein systems including enzymes (Greaves and Warwicker, 2005), mRNA translation factors (Magee and Warwicker, 2005), and the ECL domain of GPCRs (Hawtin et al., 2006) among others, reflecting the fact that charged regions often provide insight into functional areas. Electrostatic energies are reflected in the pKa values of ionizable groups. These pKa values are highly sensitive to local protein conformation and environment, and thus provide a useful way to study electrostatic interactions in proteins.

1.4.4 Modeling Electrostatics

In macromolecular systems, electrostatic interactions are mostly based on the polarization created by the presence of ionic species and include: (i) polarization due to electron density distribution around atoms, (ii) polarization due to the redistribution of electrons in response to local electrical fields and (iii) polarization due to the reorientation of polar groups in response to an electrical field (Neves-Petersen and Petersen, 2003). The computational modeling of electrostatic interactions and their effects on protein behavior can be broadly divided into two techniques: continuum electrostatics and molecular dynamics. Both techniques face similar challenges in calculating electrostatic effects. These challenges arise from the fact that interactions between ionizable groups take place in the cellular environment, where (i) specific pH levels exist, (ii) specific salt concentrations exist that may alter binding energies, and (iii) water molecules with orientational freedom are present in the surrounding aqueous environment. Although classical electrostatics, in the form of Coulomb’s law, describes the forces that exist between two charged bodies in terms of the quantity of charge they carry, this approach is too simplistic for modeling protein-solvent systems, where the dielectric properties of protein and solvent differ substantially. A dielectric is any substance that behaves as an electrical insulator and becomes polarized upon introduction of an electrical field. The current work employs a combination of two continuum models, which are generally able to account for different ionizable groups whilst providing fast calculations.

1.4.4.1 Finite Difference Poisson Boltzmann

The Poisson-Boltzmann equation (PBE) describes electrostatic interactions between molecules in solutions containing ionic species by treating protein and solvent as continuous media rather than as collections of discrete particles. This model was initially used to model charge effects in enzyme active sites (Klapper et al., 1986) and to calculate pKa values (Bashford and Karplus, 1990). A variation of this approach is the numerical FDPB (Finite Difference Poisson-Boltzmann) method, which calculates charge interactions by discretizing the PBE and incorporating detailed geometric information, i.e. proteins do not have to be approximated as spherical objects (Warwicker and Watson, 1982). The introduction of finite difference techniques in this context provided a representation of the protein-solvent interface in macroscopic models (Alexov et al., 2011). The FDPB method is based on the changes in the dielectric properties of the protein-solvent system that occur due to dipolar reorientation in the presence of an electrical field. Dipoles are separations between positive and negative charges that occur at the atomic level and can be classified into two types, permanent and induced. The former occur when, in a covalent bond, the shared electron pair is permanently pulled closer to the more electronegative atom, while the latter arise when the electron cloud around a covalently bonded molecule shifts to one side. Due to the relative freedom of water molecules in aqueous environments, there is extensive dipolar rotation, which leads to a high dielectric constant (~80 at 298 K). In contrast, the permanent dipoles of the covalently linked protein backbone are far less free to reorient, giving the protein a much smaller dielectric constant (~4 at 298 K). The FDPB model therefore describes a protein with a low dielectric constant immersed in a medium with a much higher dielectric constant. This concept is illustrated in figure 1.9.


Figure 1.9. Dielectric environment of protein-solvent systems

Differences in dielectric constants within the protein-solvent system. The highly mobile ions in water arise from high dipolar rotation while the polyionic nature of a protein arises from immobile ions with low dipolar rotation. Figure 1.9 is meant as an illustration only; geometry and scale of the actual molecules are ignored for simplicity.
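To give a feel for why the difference in dielectric constant matters, the sketch below evaluates the Coulomb interaction energy of two point charges in a uniform medium of dielectric constant 4 (protein-like) and 80 (water-like). This is a back-of-the-envelope illustration under the assumption of a single uniform dielectric, which is exactly the simplification that the FDPB method is designed to avoid; the function name and the 4 Å separation are hypothetical choices made for the example.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
AVOGADRO = 6.02214076e23     # 1/mol

def coulomb_energy_kj_per_mol(q1, q2, r_angstrom, eps_r):
    """Coulomb energy of two point charges (in units of e) in a uniform dielectric."""
    r_m = r_angstrom * 1e-10
    energy_j = (q1 * q2 * E_CHARGE ** 2) / (4.0 * math.pi * EPS0 * eps_r * r_m)
    return energy_j * AVOGADRO / 1000.0

if __name__ == "__main__":
    # A +1/-1 charge pair 4 Angstrom apart (illustrative geometry).
    in_protein = coulomb_energy_kj_per_mol(+1, -1, 4.0, eps_r=4.0)
    in_water = coulomb_energy_kj_per_mol(+1, -1, 4.0, eps_r=80.0)
    print(f"eps = 4:  {in_protein:.1f} kJ/mol")
    print(f"eps = 80: {in_water:.1f} kJ/mol")
```

Under these assumptions the same charge pair interacts twenty times more strongly in the low dielectric medium than in water, which is why the location of an ionizable group relative to the protein-solvent boundary has such a large effect on its electrostatics.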

The calculation of charge interactions within the protein-solvent system is done by mapping the system onto a Cartesian grid in order to subdivide the domain in which the PBE is numerically solved.
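The idea of solving the equation on a grid can be sketched in one dimension. The Python example below is a toy, not the FDPB implementation used in this work: it assumes a uniform dielectric, the linearized form of the equation (phi'' - kappa^2 phi = -rho/eps), zero potential at both boundaries, arbitrary units, and simple Gauss-Seidel relaxation. All names are hypothetical. A production solver would work on a three-dimensional grid with a position-dependent dielectric.

```python
def solve_linearized_pb_1d(rho, eps, kappa, h, n_iter=5000):
    """Relax phi'' - kappa^2 * phi = -rho / eps on a uniform 1D grid.

    rho   : list of charge densities at the grid points
    eps   : uniform dielectric constant (assumption: no protein/solvent boundary)
    kappa : inverse screening length; kappa = 0 gives the Poisson equation
    h     : grid spacing
    Boundary values phi[0] and phi[-1] are held at zero.
    """
    n = len(rho)
    phi = [0.0] * n
    for _ in range(n_iter):
        for i in range(1, n - 1):
            # Central difference: (phi[i-1] - 2 phi[i] + phi[i+1]) / h^2
            phi[i] = (phi[i - 1] + phi[i + 1] + h * h * rho[i] / eps) / (
                2.0 + (kappa * h) ** 2
            )
    return phi

if __name__ == "__main__":
    n = 21
    h = 1.0 / (n - 1)
    rho = [1.0] * n  # uniform charge density, arbitrary units
    unscreened = solve_linearized_pb_1d(rho, eps=1.0, kappa=0.0, h=h)
    screened = solve_linearized_pb_1d(rho, eps=1.0, kappa=5.0, h=h)
    mid = n // 2
    print("mid-point potential, kappa = 0:", round(unscreened[mid], 4))
    print("mid-point potential, kappa = 5:", round(screened[mid], 4))
```

With kappa set to zero and a uniform charge density, the numerical result can be checked against the analytical solution x(1 - x)/2, which the central-difference scheme reproduces at the grid points. A nonzero kappa lowers the potential, mimicking the screening effect of mobile ions.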

1.4.4.2 Debye-Hückel Theory

A second method for modeling electrostatics is based on the Debye-Hückel equation, which can be used to represent the average charge-charge environments at protein surfaces (Warwicker, 2011). The method is essentially a simplified Poisson-Boltzmann model that accounts for the non-ideality of electrolyte solutions, especially at low concentrations. The activity of ions in solution is relatively large compared to that of neutral compounds. Although an electrolyte dissolved in water tends to strengthen the solvation of charges, in the context of the cellular environment this effect is minimal compared to the powerful solvating effects of water itself (Gilson, 2006). For a point charge in solution, the redistribution of mobile ions can be described through a screened potential φ(r), defined as:

\[ \varphi(r) = \frac{q}{4\pi\varepsilon_0 r}\,\exp\!\left(-\frac{r}{r_{Debye}}\right) \]

where q is the charge, r is the distance from it, ε0 is the vacuum permittivity (multiplied, in a solvent, by the relative dielectric constant of the medium), and rDebye is the Debye length, which is the distance within which significant Coulombic contributions between charged residues occur. The Debye length is inversely proportional to the square root of the ionic strength; i.e. it decreases with increasing ionic strength. At physiological ionic strength (~150 mM), the Debye length is approximately 8 Å. The underlying principle is that the Coulombic interactions between ions in solution dominate the non-ideality of an ionic solution to such an extent that other contributions can be neglected. The Debye-Hückel theory postulates that ions in solution are non-uniformly distributed; overall a solution may be electrically neutral, but near any given ion there is a higher concentration of counterions (oppositely charged ions). Over time, a greater number of counterions than like ions accumulate around a given ion, and their directions of motion are uncorrelated. Averaged over time, this movement appears as a spherically shaped haze whose net charge is equal in magnitude to that of the ion but opposite in sign. The spherical distribution of this resulting excess of counterions is termed the ionic atmosphere. When considering the ionic atmosphere around an isolated ion in solution, the concentration of counterions decreases exponentially with distance from the ion. The relationship between Debye length and counterionic concentration is illustrated in figure 1.10 below.


Figure 1.10. Debye Length vs. Counterionic Concentration

[Plot axes: counterionic concentration; Debye length]

The counterionic concentration decreases exponentially with distance from the ion. Physically, the Debye length is the distance beyond which an ion no longer feels the presence of other ions. Image adapted from (www.uvm.edu)
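The dependence of the Debye length on ionic strength, and its effect on the potential around a point charge, can be checked with a few lines of Python. This is a sketch under stated assumptions rather than the procedure used in this work: it uses the standard Debye-Hückel expression for a 1:1 electrolyte, includes a relative dielectric constant `eps_r` (taken as 80 for water, as quoted earlier in this chapter) that is not written explicitly in the text, and uses hypothetical function names.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
AVOGADRO = 6.02214076e23     # 1/mol

def debye_length_angstrom(ionic_strength_molar, eps_r=80.0, temperature=298.0):
    """Debye length (Angstrom) for a given ionic strength in mol/L."""
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/m^3
    kappa_sq = (2.0 * AVOGADRO * E_CHARGE ** 2 * ionic_strength_si) / (
        EPS0 * eps_r * K_B * temperature
    )
    return 1e10 / math.sqrt(kappa_sq)

def screened_potential_volts(q, r_angstrom, debye_angstrom, eps_r=80.0):
    """Screened Coulomb potential (V) at distance r from a charge q (units of e)."""
    r_m = r_angstrom * 1e-10
    coulomb = q * E_CHARGE / (4.0 * math.pi * EPS0 * eps_r * r_m)
    return coulomb * math.exp(-r_angstrom / debye_angstrom)

if __name__ == "__main__":
    for ionic_strength in (0.015, 0.15, 0.6):
        length = debye_length_angstrom(ionic_strength)
        print(f"I = {ionic_strength} M -> Debye length = {length:.1f} Angstrom")
    length = debye_length_angstrom(0.15)
    for r in (4.0, 8.0, 16.0):
        print(f"r = {r} Angstrom -> phi = {screened_potential_volts(1, r, length):.4f} V")
```

With these assumptions the Debye length at 150 mM comes out close to the 8 Å quoted above, and a fourfold increase in ionic strength halves it, consistent with the inverse square-root dependence.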

A combination of the two methods is found in the FD/DH (Finite Difference/Debye-Hückel) method (Warwicker, 2004), which focuses on aqueous phase proteins. The FD/DH method offers fast calculations by distinguishing between buried and solvent accessible side-chains, thereby avoiding extensive conformational sampling of charged group rotamers: a purely buried side-chain is assumed to interact only in an FDPB manner, while water accessible groups interact according to the DH regime, in which a constant dielectric exists.

1.5 Bioinformatics Tools and Methodologies

Bioinformatics is the science of extracting knowledge and establishing correlations from biological data, which often comes in the form of sequence and structure. Hence sequence-based and structure-based methods constitute a large part of the current repertoire of bioinformatics techniques.


1.5.1 Sequence Analysis Methods

One of the cornerstones of bioinformatics is sequence comparison, i.e. deducing whether sequences are actually related to each other. Sequence analysis refers to techniques for extracting information from biological sequences (protein and nucleic acid), and encompasses methodologies for understanding their structural and functional features as well as their evolution. The underlying principle behind sequence analysis is that protein structures and functions are usually more highly conserved than their primary sequences. Extensive similarity in amino acid sequence among proteins indicates that those proteins will usually share common, or similar, structures and functions. The significance of protein sequence similarity lies in allowing the prediction of functions for newly characterized proteins; if a previously unknown protein is found to share significant similarity with one of known function, it is possible to assign the same function to the novel protein. At the center of the sequence analysis field is the sequence alignment method, in which sequences of macromolecules (DNA, RNA, and proteins) are aligned to identify regions that hint at evolutionary conservation and possible structural and functional similarities. Sequence alignment methods generally fall into one of two categories, pairwise alignment and multiple alignment; the former compares two sequences while the latter compares several sequences in one alignment. In the majority of cases, the datasets used in bioinformatics analyses require multiple sequence alignments to compare sequences informatively. Homology is the term used to refer to relationships between proteins that have diverged through an evolutionary process from a common ancestor; homologues are thus proteins that have arisen from a common ancestral protein.
Two types of homologue are recognized – orthologs, which arise through speciation and generally perform the same function in different organisms, and paralogs, which arise through a gene duplication event and perform similar but distinct functions within one organism. The percentage identity between sequences is perhaps the most pivotal criterion for making functional assignments to newly characterized sequences. In general, sequence identity above 50% is considered high, identity between 30% and 50% is considered moderate, and identity in the range of 20 – 30% is known as the “twilight zone” (Doolittle, 1986). Two unrelated proteins chosen at random may share sequence identity within the twilight zone, making homology very difficult to infer, even though distant homologues may share as little as 20% identity with one another. This is not to imply that high sequence similarity will always manifest as structural homology. The methods used to assess sequence similarity and infer homology can be categorized as either global or local sequence alignment methods. Global alignment methods compare sequences along their entire length and try to achieve the best alignment throughout. This renders them applicable to highly similar sequences of approximately equal length, but non-optimal for sequences without substantial similarity, as they tend to miss important biological relationships that may not be evident when considering sequences as a whole. Local alignment methods take a different approach, attempting to find the most similar regions (subsequences) of the sequences being compared rather than searching for the best way to align their entire length. This allows them to find shorter subsequences within the sequences that may be biologically related. Hence, local alignment methods are appropriate for sequences that share some degree of similarity, or for sequences of different lengths.
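The distinction between global and local alignment can be illustrated with a minimal sketch of the two classic dynamic-programming schemes, Needleman-Wunsch (global) and Smith-Waterman (local). The scoring values (match +1, mismatch -1, linear gap -1), the function names, and the example sequences are illustrative assumptions only; practical tools use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning a and b end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell is never allowed to fall below zero, so an alignment
            # can start and end anywhere inside the two sequences.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

For the toy pair "DRYLAIV" and "GGGGDRYGGGG", the shared DRY segment gives a local score of 3, whereas forcing an end-to-end alignment gives a global score of -5, which shows how a conserved motif can be obscured when sequences of unequal length are compared as a whole.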
GPCRs comprise a remarkably diverse superfamily of proteins, with little overall similarity in primary sequence evident across family members. Sequence identity between the five classes is on average less than 25% and is generally, though not always, confined to the transmembrane region; the 7TM helical bundle is hence the most salient GPCR signature. Poor sequence similarity renders the evolutionary history of these proteins ambiguous; it has been suggested that they arose through divergence from a distant ancestor (Josefsson, 1999). Their ubiquitous mediation of G-protein interactions illustrates that they have adopted a common function; furthermore, the presence of a common transmembrane helical architecture suggests that they may have converged from different ancestors. Computational tools and techniques are important for the study of entire ensembles of genes and proteins (the genome and proteome), and especially so for the study of GPCR biology and of membrane proteins more generally. This derives from the difficulties that this important class of proteins poses to the more traditional biochemical and biophysical methods for structural characterization. Being membrane-spanning proteins, GPCRs contain significant numbers of hydrophobic amino acid residues and are highly insoluble; membrane proteins tend to aggregate when removed from the lipid bilayer. Purification of such proteins, which is necessary for analytical techniques such as X-ray crystallography and NMR (nuclear magnetic resonance) spectroscopy, is currently very challenging as they can unfold when the plasma membrane is disrupted.
Progress in solving GPCR structures has been slow compared to the impressive growth of sequence data, as even sophisticated experimental techniques have their shortcomings; NMR spectroscopy requires high concentrations of dissolved protein and X-ray crystallography requires protein crystals to be prepared, which is extremely difficult in the case of GPCRs. GPCR data thus reflect the general trend in biological data, whereby generation of structural data lags far behind that of sequence data. The difficulty of establishing and standardizing successful biochemical protocols for purification renders computational techniques useful alternatives for predicting the structure of unsolved proteins from primary sequence, and for discovering and characterizing novel GPCRs. Bioinformatics methods are hence imperative to build on the collected sequence information and to progress to structural and functional inferences, since experimentally determined GPCR structures remain very scarce.

1.5.1.1 BLAST and PSI-BLAST

The most widely used technique for detecting similarity between biological sequences is the Basic Local Alignment Search Tool, or BLAST (Altschul et al., 1990). The BLAST algorithm has achieved great popularity on account of its favorable trade-off between speed and sensitivity; it detects similarities rapidly without major sacrifices in sensitivity. The underlying principle of the algorithm is that true matches to a query sequence are highly likely to contain somewhere within them a short stretch of identities, or very high scoring matches. These short stretches (termed “query words”) are used to seed a local alignment of maximal length by finding both identical and related words in a target database. In order to determine which words are closely enough related, scoring matrices are used to develop a neighborhood: a ranked list of words whose scores are compared to a threshold (the neighborhood score threshold, T) and either nominated as close relatives or discarded. When the original query word has been aligned with a word from the neighborhood (with a score of at least T), the alignment is extended, tallying a cumulative score resulting from matches, mismatches and gaps (insertions and deletions). The extension process continues until the cumulative score drops below a certain threshold (minimum score threshold, S) on account of mismatches and gaps outweighing matches and substitutions. Once the extension process is terminated, the resulting alignment is termed a high-scoring segment pair, or HSP; using the cumulative score from the alignment in combination with a number of other parameters, an expectation value (E value) is calculated to test the significance of the HSP. For each hit, the E value represents the number of HSPs with an equal or better score that would be expected purely by chance; hence lower values imply greater significance.
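The dependence of the E value on the alignment score can be sketched with the Karlin-Altschul relation, E = K·m·n·exp(-λS), where m and n are the lengths of the query and of the database. The function names are hypothetical, and the default values of λ and K are illustrative assumptions of the order reported for ungapped BLOSUM62 scoring; BLAST derives the actual parameters from the scoring system in use and applies length corrections that are omitted here.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance HSPs scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalized score, so that E = query_len * db_len * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)
```

Because the score enters through an exponential, a modest increase in raw score lowers the E value by orders of magnitude, while doubling the database size only doubles it.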
There are several variations of the BLAST algorithm optimized for different types of sequences based on criteria of similarity and length. One such variation is position-specific iterated BLAST, or PSI-BLAST (Altschul et al., 1997), which is optimized for identifying distantly related proteins that may not be detected using the traditional BLAST method. PSI-BLAST is more sensitive than traditional BLAST because it relies on a position-specific scoring matrix (PSSM), also referred to as a profile. The profile is essentially a matrix containing sequence information in the form of frequency scores, encoding the common characteristics of a multiple sequence alignment; hits are hence scored against position-specific values rather than a single general substitution matrix. Given a query sequence, PSI-BLAST performs a standard BLASTP search in the first iteration and uses the matches to construct a PSSM, which then serves as the query in all subsequent searches of the target database. In each iteration, specific sites are given more or less weight, and conservation at those sites is considered more or less important in choosing further hits. This allows the algorithm to find distant homologs.
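The idea of a profile can be made concrete with a minimal sketch that converts the columns of a small alignment into log-odds scores. This is a simplification under stated assumptions: the function names, the uniform pseudocount, and the caller-supplied background frequencies are hypothetical choices, whereas PSI-BLAST itself uses sequence weighting and data-dependent pseudocounts.

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are simply ignored.
        counts = Counter(c for c in column if c in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background[r])
            for r in alphabet
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, segment))
```

A residue that dominates a column receives a positive score at that position only, so a conserved site contributes more to the ranking of further hits than a variable one.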

1.5.2 Structure Prediction

Protein structure prediction refers to the attempt to computationally characterize the 3D structure of a protein from its primary (amino acid) sequence. Being one of the central themes of modern bioinformatics, it has attracted enough attention to be considered a field in its own right, as structure characterization is important for making functional inferences about any specific protein. There are three methods used to predict protein structure, and they can be classified as template-based (homology modeling and protein threading) or template-free (ab initio modeling). Template-based modeling rests on the generally accepted observation that, where proteins are concerned, structure is better conserved than sequence. It is possible for proteins that have diverged beyond detectable sequence similarity to retain topological features of their ancestral fold. The converse, however, does not generally hold; proteins that are not structural homologues would not be expected to have similar sequences. More specifically, it has been estimated that proteins with 30-50% sequence identity share at least 80% of their structure, those with 20-30% share at least 55%, and those below 20% may display 30% or less structural similarity (Ginalski, 2006). Based on these observations, atomic resolution models of a protein with unknown structure can be generated from an experimentally determined structure (the template structure) if one is available. Template-based modeling includes the techniques of homology modeling and protein threading, both of which offer structural predictions by attempting to build on prior knowledge about protein structure and to extrapolate these principles towards the generation of new structures. When no template structure is available, the spatial coordinates of a query protein can be predicted using ab initio, or template-free, modeling.

1.5.2.1 Homology Modeling

Homology modeling is the most powerful of the three predictive methods that are currently available and works by generating detailed three-dimensional structures of proteins based on the coordinates of known homologs found in the PDB. The premise of the method is that structures of evolutionarily related proteins are better conserved than their primary sequences. The evolutionary relationship is inferred by sequence comparison, with a typical sequence alignment algorithm querying the sequence of the target protein against a database of proteins with known 3D structure. In principle, homology modeling is a multi-step process and can be divided into eight steps, summarized in figure 1.11.

Figure 1.11. The steps involved in homology modeling

A fragment of template corresponding to the region aligned with the target sequence forms the basis of the model. After alignment correction the short unaligned regions and missing side-chains are predicted, and then the model is optimized and validated. Figure adapted from (Gu and Bourne, 2009)

Initial alignment and template recognition

Sequence alignment programs are used to identify templates by comparing the target sequence to all the sequences of known protein structures in the Protein Data Bank (PDB).

Alignment Correction

When acceptable modeling templates have been identified, more sophisticated methods that produce more accurate alignments between the target sequence and its templates are required. Aligning two sequences in a region where percentage sequence identity is very low is notoriously difficult, and multiple alignments that use large numbers of protein or DNA sequences are necessary. It is not uncommon for manual inspection and adjustment of the alignment to be necessary.

Backbone Generation

The actual model building begins after the alignment has been produced. The protein backbone is created by transferring the coordinates of those template residues that are matched with target sequence residues. When two aligned residues differ, only the coordinates of the polypeptide backbone (N, Cα, C, and O atoms) can be copied. Conserved residues are copied in full, including their side-chains, to provide an initial guess of their conformation. In the case where multiple templates are used, the coordinates are weighted in an algorithm-specific manner.
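The coordinate transfer described above can be sketched for a single template as follows. The data layout (one dictionary of atom names to coordinates per template residue), the function name, and the handling of gaps are assumptions made for illustration; they do not describe any specific modeling package.

```python
BACKBONE = {"N", "CA", "C", "O"}


def transfer_coordinates(template_atoms, aligned_template, aligned_target):
    """Build an initial model as a list of (residue, {atom: (x, y, z)}).

    template_atoms holds one {atom name: coordinates} dict per template
    residue; the two aligned strings have equal length and use '-' for gaps.
    """
    model = []
    index = 0  # position in template_atoms
    for t_res, q_res in zip(aligned_template, aligned_target):
        atoms = None
        if t_res != "-":
            atoms = template_atoms[index]
            index += 1
        if q_res == "-":
            continue  # template residue with no counterpart in the target
        if atoms is None:
            model.append((q_res, {}))  # insertion: left for loop modeling
        elif t_res == q_res:
            model.append((q_res, dict(atoms)))  # conserved: copy every atom
        else:
            backbone = {n: xyz for n, xyz in atoms.items() if n in BACKBONE}
            model.append((q_res, backbone))  # substituted: backbone only
    return model
```

Residues returned with an empty or backbone-only atom set are the ones that the later loop modeling and side-chain modeling steps have to complete.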

Loop Modeling

Virtually all multiple sequence alignments will contain gaps. Gaps in the target sequence are treated by omitting residues from the template structures, thereby creating a vacancy in the initial model. Gaps in the templates are addressed by inserting the missing residues into the contiguous backbone of the initial model. Both cases imply a conformational change in the backbone, but conformational changes do not in principle occur in secondary structure elements, hence insertions or deletions in the alignment can be shifted out of α helices and placed into loops.

Side-chain Modeling

Side-chain replacement is necessary to correct for side-chains that have been altered due to the alignment process or the loop insertion and deletion process. The conformations of side-chains (also referred to as rotamers) in proteins with high (> 40%) sequence identity are similar due to the formation of networks of contacts, and in such cases rotamers can be directly copied from template to model.

Model Optimization

Correct prediction of 3D protein structure requires iterative steps of both side-chain modeling and backbone modeling, followed by energy minimisation. To achieve this, an accurate energy function (also termed a “force field”) is employed to guide the model structure towards the correct minimum on the energy landscape. The underlying algorithms that search for energy minima are often based on Monte Carlo statistical methods, a class of computational algorithms that rely on repeated random sampling.
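The principle of Monte Carlo energy minimisation can be sketched with the Metropolis acceptance rule applied to a one-dimensional toy energy function. The function name, step size, temperature, and number of steps are hypothetical values chosen for illustration; a real force field operates on thousands of atomic coordinates and far more elaborate move sets.

```python
import math
import random


def metropolis_minimize(energy, start, steps=5000, step_size=0.5,
                        temperature=0.1, seed=0):
    """Random-walk search that keeps track of the lowest-energy state visited."""
    rng = random.Random(seed)
    x, e = start, energy(start)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        # Downhill moves are always accepted; uphill moves are accepted with
        # Boltzmann probability, which allows escape from local minima.
        if e_candidate <= e or rng.random() < math.exp(-(e_candidate - e) / temperature):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

The occasional acceptance of unfavorable moves is what distinguishes this kind of sampling from a strict downhill search, at the cost of requiring many evaluations of the energy function.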

Iteration

Iteration of some of the above steps of homology modeling may be necessary to correct for errors in the protein model. Although most errors that occur are small, significant errors such as those in computing protein backbone conformation usually require readjustment of the multiple sequence alignment or selection of a new template altogether, thus rendering iteration of the modeling procedure imperative to achieve the best possible model quality.

1.5.3 GPCR Information Repositories

On account of the significance of GPCRs in physiology and pharmacology, there have been several initiatives to produce publicly accessible web databases tailored for the storage and dissemination of information pertinent to GPCRs. Repositories collecting sequence-level and structural information, as well as functional annotations include PRINTS (Attwood, 2002), GPCRDB (Vroling et al., 2011) and GPCRRD (Zhang and Zhang, 2010), amongst several others.

PRINTS

Perhaps the most comprehensive database of sequence-level information is the PRINTS database. The classification approach used by PRINTS is that of a fingerprint, which is a characteristic signature constructed by incorporating several highly conserved regions (motifs) that can be used to characterize a GPCR sequence. PRINTS currently holds over 2050 fingerprints encoding more than 12000 sequence motifs, which are not limited to GPCRs and include other protein families. Although it is a database containing exclusively sequence-level descriptors, its power resides in being able to classify novel sequences into their constituent family and subfamily. URL: http://prints.cs.man.ac.uk/

GPCRDB

The GPCRDB is a database that collects, combines, validates, and disseminates large amounts of heterogeneous data on G protein coupled receptors. The contents of GPCRDB are divided into three classes: primary, secondary, and tertiary data. Primary data consists of experimentally verified sequence data, ligand-binding constants, mutant information, structural data, and oligomer interactions. Data inferred from primary data such as multiple sequence alignments, homology models, and other such computationally derived data are classified as secondary data. Finally, tertiary data comprise interpretations provided by curators and other manual annotations. URL: http://www.gpcr.org/7tm

GPCRRD

The main purpose of GPCRRD is to automatically retrieve and update experimental information for GPCR 3D structure modeling and functional annotation. The GPCRRD contains sequences, mutations, electron microscopy, neutron diffraction, site-directed mutagenesis, FTIR (Fourier Transform Infrared) Spectroscopy, disulfide bridge and X-ray crystallographic data that are regularly collected from the literature and from the original sources such as GPCRDB, UniProt, and EMBL. Additionally, GPCRRD is hyperlinked to GPCRDB so that certain annotations can be accessed immediately. URL: http://zhanglab.ccmb.med.umich.edu/GPCRRD

1.6 Aims and Objectives

An important part of the effort to understand the inner workings of GPCR mechanisms is the effort to pinpoint specific residues, regions, and sequence motifs that are involved in their prominent functions, e.g. ligand binding and G protein coupling. The underlying hypothesis of this work is that charged residues in GPCRs will be functionally important, since it is energetically unfavorable for amino acids with charged side-chains to be buried in hydrophobic environments that are not solvent accessible. Therefore, the presence of charged groups in TM regions of GPCRs may indicate some aspects of function. The aim of this study was to take methods that have successfully characterized functional regions in other classes of proteins and apply them to GPCRs, addressing the questions of (i) how charge localized to the transmembrane regions is distributed and (ii) whether charged areas correlate in any way with functional areas. Ultimately, if charged residues coincide with functional residues in a significant way, then such studies could be used to identify novel, uncharacterized functional areas through experimentation. The current work focuses on ascertaining to what extent pinpointing charged residues within GPCR sequences can help guide the computational search for functional areas within this important class of receptors. To this end, a dataset of GPCRs was constructed so that trends regarding charge distribution and correlations between charged regions and functional regions could be established. The responses of such regions and their sensitivity to changes in pH were measured quantitatively so that predictions concerning their functionality could be made.

Chapter 2. Methods

This chapter presents the technical framework of the project, including the methods used and some of the programming tasks that required implementation. An overview of the technical aspects is presented with the tools and programming tasks that were used throughout to address issues related to sequence analysis, structural modeling, and molecular visualization. The pKa calculations that were used to determine if identified residues were functionally important are outlined. The crystal structure of the human β2 adrenergic receptor (PDB code 3NY8) was selected as a modeling template.

2.1 Sequence Analysis Methodologies

A dotplot was used to determine whether low-complexity regions existed that could potentially affect the BLAST searches carried out in the process of generating a dataset. The hits from these searches were then used to build a multiple sequence alignment whose sequences served as the GPCR dataset for which homology models were constructed.

2.1.1 Detecting Low-Complexity Regions

It is often the case that protein sequences contain large numbers of tandem repeats, or identical residues that appear adjacent to one another several times. Such a pattern causes an overrepresentation of residues in a sequence and introduces a bias that may affect similarity searches. In order to detect whether the sequence of 3NY8 contains regions of biased composition such as tandem repeats, a dotplot was created. Although the conventional use of dotplots is to compare two sequences in a non-statistical, visually informative way, they can be used to test for low-complexity regions by comparing a sequence against itself. The rationale underlying this is that low-complexity regions tend to complicate sequence analysis studies by yielding false positive results in BLAST-like similarity searches. Indeed, low-complexity segments are a leading cause of contamination in automated iterative searches, and although BLAST searches have a parameter that allows for filtering such regions, it was considered appropriate to identify any such regions in the template sequence. The dotplot was therefore used to ensure that there was no inherent propensity for the template sequence to yield false positive hits in subsequent similarity searches. Current similarity search tools are invariably prone to producing false positive hits, so a protein sequence with substantial low-complexity components would likely compound this shortcoming, potentially undermining the quality of the GPCR dataset if too many false positive sequences were included. The figures below illustrate dotplots of the β2 adrenergic receptor (UniProtKB P07550) and of human mucin-1 (UniProtKB P15941, used as a control), both compared against themselves. The dotplots were rendered with the Dotlet program (Junier and Pagni, 2000), using a BLOSUM62 matrix and a sliding window of 51 residues for both the query and the control sequences.

Figure 2.1. Dotplots of template sequence and control sequence. Panels: (A) B2AR, (B) mucin.

Using Dotlet to identify tandem repeats in A) human β2 adrenergic receptor sequence and B) human mucin. Both sequences have been aligned to themselves. The output of (B) shows the signature of low-complexity sequences: a series of diagonals representing repetitive occurrence of a single residue.

Mucin was chosen as a control as it is a prime example of a protein with significant low-complexity components; specifically, it contains 42 tandem repeats of a 20-amino acid stretch consisting mostly of proline, serine, and threonine (P, S, and T respectively). Because identical residues will align position by position with each other, repetitive sequences will appear on a dotplot as a series of diagonals enclosed in squares. As is illustrated in figure 2.1, the sequence of 3NY8 does not have any discernible low-complexity regions, so it is not in itself prone to producing false positive hits.
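The self-comparison underlying such a dotplot can be sketched in a few lines. This is a simplified stand-in for Dotlet rather than a reproduction of it: windows are compared by fraction of identical residues instead of BLOSUM62 scores, and the function names, the window of 5 residues, and the 80% identity cutoff are assumptions chosen so that the toy example stays small.

```python
def self_dotplot(seq, window=5, min_identity=0.8):
    """Return the set of window start positions (i, j) that match each other."""
    n = len(seq) - window + 1
    dots = set()
    for i in range(n):
        for j in range(n):
            same = sum(a == b for a, b in zip(seq[i:i + window], seq[j:j + window]))
            if same / window >= min_identity:
                dots.add((i, j))
    return dots


def off_diagonal_fraction(seq, window=5, min_identity=0.8):
    """Fraction of off-diagonal cells that are filled; 0.0 means no repeats."""
    n = len(seq) - window + 1
    if n < 2:
        return 0.0
    dots = self_dotplot(seq, window, min_identity)
    off_diagonal = sum(1 for i, j in dots if i != j)
    return off_diagonal / (n * n - n)
```

The main diagonal is always filled because every window matches itself; it is the additional parallel diagonals, quantified here as the off-diagonal fraction, that betray tandem repeats.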

2.1.2 PSI-BLAST

In order to construct a dataset consisting of GPCR models, a sequence alignment containing an adequate number of protein sequences had to be obtained. The modeling pipeline that was used requires a multiple sequence alignment in CLUSTAL format as input and constructs one homology model per sequence member. The PSI-BLAST (Position-Specific Iterated BLAST) algorithm was chosen as it is particularly well suited to identifying distant relatives of a query sequence; the maximum number of related sequences was desired so that the GPCR dataset would be of adequate size. This method is more sensitive than the traditional BLASTP because it relies on a position-specific scoring matrix (PSSM), also referred to as a profile, which is essentially a matrix containing sequence information in the form of frequency scores, encoding the common characteristics of a multiple sequence alignment. Given a query sequence, PSI-BLAST performs a standard BLASTP search and uses the matches to construct a PSSM, which then serves as the query in all subsequent searches of the target database. Several PSI-BLAST searches were performed with 3NY8 as the query, varying parameter values. The parameters used to construct the final multiple sequence alignment serving as the basis for generating the dataset are summarized in table 2.1.

Table 2.1. PSI-BLAST search parameters

Parameter                          Setting
Algorithm                          BLASTP (PSI-BLAST)
Database                           Non-redundant protein sequences (nr)*
Exclude                            Models, uncultured/environmental sample sequences
Maximum target sequences           1000
Expect threshold (E value)         10
Word size                          3
Maximum matches in a query range   0
Matrix                             BLOSUM62
Gap costs                          Existence: 11, Extension: 1
Compositional adjustments          Conditional compositional score matrix adjustment
Filter                             –
PSI-BLAST threshold                0.005
Pseudocount                        0

*GenBank CDS translations, PDB, SwissProt, PIR, PRF (25,626,282 sequences)

Three iterations of the search were performed, each with the maximum cutoff specified as 1000 sequence matches. After this point, no further hits were detected and the final number of detected GPCR sequences was 498 (see appendix A for a full list of sequence names). The web-accessible alignment program CLUSTALW (Thompson et al., 1994) was used to generate the necessary multiple sequence alignments.

2.2 Structural Analysis Methodologies

Homology Model Generation

Template-based modeling includes the techniques of comparative modeling and protein threading, both of which offer structural predictions by attempting to build on prior knowledge about protein structure and to extrapolate these principles towards the generation of new structures. When no template structure is available, the spatial coordinates of a query protein can be predicted using ab initio (template-free) modeling. Although this method requires neither sequence homology nor a template, and relies mainly on physical principles as opposed to previously solved structures, building models for loop regions, which would have necessitated template-free methods, was considered too challenging. At first sight, this may appear somewhat contradictory, taking into account that the loop regions (i.e. ECL and ICL domains) are the primary interfaces for functional interactions such as ligand binding and G protein coupling. However, loops pose great challenges to comparative modeling: in the absence of suitable templates they must be reconstructed ab initio, which is difficult. The existing GPCR crystal structures show significant variation in loop regions in terms of sequence composition and secondary structure elements (Katritch et al., 2011), and loop regions are hence not amenable to template-based methods. Taking into account the extensive sequence and structural variations found in ECLs and ICLs, it was preferred to focus on the most conserved structural domain of the GPCR superfamily, i.e. the 7TM domain; charges located in this domain would also be buried as opposed to solvent accessible and thus likely to be functional. Moreover, given current standards, where datasets of a couple of hundred sequences are considered small, having to construct loop regions for even a few hundred models would have been too computationally demanding.
With the goal of studying functionally relevant charge clusters across the superfamily of GPCRs, and taking into consideration the large number of proteins that would have to be modeled so that thorough conclusions could be established, comparative modeling was preferred as a technique in order to ensure the best possible accuracy. Hence, in spite of template-free methods currently being the only efficient way of studying structural domains with extensive variability (ECL and ICL domains are both prime examples of such regions), comparative modeling was deemed appropriate since the focus of the current work was restricted to the transmembrane domain of GPCRs. An in-house computational modeling pipeline was used to generate homology models from a template structure; the pipeline works by performing side-chain replacement on a fixed main-chain (Cole and Warwicker, 2002). Side-chain rotamers possessing minimal solvent accessibility are selected in this procedure. Side-chain rotamer packing is calculated using a modified mean-field algorithm (Koehl and Delarue, 1994) to assess maximal solvent accessibility for each ionizable group. This algorithm is based on the correlation of two variables: (i) the φ-ψ torsion angles (which determine backbone conformation) and (ii) the χ dihedral angles (which define side-chain orientation). Protein tertiary structure is strongly dependent on these two variables because of the effects side-chain packing has on protein stability. The method uses a rotamer library (a collection of energetically favorable conformations) to iteratively refine a conformational matrix CM, such that at any point CM(i,j) represents the probability that side-chain i adopts rotamer conformation j. Each residue feels the average of all plausible conformations of the other residues, weighted by their respective probabilities. The rotamer with the highest probability is selected to represent a residue’s side-chain orientation.
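The self-consistent refinement of the conformational matrix can be sketched as follows. This is a minimal illustration of the general mean-field idea and not the modified algorithm used by the pipeline: the function names, the energy inputs, the thermal factor kT, and the fixed number of sweeps are assumptions, and the solvent accessibility terms mentioned above are not modeled.

```python
import math


def mean_field_rotamers(self_energy, pair_energy, kT=1.0, sweeps=50):
    """Refine CM, where CM[i][j] is the probability of rotamer j at residue i.

    self_energy[i][j] is the energy of rotamer j at residue i on its own;
    pair_energy(i, j, k, l) is the interaction energy between rotamer j of
    residue i and rotamer l of residue k.
    """
    # Start from a uniform matrix: every rotamer is equally likely.
    cm = [[1.0 / len(row)] * len(row) for row in self_energy]
    for _ in range(sweeps):
        for i, row in enumerate(self_energy):
            fields = []
            for j, e_self in enumerate(row):
                e = e_self
                # Average interaction with all other residues, weighted by
                # the current probabilities of their rotamers.
                for k, probs in enumerate(cm):
                    if k != i:
                        e += sum(p * pair_energy(i, j, k, l)
                                 for l, p in enumerate(probs))
                fields.append(e)
            e_min = min(fields)
            weights = [math.exp(-(e - e_min) / kT) for e in fields]
            total = sum(weights)
            cm[i] = [w / total for w in weights]
    return cm


def best_rotamers(cm):
    """Pick the most probable rotamer index for every residue."""
    return [max(range(len(row)), key=row.__getitem__) for row in cm]
```

For two residues whose first rotamers clash with each other, the procedure settles on the combination in which the residue with the weaker intrinsic preference gives way, which is the behavior expected from the averaging described above.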
The modeling approach therefore differs from the generic comparative modeling procedure discussed in the introduction (outlined in figure 1.11), as it does not perform alignment correction, loop modeling, or iteration. Instead, it focuses on side-chain replacement such that solvent accessibility and steric clash minimization are satisfied.

2.3 GPCR Dataset Generation

In order to study the distribution of ionizable side-chains throughout the GPCR superfamily, the first issue to be addressed was the size and scope of the dataset. It was decided that the investigation would only consider human GPCR sequences, so that the number of generated protein homology models would not become excessive. Prior to the modeling procedure, a template structure had to be identified. The crystal structure of the human β2 adrenergic receptor (PDB code 3NY8), solved in complex with an inverse agonist (Wacker et al., 2010), was selected as a template structure. Although only a limited number of GPCRs have been successfully crystallized, the 3NY8 structure was specifically chosen as a template because it has been solved with its transmembrane helices arranged parallel to the z-axis (in a Cartesian frame of reference). This meant that all models generated using the comparative modeling pipeline would be built in the same orientation, i.e. aligned to the 3NY8 7TM position. The z-axis spatial coordinates were subsequently used to delineate the intracellular and extracellular regions. This was done in an empirical manner, using the program Swiss PDB Viewer (Guex and Peitsch, 1997) to estimate where the boundaries between loops and helices were located. Having chosen a template structure, a multiple sequence alignment was constructed so that homology models could be generated to form a dataset across which charge distribution would be examined. As mentioned, the multiple sequence alignment that was generated using the PSI-BLAST algorithm on the NCBI webserver consisted of 498 human GPCR sequences. This alignment was converted to CLUSTAL format and fed into the comparative modeling pipeline that built GPCR models for each sequence member; the pipeline was run on a University Unix cluster and all generated models were rendered as PDB-formatted files. These PDB files were subsequently analyzed in terms of charge distribution.

2.4 PDB File Processing

Once homology models for all 498 multiple sequence alignment members were generated, the PDB files of the GPCR models were used to begin investigating charged residues. Due to the complexity and information-rich structure of PDB files and the extent to which they were used, a series of text parsing scripts was developed to extract relevant coordinate information so that the distribution of charged residues could be studied. Taking into account that all pipeline-generated comparative models would be aligned to the template structure 3NY8, whose transmembrane helices are arranged parallel to the z-axis, it was necessary to focus on the z-coordinates of the generated homologues. The extraction of atomic z-coordinates, so that the distribution of charge-possessing residues could be studied, is a typical text-processing task, and the Python programming language (version 2.7.3) was used. Python was chosen because of its syntactical simplicity, text processing routines, and numerical analysis packages (SciPy/NumPy – readily available as Python extensions). The first program worked by reading any file adhering to the PDB format and extracting the z-coordinate of each atom, which was subsequently used to generate a histogram that displayed the distribution of charged residues over the GPCR dataset. The script was implemented so that only the charged chemical groups of the five amino acids that are ionizable at neutral pH (lysine, arginine, histidine, aspartic acid, and glutamic acid) were extracted. Practically, this meant that for each charged amino acid in the primary sequence of a GPCR model, only one atomic z-coordinate would be extracted; this was always the atom central to the ionizable group. Table 2.2 summarizes the charged residues in terms of the atoms that confer electrical charge.

Table 2.2. Charged atom indicators for ionizable side-chains

Residue     Ionizable group
LYS (K)     NZ
ARG (R)     CZ
HIS (H)     CE1
ASP (D)     CG*
GLU (E)     CD

*Technically, an aspartic acid side-chain has its charge delocalized between the two oxygen atoms of the carboxyl group (denoted OD1 and OD2 in PDB nomenclature) but when approximating free charge the side-chain’s carboxylic carbon atom (CG) is used.
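The extraction step described above can be illustrated with a minimal sketch. It assumes standard fixed-column PDB ATOM records and the atom names of table 2.2; the function name `charged_atom_z` is hypothetical, and the code is written for Python 3, so it should not be read as the script used in this study.

```python
# Atom taken to represent the charge of each ionizable side-chain (table 2.2).
CHARGE_ATOMS = {"LYS": "NZ", "ARG": "CZ", "HIS": "CE1", "ASP": "CG", "GLU": "CD"}


def charged_atom_z(pdb_lines):
    """Yield (residue name, chain, residue number, z) for each charge atom.

    Assumes the fixed-column layout of PDB ATOM records: atom name in
    columns 13-16, residue name in 18-20, chain in 22, residue number
    in 23-26 and the z-coordinate in columns 47-54.
    """
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skip HETATM, REMARK, TER and other record types
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        if CHARGE_ATOMS.get(res_name) != atom_name:
            continue  # not the designated atom of a charged residue
        yield res_name, line[21], int(line[22:26]), float(line[46:54])
```

Calling `list(charged_atom_z(open(path)))` on a model file (where `path` is a placeholder) would give one entry per charged residue, which is the input the later frequency analysis needs.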

Subsequent programs that were written all performed tasks related to measuring the frequency of charged residue occurrence and highlighting regions within the GPCR helices that contained high frequencies of such residues. The trends that emerged in charge distribution throughout the GPCR dataset (explained in the following chapter) led to a set of interesting residues, which were further considered using pKa calculations.


2.5 pKa Calculations

Having computationally characterized transmembrane charges likely to be removed from the bulk aqueous environment, it was necessary to (i) assess their likely charge state (positive, negative, or overall neutral) with pKa calculations, and (ii) perform literature and database searches for annotations relevant to these charge locations. The pKa values were calculated with a software pipeline tailored for charge calculations (Warwicker, 2004), which has been previously applied to membrane proteins (Liu et al., 2007). The employed methodology was FD/DH (Finite Difference/Debye-Hückel), which combines the Finite Difference Poisson-Boltzmann (FDPB) technique with Debye-Hückel (DH) theory to predict pKa values that are applicable both to exposed, solvent-accessible ionizable groups and to groups buried in the protein interior. A recent benchmark study of pKa prediction software reported that FD/DH performed on par with comparable methods (Alexov et al., 2011). Calculations of pKa values for the extracted residues were performed using both the FDPB (henceforth referred to as FD) and FD/DH methods. Additionally, all calculations were performed under two biophysical environments: one including a pseudo-membrane cylindrical slab of non-polar atoms around the GPCR and one with no slab. The slab consists of a cylindrical arrangement of “dummy” atoms around the GPCR, possessing a geometry that roughly resembles the cell membrane (figure 2.2).


Figure 2.2. Cylinder of pseudo-atoms enclosing GPCR model

The pseudo-membrane cylinder (viewed here from above) has a radius of approximately 40 Å and fully encloses 3NY8.

The cylindrical slab is not a precise atomic model of the phospholipid bilayer, as this would require conformational sampling from a molecular dynamics simulation and would substantially increase the time and scale of work required. Instead, the slab was used to simulate the low dielectric membrane environment where GPCRs reside. Each pKa value, aside from being calculated with and without the pseudo-membrane slab, was calculated for two dielectric values (diel = 4, diel = 10). The lower dielectric for protein/slab is that assigned to the FD calculations, while the higher dielectric was included to account for the possibility that a membrane environment with charges may have similarities with the interior of engineered proteins (3NY8 crystal structure) possessing introduced charges (Warwicker, 2011). The interior of engineered proteins may simulate the type of environment that lipids provide, potentially allowing space for water clusters to develop around membrane-buried charges. This range of calculations was used to predict the charge state (charged vs. neutral) of each location pinpointed by the analysis of ionizable group distribution in the dataset of GPCR models.
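The combinations of method, slab and dielectric described above amount to eight calculations per residue. A small sketch of that grid follows; the variable names are illustrative only, and the snippet merely enumerates the conditions rather than invoking the pKa software.

```python
from itertools import product

METHODS = ("FD", "FD/DH")   # the two calculation methods
SLAB = (False, True)        # without and with the pseudo-membrane slab
DIELECTRICS = (4, 10)       # the two dielectric values considered

# Every (method, slab, dielectric) combination, eight in total.
conditions = list(product(METHODS, SLAB, DIELECTRICS))

for method, slab, diel in conditions:
    slab_label = "yes" if slab else "no"
    print(method, "slab:", slab_label, "diel:", diel)
```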


Each ionizable group was then mapped back to a location within a representative human GPCR containing that group. A search was then performed using Pubmed as well as GPCR-specific databases (GPCRDB, GPCRRD) for any relevant information. UniprotKB database entries for each representative GPCR were also searched for annotations specific to the residue number of the corresponding ionizable group. The methods used throughout the study are outlined in figure 2.3.

Figure 2.3. Methods for Charge Studies on GPCRs

Template selection (3NY8)

PSI-BLAST search for related sequences

CLUSTALW alignment to generate dataset*

Computational pipeline for comparative modeling (side-chain replacement using Cole-Warwicker Method)

Charge coordinate analysis

Retrieval of buried transmembrane ionizable groups

pKa calculations (FD & FD/DH) for TM charged residues to assess potential functional role

Literature search in Pubmed and GPCR databases for functional annotations

*Refer to appendix A for details of dataset

2.6 Molecular Visualization

The criteria for deciding on the boundaries between ICL, ECL, and TM domains were based on the spatial coordinates of the template β2-adrenergic receptor. The z-coordinates of atoms that were judged to be located at the termini of TM helices were collected and averaged so that the approximate boundary could be used to decide whether a particular atom was located within the transmembrane helices or on the loop regions. Since all homology models would be built in alignment with the 3NY8 orientation, the estimation process was considered valid for the entire dataset. Figure 2.4 shows the structure of the β2 adrenergic receptor with an inverse agonist.


Figure 2.4. Structure of human β2 adrenergic receptor

The crystal structure of the human β2 adrenergic receptor (PDB ID 3NY8) showing the receptor (white) solved with an engineered fusion T4 lysozyme protein (yellow).


Chapter 3. Results

As was outlined in the previous chapter, a computational pipeline was used to create comparative models for the generated dataset, and subsequently the distribution of charged groups was analyzed.

3.1 Empirically defined GPCR topology

Prior to performing any modeling, it was necessary to decide how to differentiate transmembrane regions from loop regions, since this study focused on the 7TM scaffold common to all GPCR proteins. This was necessary so that throughout the entire dataset, the ionizable groups retrieved could be spatially categorized as belonging to the 7TM domain or the loop regions. The empirically defined boundaries between ICL, ECL, and 7TM were estimated based on the template structure 3NY8 using the Swiss PDB Viewer package. The z-coordinates of atoms that were judged to be located at the termini of TM helices were then collected and averaged so that the approximate spatial coordinate boundary could be used to decide whether a particular atom was located within the transmembrane helices or on the loop regions. Because all homology models were generated so that they would be aligned to the template structure, the coordinates used as physical boundaries were valid across the entire dataset. Although this methodology is purely empirical, it presented the most straightforward way of classifying coordinates as either transmembrane or ECL/ICL. The z-coordinate cutoff limits for TM helices were determined as being 19.7 – 60.9 Å. Using a Cartesian frame of reference, the layout of all comparative models in the dataset is summarized in table 3.1.

Table 3.1. Delineation of GPCR Domains in Cartesian Coordinates

Region           z-coordinate range (Å)
Intracellular    0 – 19.6
Transmembrane    19.7 – 60.9
Extracellular    61.0 – 80.0
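As an illustration of how these limits can be applied, the sketch below averages a set of helix-terminus z-coordinates and assigns a z-coordinate to a region. The function names are hypothetical, and treating every value below 19.7 Å as intracellular and every value above 60.9 Å as extracellular is an assumption about how the gaps between the tabulated ranges are handled.

```python
TM_LOWER = 19.7  # Å, lower transmembrane limit from table 3.1
TM_UPPER = 60.9  # Å, upper transmembrane limit from table 3.1


def mean_boundary(terminus_z_values):
    """Average the z-coordinates of atoms judged to lie at helix termini."""
    return sum(terminus_z_values) / len(terminus_z_values)


def region(z):
    """Assign a z-coordinate (Å) to one of the three regions of table 3.1."""
    if z < TM_LOWER:
        return "intracellular"
    if z <= TM_UPPER:
        return "transmembrane"
    return "extracellular"
```

For example, `region(38.0)` returns `"transmembrane"`, which is where the conserved D79 peak discussed later falls.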


3.2 GPCR Sequence Dataset

The dataset generated using PSI-BLAST that was subsequently used for building models consisted of 498 sequences, the overwhelming majority of which belonged to the rhodopsin family. Figures 3.1A – B summarize the composition of the dataset and table 3.2 lists the number of sequences in each receptor subfamily.

Table 3.2. GPCR Dataset Sequence Composition

Receptor Family     Sequence Count   Proportion of Dataset
Adrenergic          71               14%
Dopamine            41               8%
5HT (Serotonin)     59               12%
Melanocortin        4                1%
Somatostatin        15               3%
Angiotensin         7                1.5%
Adenosine           16               3%
Histamine           19               4%
Olfactory           7                1.5%
Chemokine           11               2.2%
Opioid              30               6%
Neuromedin          9                2%
Neuropeptide        19               4%
Orexin              4                1%
Muscarinic          13               2.5%
Cholecystokinin     9                2%
Galanin             5                1%
Amine               11               2.2%
Generic GPCR        25               5%
Unnamed             30               6%
Other*              93               18.5%

*Includes sequences of unnamed GPCRs, generic 7TM helix receptors, GPCR variants, isoforms


Figure 3.1. GPCR Dataset Composition

A. Receptor Family Representation in Dataset (bar chart of Sequence Count, 0 to 100, for each receptor family listed in table 3.2)

B. GRAFS Classification Receptor Family Representation (categories: Glutamate, Rhodopsin, Adhesion, Frizzled, Secretin)

(A) The PSI-BLAST search yielded 498 hits consisting of sequences from a wide range of receptor families, mainly from adrenergic receptors. (B) The composition of the dataset in terms of the GRAFS classification system. The sequences belong almost entirely within the rhodopsin family, with a small minority being from the glutamate family.


3.2.1 The distribution of ionizable groups in the GPCR dataset

The GPCR dataset consisted of 498 models, and in order to study the occurrence of ionizable groups on charged residues their distribution was plotted as a histogram. Figure 3.2 displays the frequency of all charged residues (lysine, arginine, histidine, aspartic acid, glutamic acid) as a function of atomic spatial coordinates in a Cartesian frame of reference (x, y, z).

Figure 3.2. Frequency of Ionizable Groups vs. Cartesian Location

A. Absolute Frequency vs. Ionizable Atom Coordinate


B. Relative Frequency vs. Ionizable Atom Coordinate

The histogram plots illustrate the occurrence of z-coordinate values corresponding to ionizable atoms of one of the five amino acids with charged side-chains (see table 2.2) throughout the dataset. These are plotted in (A) absolute frequency and (B) relative frequency (normalized frequency).
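The binning behind these plots can be sketched in a few lines. This is a minimal illustration that assumes bins of width 1 Å labeled by the integer part of the z-coordinate (so that bin 22 covers 22.0 up to, but not including, 23.0); the function names are hypothetical and the labeling convention is an assumption, not a detail confirmed by the text.

```python
import math
from collections import Counter


def z_histogram(z_values):
    """Count ionizable-atom z-coordinates in bins of width 1 Å."""
    counts = Counter(math.floor(z) for z in z_values)
    return dict(sorted(counts.items()))


def relative_frequency(histogram):
    """Normalize absolute bin counts so that they sum to one."""
    total = sum(histogram.values())
    return {z_bin: count / total for z_bin, count in histogram.items()}
```

Because the counts are per atom rather than per model, a single model can contribute several atoms to the same bin, so a bin count is not capped by the number of models.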

The delineation of ICL, ECL regions and transmembrane helices was performed empirically by using the Swiss PDB Viewer to visualize the arrangement of the structure elements on the chosen template structure. The charge distribution illustrated in figure 3.2 reveals that charged residues are largely present in the loop regions (where the majority of peaks are located) and mostly absent from the transmembrane region. This observation is in line with the generally accepted paradigm of GPCR function: largely diverse extracellular loops and considerably variable intracellular loops accommodate ligand binding and G protein coupling mechanisms that cover a vast range of ligands and a more restricted range of G protein coupling interactions. The transmembrane region, by contrast, is largely devoid of peaks, reflecting that the 7TM platform is generally less functionally specialized. It is important to note that the frequencies refer to ionizable atoms within spatial coordinates (depicted as bins with size 1 in each histogram) as opposed to residues; this implies that a bin may correspond to multiple residues, as each z value slice represents an imaginary plane intersecting a GPCR molecule. Several bins show frequencies above 500 (approximately the size of the dataset) for this reason; they reflect the total ionizable atoms of charged side-chains.

The peaks that do occur within the empirically defined transmembrane region were an interesting finding, as their functional role was the underlying question of this investigation. As a validation step, the most salient transmembrane peaks were cross-referenced in the scientific literature using databases with GPCR sequence and functional annotations (e.g. GPCRDB and GPCRRD, see section 1.5.3) as well as Pubmed to verify that they corresponded to known functional residues. This was done to verify that the distribution of charged groups was biologically plausible; it would be doubtful if groups with frequencies greater than 100 occurrences in the GPCR dataset had never before been recorded. The peaks were divided into groups so that charged residues with atomic coordinates corresponding to each group could be further examined. Table 3.3 summarizes the peak groups and the range of z-coordinate values that they covered.

Table 3.3. Peak groups selected for charge frequency distribution

Major Peaks     Histogram Bin
Group 1         20 – 25
Group 2         38
Group 3         50
Group 4         56

Minor Peaks     Histogram Bin
Group 5         58 – 60
Group 6         26 – 28
Group 7         29 – 32
Group 8         34
Group 9         37
Group 10        39 – 42
Group 11        44 – 48
Group 12        52 – 55

Two categories of peaks were used, major and minor, reflecting the geometry of the histogram. As can be seen in the histogram, major peaks correspond to residues whose z-coordinates occur at least 180 times, and minor peaks to residues whose coordinates occur far less frequently. This marked disparity in frequency suggests that major peaks most likely correspond to amino acids that are very well conserved across GPCRs, whereas minor peaks indicate charges that have not been conserved throughout evolution. Consequently, locations corresponding to salient transmembrane charge regions were searched for documentation of functional role in the literature. This was done by performing a text search that matched all the z-coordinates of a peak to their respective residues and then counting the frequency of each of the five charged groups so that the peak could be identified in terms of amino acid population. Once the residues producing the charge peaks from the GPCR models had been identified, web repositories were searched for functional annotations. A Python script was used to return a list of the z-coordinates of charged residues belonging to one of the individual peaks (single z value) or peak clusters (range of z values) shown in table 3.3. The output generated by this script facilitated the literature cross-referencing by listing the residues that each of the peaks describes. Figure 3.3 gives an example of the generated output as entries in PDB files.

Figure 3.3. Sample output of histogram analysis script

The output follows the generic structure of any PDB file, listing the atom (CG) followed by the residue type (ASP), and the polypeptide chain identifier (A). The three numbers that follow are respectively the x, y, z spatial coordinates of the atom; in the current context, the z values (final column) were the ones being used to plot frequency.


3.2.2 Characterizing Major Histogram Peaks

Due to the approach used in this study, which was based on the spatial location of ionized atoms located on an imaginary plane cutting through a GPCR, residues that have different primary sequence positions (and would appear on different columns of a sequence alignment) could end up in the same plane. In particular, the alternating helix conformation of the 7TM implies that a given z-plane could traverse all helices of a GPCR model. Hence it was necessary to account for this and extract all side-chains whose ionizable atom falls within a certain plane (represented in the histogram as a bin). This was achieved using Unix text filters to query the dataset according to either a particular histogram bin or a known conserved residue; details are provided in appendix B.
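The study performed this query with Unix text filters (appendix B). Purely as an illustration, an equivalent query can be sketched in Python; the record layout `(model, residue name, residue number, z)` and the function name `residues_in_bins` are assumptions made for this example.

```python
import math
from collections import Counter


def residues_in_bins(records, first_bin, last_bin):
    """Tally residues whose ionizable atom lies in bins first_bin..last_bin.

    Each record is (model, residue name, residue number, z). Residues at
    different sequence positions can share a bin, so the tally is keyed
    by (residue name, residue number).
    """
    tally = Counter()
    for _model, res_name, res_num, z in records:
        if first_bin <= math.floor(z) <= last_bin:
            tally[(res_name, res_num)] += 1
    return tally
```

Calling `residues_in_bins(records, 38, 38).most_common()` would list the residues that populate bin 38, most frequent first.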

Peak Group 1: Histogram Bins 20 – 25

R131: The major peak occurring in bin 22 corresponds to the arginine residue of the highly conserved DRY motif (Arg-131) located at the junction of transmembrane helix III and ICL2 of GPCRs (Rhee et al., 1999) immediately adjacent to an aspartic acid (see below) residue that functions as a micro-switch in the receptor-activation mechanism. Arg-131 occurred in 477 (96%) dataset models.

D130: The major peak occurring in bin 25 is indicated as corresponding to the [D/E]RY motif, a well conserved sequence on helix III that is crucial in the formation of the hydrogen bonded network (“ionic-lock”) between a highly conserved aspartic acid (Asp-130 residue) and a partially conserved glutamic acid residue (Glu-247) on helix VI (Vanni et al., 2010; Scheerer et al., 2008). Asp-130 occurred in 424 (85%) dataset models.

Peak Group 2: Histogram Bin 38

D79: The major peak occurring at 38.0 was confirmed as corresponding to a highly conserved aspartic acid residue (D2.50 or the Asp-79), which is known to act as a protonation-induced switch that activates diffusible-ligand GPCRs (Vanni et al., 2010). Asp-79 was recorded in 465 (93%) dataset models.


Peak Group 3: Histogram Bin 50

D113: This residue corresponds to the peak at 50.0, and is annotated as being a conserved agonist/antagonist binding site for adrenaline (Strader et al., 1988) located on helix III. This binding site is conserved in amine binding receptors and is known to be functional through its role in dermatologic conditions (Schaullreuter et al., 2007). Asp-113 occurred in 265 (53%) dataset models.

As was anticipated, most of the major histogram peaks were highly conserved residues with well-established functional annotations. The bins 20 – 23, 56 and 58 – 60 all had large peaks which could not be mapped with certainty to a known conserved or functional residue. The correspondence of major peaks in the histogram plots to conserved residue locations is summarized in table 3.4. Charge networks (clusters of charged residues located near each other), apart perhaps from the adjacent Asp-130 and Arg-131 of the DRY motif, were not observed in the dataset.

Table 3.4. Matching Histogram Peaks to Conserved Residues

Histogram Bin   Residue   Occurrence in Dataset   Reference(s)
22              R131      96%                     Rhee et al., 1999
25              D130      85%                     Vanni et al., 2010; Scheerer et al., 2008
38              D79       93%                     Vanni et al., 2010
50              D113      53%                     Strader et al., 1988

3.3 Shortlisting charged residues for pKa calculations

Although extracting the amino acid residues relevant to the defined peak clusters greatly reduced the number of charged residues contained in the dataset, they were still too many for the purposes of this study and further downsizing was required. The first downsizing involved further restricting the range of z coordinates taken to be valid transmembrane territory; in effect, narrowing the window of allocated TM space so as to exclude charged residues that might potentially exhibit snorkeling. Snorkeling is the phenomenon by which lysine and arginine side-chains (positively charged) that are in close proximity to the transmembrane – water interfaces can place their positive charge near negatively charged phospholipid groups (Kim et al., 2011). The empirically defined transmembrane space (19.7 – 60.9 Å) was narrowed to the interval 30 – 54 Å so as to exclude major peaks at the beginning (bins 20 – 25 in histogram) and the end (bins 55 – 60 in histogram) of the TM domain. In this manner, remaining peaks consisted of low frequency (<100) spatial coordinates (with the exception of two high frequency coordinates, i.e. histogram bins 38 and 50, which were excluded on the basis of corresponding to well-conserved functional residues). For all peaks within the newly defined margin, the frequency of all amino acids whose ionizable atom occurred within one of the peaks was obtained using the method outlined in appendix B. Once the residues accounting for peaks within bins 30 – 54 were extracted, the resulting list was further filtered to include only those amino acids occurring more than 10 times. This provided a way to essentially “zoom in” on the small peaks of figure 3.2, and to select those for which pKa calculations would be carried out. The filtering steps are outlined in figure 3.4 below, while table 3.5 lists the residues that were selected for pKa calculations.


Figure 3.4. Filtering residues for pKa calculations

Initial GPCR Dataset: ICL/TM/ECL

Step 1: 7TM domain

Step 2: Smaller TM window

Step 3: zoom in on minor peaks
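The filtering steps of figure 3.4 can be expressed as a short sketch. The input layout, a mapping from `(residue, bin)` to the number of models, and the function name `shortlist` are assumptions; the thresholds are those stated in the text. In the usage example the first three entries use counts reported in this chapter, while the histidine entry is invented for illustration.

```python
def shortlist(counts, window=(30, 54), excluded_bins=(38, 50), min_models=10):
    """Filter a {(residue, bin): model count} mapping as in figure 3.4."""
    low, high = window
    kept = {}
    for (residue, z_bin), n_models in counts.items():
        if not low <= z_bin <= high:
            continue  # step 2: outside the narrowed TM window
        if z_bin in excluded_bins:
            continue  # conserved major peaks, already characterized
        if n_models > min_models:
            kept[(residue, z_bin)] = n_models  # step 3: seen in >10 models
    return kept


counts = {
    ("ARG A 131", 22): 477,  # DRY motif arginine, outside the window
    ("ASP A 79", 38): 465,   # conserved aspartate, excluded bin
    ("GLU A 80", 37): 13,    # minor peak from table 3.5
    ("HIS A 200", 45): 4,    # invented entry, too rare to be kept
}
print(shortlist(counts))
```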


Table 3.5. Minor Peak Shortlisted Residues

Residue      Histogram bin   Frequency*
GLU A 80     37              13
GLU A 122    40              21
GLU A 120    41              11
ARG A 109    52              12
GLU A 40     52              21
LYS A 90     52              11
ARG A 35     53              20

*residues occurring in >10 dataset models were considered

The eight residues listed in table 3.5 were selected for performing pKa calculations. Taking into account that each of these residues occurs in multiple homology models, it was considered appropriate to investigate whether there existed any co-occurrence, meaning that a single homology model would contain two or more of the shortlisted residues. For example, a GPCR model might contain both the glutamic acid at position 80 and the lysine at position 90. The rationale behind this was to single out specific homology models that could be used for the pKa calculations, and models where the selected residues co-occurred would be prime targets. However, no co-occurrence was found and consequently the actual dataset models used for further analysis were selected based simply on possessing the shortlisted residue; i.e. any of the 13 models possessing a glutamic acid at position 80 (table 3.5) was considered valid. More comprehensive statistical techniques for investigating the occurrence of residues across multiple dataset models, such as mutual information, were not considered necessary for assessing co-occurrence of ionizable groups, since a screen for co-location of the groups highlighted by the model building and pKa calculations gave no indication of any substantial clustering.
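A co-occurrence screen of this kind can be sketched as a pairwise intersection of model sets. The input layout, a mapping from each shortlisted residue to the set of models containing it, and the function name `co_occurring` are assumptions made for this illustration.

```python
from itertools import combinations


def co_occurring(models_by_residue):
    """Return {(residue_a, residue_b): shared models} for overlapping pairs.

    models_by_residue maps a residue label to the set of model names in
    which that residue occurs; pairs with no shared model are omitted.
    """
    shared = {}
    for a, b in combinations(sorted(models_by_residue), 2):
        common = models_by_residue[a] & models_by_residue[b]
        if common:
            shared[(a, b)] = common
    return shared
```

An empty result corresponds to the outcome reported here, namely that no model contained two of the shortlisted residues.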

3.4 pKa Predictions

Having singled out the residues corresponding to the minor peaks of the charge histogram (figure 3.2), pKa calculations were used to assess their charge states. Subsequently, database and literature searches were performed to identify any functional annotations relevant to these charge locations. Table 3.6 summarizes the calculations performed for each residue as described in section 2.5.


Table 3.6. pKa Predictions

Method                        FD     FD     FD     FD     FDDH   FDDH   FDDH   FDDH
Dielectric Value              4      10     4      10     4      10     4      10
Membrane Slab                 No     No     Yes    Yes    No     No     Yes    Yes

GPCR Model*         Residue   Predicted pKa
tem1_mod272.pdb     E80       5.6    5.3    15.27  9.09   4.6    4.54   14.67  8.77
tem1_mod15.pdb      E122      7.14   6.06   14.2   8.57   4.62   4.61   13.42  8.27
tem1_mod331.pdb     E120      8.23   6.76   8.63   7      4.4    4.78   4.82   4.79
tem1_mod87.pdb      R109      10.31  11.05  3.62   9.11   12.03  11.99  3.8    8.54
tem1_mod359.pdb     D35       3.77   4.02   10.2   7.08   3.82   3.9    10.02  6.94
tem1_mod447.pdb     E40       7.14   5.84   10.8   7.52   4.57   4.6    10.43  7.31
tem1_mod41.pdb      K90       10.95  10.91  10.95  10.98  10.59  10.64  10.59  10.57
tem1_mod224.pdb     R35       11.98  11.94  3.13   8.13   11.96  11.97  3.15   8.06

* The output of the side-chain replacement pipeline named models in the format temX_modY.pdb

3.4.1 Database and Literature Searches

All residue numbers that were collected for pKa calculations originated from homology models and not actual proteins, and as such could not be used to perform searches for functional annotations. In order to “map” these residues onto real protein sequences, and hence obtain correct residue numbers, the twelve amino acids preceding and succeeding each residue in its model were extracted. A protein BLAST search was then performed restricting all results to human sequences, and the position containing the same residue nearest to the corresponding number in the homolog was used as a query to perform extensive literature searches. This process is illustrated in table 3.7. In certain cases, a matching residue was not found, while in others the sequence was not complete.


Table 3.7. Mapping model residues onto wild-type GPCRs

Residue   Local sequence*                        Top BLAST Hit**                 Uniprot ID   Residue number in hit   Annotation
E80       TNIYLLNLAVAD – E – LFMLSVPFVASS        4                               P31391       E94                     Yes
E122      TSIDVLCVTASI – E – TLCVIAVDRYFA        Beta-2 adrenergic receptor      Q9BYZ0       E129                    No
E120      MVPFVQSTAVVT – E – ILTMTCIAVERH        GPCR 103                        J3KNR3       E132                    No
R109      DIWYGEVFCLV – R – TSLDVLTTASIF         5-hydroxytryptamine4 receptor   Q13639       R118                    Yes
D35       DE – D – GIALAGIMLVCGIGNFVFIAAL***     cDNA FLJ76003                   A8K1T0       D34                     No
E40       AYIGI – E – VLIAALVSVPGNVLVIWAV***     variant                         D2CGD0       None                    No
K90       SDLLVAVLVMPW – K – AVAEIAGFWPFG        DRD1 protein                    Q6FH34       K81                     Yes
R35       AGA – R – LALLIVATVLGNALVMLAFVA***     isoform 5                       Q8NI50       R27                     No

* The 12 amino acids preceding and following the shortlisted residue (shown in bold)
** Default BLAST parameters used
*** Denotes cases where preceding sequence was missing in the PDB file. To compensate for this, more residues from the sequence following the residue number were taken.
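The construction of these local query sequences can be sketched as follows. The function name `flanking_query` is hypothetical, and the rule that a shortfall on the preceding side is made up by the same number of extra following residues is an assumption, although it reproduces the lengths of the compensated rows (D35, E40 and R35).

```python
def flanking_query(sequence, index, flank=12):
    """Return (preceding, residue, following) around a 0-based index.

    Up to `flank` residues are taken on each side. If fewer than `flank`
    residues precede the position, the following stretch is lengthened by
    the shortfall so that the query keeps a comparable total length.
    """
    before = sequence[max(0, index - flank):index]
    shortfall = flank - len(before)
    after = sequence[index + 1:index + 1 + flank + shortfall]
    return before, sequence[index], after


# Start of the model sequence around D35, as listed in table 3.7.
model_start = "DED" + "GIALAGIMLVCGIGNFVFIAAL"
print(flanking_query(model_start, 2))
```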

As was discussed in the introduction (section 1.4), the charge state of ionizable side-chains is dependent upon their pH/pKa relationship. For convenience, these relationships are reiterated in table 3.8 (identical to table 1.4) below. The pH range in which all pKa calculations were performed was 7.0 – 7.5.


Table 3.8. pH-dependent variation of charge states (see table 1.4)

Amino Acid            Ionizable side-chain group     pH / pKa Relationship   Charge State
Aspartic Acid (Asp)   Carboxyl (-COOH)               pH > pKa                Negative (-)
                                                     pH < pKa                Neutral*
Glutamic Acid (Glu)   Carboxyl (-COOH)               pH > pKa                Negative (-)
                                                     pH < pKa                Neutral*
Lysine (Lys)          Amino (-NH2)                   pH > pKa                Neutral
                                                     pH < pKa                Positive (+)
Arginine (Arg)        Guanidinium -HNC(NH2)2         pH > pKa                Neutral
                                                     pH < pKa                Positive (+)
Histidine (His)       Imidazole -(CH)2N(NH)CH        pH > pKa                Neutral
                                                     pH < pKa                Positive (+)
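The relationships in table 3.8 can be written as a small function, shown here together with the Henderson-Hasselbalch estimate of the mean side-chain charge. The function names are hypothetical, the tie case pH = pKa (which the table does not cover) is arbitrarily reported as neutral, and the sketch does not reproduce the overall charge-state judgments made later from the full range of predictions.

```python
ACIDIC = {"ASP", "GLU"}        # neutral when protonated, negative otherwise
BASIC = {"LYS", "ARG", "HIS"}  # positive when protonated, neutral otherwise


def charge_state(res_name, pka, ph):
    """Dominant charge state of a side-chain, following table 3.8."""
    if res_name in ACIDIC:
        return "negative" if ph > pka else "neutral"
    if res_name in BASIC:
        return "positive" if ph < pka else "neutral"
    raise ValueError("not an ionizable residue: " + res_name)


def mean_charge(res_name, pka, ph):
    """Average charge from the Henderson-Hasselbalch relationship."""
    if res_name in ACIDIC:
        return -1.0 / (1.0 + 10 ** (pka - ph))  # fraction deprotonated
    if res_name in BASIC:
        return 1.0 / (1.0 + 10 ** (ph - pka))   # fraction protonated
    raise ValueError("not an ionizable residue: " + res_name)
```

For instance, `charge_state("GLU", 4.6, 7.0)` gives `"negative"`, whereas `charge_state("GLU", 15.27, 7.0)` gives `"neutral"`.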

As outlined in section 2.5, pKa calculations for the shortlisted low-frequency charged groups were performed using both the traditional FD and combined FD/DH methods to incorporate both Poisson-Boltzmann and Debye-Hückel descriptions of ionizable side-chain interactions. These were done under two environments, one with a pseudo-membrane (cylindrical slab) and one with no pseudo-membrane, and at two dielectric values: the lower value (diel = 4) assigned to the FD calculations and the higher value (diel = 10) allowing for lipid-like environments in which water may cluster around buried charges.

3.4.2 Predicted pKa Values and Charge States for Residues

In total, annotations were found for three out of the eight residues for which pKa predictions were performed, and only two shortlisted residues were predicted to have charged side-chains based on their projected pKa values. Almost all residues fell within the 7TM domain of their respective model, and when they were not found within the transmembrane region they were located very near to the closest helix. Table 3.9 summarizes these findings followed by an interpretation for each residue.

Table 3.9. Predicted charge states of residues

Residue in dataset model   Residue in top BLAST hit   Predicted charge state*   Literature Annotation
E80                        E94                        Neutral                   Yes
E122                       E129                       Neutral                   No
E120                       E132                       Negative                  No
R109                       R118                       Uncertain                 Yes
D35                        D34                        Uncertain                 No
E40                        None                       Neutral                   No
K90                        K81                        Positive                  Yes
R35                        R27                        Uncertain                 No


E80: Both the FD and FD/DH predictions differed significantly between the presence and absence of the pseudo-membrane. In the absence of the slab, the predicted pKa values were marginally higher for low dielectric environment, while when the slab was present the lower dielectric pKas were substantially higher. In all cases, the glutamic acid residue mapped to residue number 94 shows significantly larger values when the slab is present compared to those when it is absent. The pKa values vary from below pH 7.0 – 7.5 to significantly above, suggesting that this residue is most likely neutral. An annotation (source: GPCRDB) was found locating E94 on helix II facing inward toward helix I. Figure 3.5 shows that in dataset model tem1_mod272, E80 is localized well within the TM region, and this is in agreement with UniProtKB sequence annotations for human somatostatin type 4 (P31391).

Figure 3.5. Residue E80 in dataset model

E80 is highlighted in red located on TM helix II

E122: Both the FD and FD/DH methods differed significantly between the presence and absence of the pseudo-membrane. In the absence of the slab, the predicted pKa values were marginally higher for low dielectric environment, while when the slab was present the lower dielectric pKas were substantially higher. In all cases, the glutamic acid residue mapped to residue number 129 shows significantly larger values when the slab is present compared to those when it is absent. As in E80, its pKa values vary from below pH 7.0 – 7.5 to significantly above, suggesting that this residue is most likely neutral. No literature annotations were found for this residue. Figure 3.6 shows that in dataset model tem1_mod15, E122 is located within the TM domain on helix III.

Figure 3.6. Residue E122 in dataset model

E122 is highlighted in red located on TM helix III

E120: The FD method predicted higher pKa values for lower dielectric under both the presence and absence of the pseudo-membrane. In contrast the FD/DH predictions were similar under all conditions in this case. Since most predictions yielded pKa values less than the considered pH range, the glutamic acid residue mapped to residue number 132 is indicated as possessing a negatively charged side-chain. However, no literature annotations were found for this residue. Figure 3.7 shows that in dataset model tem1_mod331, E120 is located within the TM domain on helix III.


Figure 3.7. Residue E120 in dataset model

E120 is highlighted in red located on TM helix III

R109: This residue exhibits a potentially interesting trend. Both FD and FD/DH methods predict its pKa to be acidic under the low dielectric, pseudo-membrane environment while in all other cases the pKa is basic and significantly higher. The arginine residue mapped to position 118 exhibits wide variation in pKa values; hence a solid prediction on whether it is charged or neutral cannot be made. However, a functional annotation was found (source: GPCRDB) indicating that it is localized close to the TM – ICL boundary on helix III and points to the core of the protein. R118 is suggested as forming a salt bridge with an aspartic acid on the same helix. Figure 3.8 shows that in dataset model tem1_mod87, R109 is located adjacent to the helix III/ICL2 boundary.


Figure 3.8. Residue R109 in dataset model

R109 is highlighted in red at the helix III/ICL2 boundary

D35: The FD and FD/DH methods gave similar predictions: acidic pKa values in the absence of the pseudo-membrane environment and basic pKa values when the slab was present at low dielectric. The high dielectric scenarios with the slab yielded values within the neutral pH range. Again, this variability renders the charge state of the aspartic acid residue mapped to residue number 34 difficult to judge; no literature annotations were found either. Interestingly, in dataset model tem1_mod359, D35 appears to be located outside the TM regions, as is illustrated in figure 3.9.

Figure 3.9. Residue D35 in dataset model

D35 highlighted in red is located just outside the 7TM domain.


E40: The predictions of FD and FD/DH methods were in agreement for pseudo-membrane conditions across both dielectric environments. However, the predictions differ in the absence of the slab, where FD shows a higher pKa for the low versus the high dielectric environment whereas FD/DH predicts almost identical values. This glutamic acid residue was not found in the sequence of the highest-scoring BLAST hit (table 3.7, row 7) and the projected pKa values in this case suggest overall neutrality. It is indicated as being located just within the TM region in its corresponding dataset model.

Figure 3.10. Residue E40 in dataset model

E40 highlighted in red


K90: The lysine residue mapped to residue number 81 is unique in having very similar projected pKa values across both conditions (dielectric, slab) for both FD and FD/DH methods. A structural annotation (source: GPCRDB) indicates that it is localized close to the helix II/ECL1 boundary and faces inward towards the core of the protein, and this is reflected in its homology model (figure 3.11).

Figure 3.11. Residue K90 in dataset model

K90 shown in red at TM2/ECL1 boundary

R35: The arginine residue mapped to residue number 27 displays a similar trend to the Arg-118 residue (R109 in table 3.6); FD and FD/DH methods both predict its pKa to be acidic under the low dielectric, pseudo-membrane environment while in all other cases the pKa is basic and significantly higher, preventing a concrete prediction concerning its charge state from being made. No literature annotations were identified for this residue. It is located near the start of helix 1 of the 7TM (figure 3.12).

Figure 3.12. Residue R35 in dataset model

R35 appears at the start of helix 1

Chapter 4. Discussion

The main objective of this project was to identify transmembrane charges in GPCRs and to ascertain whether their locations possessed any functional roles. Such an endeavor cannot be limited to sequence-level analysis, and is inevitably structural, as the charge states of ionizable groups were assessed using pKa calculations. The underlying hypothesis was that charged locations within the TM domain may relate to functional sites, since burying charged side-chains away from water is energetically unfavorable unless a solvation network to replace the water is available. Perhaps the most important issue in such studies is the lack of adequate structural data; although a suitable template structure was identified (β2 adrenergic receptor), comparative modeling was necessary to generate a dataset large enough for studying the distribution of ionizable groups and their locations. However, since breakthroughs in crystallography cannot currently be expected to keep pace with sequence characterization, computational methods are becoming increasingly necessary in the effort to better understand the mechanistic details of GPCR function.

4.1 7TM Charge Distribution

The distribution of ionizable groups throughout the GPCR dataset exhibited a fairly anticipated shape: two large curves representing the intracellular and extracellular loops, as well as a nearly flat region representing the transmembrane helices, with three distinct peaks. Most of the high frequency peaks could be identified as residues known to be conserved; however, the majority of peaks in the histogram were medium to low frequency (>100 occurrences). One of the most important findings of the distribution of ionizable atoms of charged side-chains (figure 3.2) is that the largest TM peaks occur close enough to the water interface to allow for side-chain “snorkelling”, a term used to describe the tendency of positively charged residues (lysine and arginine) to move their side-chains close to the negative charge of the phospholipid bilayer. This phenomenon has been observed in other classes of proteins possessing transmembrane domains, most notably integrin proteins (Kim et al., 2011), and may contribute to some of the medium frequency peaks (~200) that occur close to the empirically defined boundaries of the 7TM domain.

Alternatively, medium frequency peaks could be accounted for by subfamily-specific residues that depend on a particular class of ligands. The dataset created from the multiple sequence alignment included GPCR sequences across several members of the rhodopsin family (of which adrenergic, serotonin, and dopamine families were the most common and account for 34% of the dataset); an ionizable side-chain occurring at a conserved location in one receptor family would yield a large peak for that particular family unless this location was conserved throughout the entire GPCR spectrum. One intriguing result illustrated in the charge frequency histogram is the peak in bin 50, which was matched to an aspartic acid residue (Asp-113). This residue is known to be conserved in amine-binding GPCRs, e.g. adrenergic, serotonin, dopamine, and muscarinic receptors. These receptor families collectively comprise 184 of the 498 PSI-BLAST hits (table 3.2), or 37% of the dataset; however, the side-chain corresponding to this amino acid occurs in 265 models (53% of the dataset). This disparity is interesting, and raises the question of whether a multiple template modeling approach should have been used, although this would have been difficult given the scarcity of available structure templates (table 1.2). In general, the correspondence of peak intensity on the charge histogram to conservation levels of matched residues was consistent; however, D113 presents a discrepancy that would merit reassessing the PSI-BLAST method of generating the current dataset. Family weighting biases arising from overrepresentation of one type of sequence are one aspect that future work would likely focus on, and one way of dealing with these would be to aim for a larger dataset.
Alternatively, the original dataset could be divided into smaller datasets, each comprising the sequences of one receptor family, with ionizable group frequencies then plotted separately for each family. Although such a method may enable the detection of functional sites specific to receptor families, the datasets would be significantly smaller; hence low frequency transmembrane charged residues (e.g. the minor peaks of the charge histogram) may occur too sparsely to be considered for pKa calculations.
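The per-family split suggested above could be prototyped along the following lines. This is a minimal sketch: the record layout (a family label paired with a histogram bin index for each ionizable atom), the function names and the toy data are all hypothetical and do not correspond to the files produced in this work.

```python
from collections import Counter, defaultdict


def family_histograms(records):
    """Count ionizable-group occurrences per histogram bin, separately per family.

    `records` is an iterable of (family, bin_index) pairs, one per ionizable
    atom found in a model (an assumed, simplified input format).
    """
    histograms = defaultdict(Counter)
    for family, bin_index in records:
        histograms[family][bin_index] += 1
    return histograms


def sparse_bins(histogram, min_count):
    """Bins whose count falls below `min_count`, i.e. too sparse to follow up."""
    return sorted(b for b, n in histogram.items() if n < min_count)


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    toy = [("adrenergic", 50), ("adrenergic", 50), ("dopamine", 50),
           ("dopamine", 12), ("serotonin", 50)]
    for family, hist in sorted(family_histograms(toy).items()):
        print(family, dict(hist), sparse_bins(hist, min_count=2))
    # Proportions quoted in the text: amine-binding hits, and models carrying
    # the Asp-113 side-chain, out of 498 PSI-BLAST hits.
    print(round(percent(184, 498)), round(percent(265, 498)))
```

The `sparse_bins` helper makes the trade-off explicit: the smaller each family-specific dataset, the more bins fall below whatever minimum count is chosen as the threshold for pKa calculations.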

4.2 Assessment of the Hypothesis and Alternative Approaches

The underlying hypothesis that ionizable groups located in the transmembrane domain are functionally significant was tested by reducing the number of target ionizable groups and attempting to cross-reference their locations in the literature. Out of the eight shortlisted residues, only three were found to have literature annotations.

More importantly, two were predicted to be charged based on their predicted pKa values, while of the other six, three were projected as neutral and the other three showed too much variation in pKas for an accurate prediction to be made. K81 and E132 were predicted as having their side-chains charged; however, neither of them was found to be functionally annotated. Interestingly, residue R118, whose charge state could not be assessed due to the disparity in projected pKas between high and low dielectric environments for both FD and FD/DH schemes, was found to be functionally important. This result suggests that large variation between different dielectric environments may indicate that the side-chain is responding to changes in electric fields and hence is electrically active. In searching for functional hotspots using the criterion of ionizable groups, it is perhaps useful to consider a reverse strategy for further studies. Specifically, instead of shortlisting target residues for pKa calculations and then cross-referencing those with GPCR-relevant literature, it may be insightful to work backwards by performing literature searches prior to any modelling. Besides the well-established ligand binding (extracellular domain) and G protein coupling (intracellular domain) activities of GPCRs, other interactions involving the transmembrane domain are known to be common. GPCR dimerization is known to occur frequently and to involve transmembrane helices (Kim et al., 2009). If the distribution of ionizable groups within a dataset is reasonably in line with experimentally derived annotations, then the dataset could be expanded to include non-human sequences and would be able to cover a considerably larger sequence pool. Although this approach would not enrich knowledge of functional areas, it would provide a way of benchmarking the biological accuracy of charge distribution plots prior to generating datasets.
A different interpretation of the trends in charge distribution is that this may be the energetic tail of a Boltzmann distribution of GPCR folding, which must accommodate the insertion energies of the seven transmembrane helices. Just as they dictate that a protein will adopt the energetically most favorable (lowest energy) conformation, Boltzmann statistics maintain that a small fraction of proteins will be unfolded at any given time. In the context of the current dataset, this implies that the distribution of charged atoms can be accounted for by a fraction of GPCRs that have adopted a less optimal fold, one in which charged residues were inserted into TM regions. The occurrence of such residues within lipid bilayers is hard to explain due to the energetic difficulties in accommodating their side-chains in a hydrophobic environment. The cost of burial for an ionizable group (which is the energy cost of neutralizing an otherwise charged group) is capped by the difference between the normal pKa of that group and the surrounding pH, as opposed to the actual shift in pKa value under different environments. Practically, this means that any charged residue placed within a TM region would introduce a destabilisation with a quantifiable energetic penalty (measured in kJ/mol). However, because the TM region is considerably long (~25 residues per helix), the energetic penalty for a misplaced residue may be low enough that a certain number of misplacements can be accommodated. If this is the case, then it is possible that the energetic penalty, which is interpreted under the null hypothesis as a functional signature, falls within the allowed tail of variation in TM insertion energy that has occurred throughout evolution.
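As a rough worked example of the burial cost described above, the free energy of neutralizing a group can be estimated as ΔG = 2.303·RT·|pKa − pH|, and the corresponding Boltzmann factor exp(−ΔG/RT) gives the relative population of the penalized state. The sketch below assumes a temperature of 298.15 K, a pH of 7.0 and a textbook-style model pKa of 4.0 for an aspartate side-chain. These values and the function names are illustrative assumptions rather than quantities computed in this work.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def neutralization_cost(model_pka, ph, temperature=298.15):
    """Upper bound on the burial cost of an ionizable group, in kJ/mol.

    Uses dG = ln(10) * R * T * |pKa - pH|, the free energy needed to hold the
    group in its neutral form at the given pH.
    """
    return math.log(10.0) * R_KJ_PER_MOL_K * temperature * abs(model_pka - ph)


def boltzmann_factor(delta_g, temperature=298.15):
    """Relative population exp(-dG / RT) of a state lying dG (kJ/mol) above the reference."""
    return math.exp(-delta_g / (R_KJ_PER_MOL_K * temperature))


if __name__ == "__main__":
    cost = neutralization_cost(model_pka=4.0, ph=7.0)  # assumed Asp-like group
    print(f"{cost:.1f} kJ/mol")
    print(f"{boltzmann_factor(cost):.1e}")
```

For these assumed values the cost is about 17 kJ/mol, corresponding to a Boltzmann factor of roughly 10^-3, which gives a sense of the scale of penalty that a long TM helix would need to absorb.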

4.3 Conclusions

It is clear that ionizable groups do coincide with functional residues of GPCRs to a certain extent, although it is not clear whether this can be used to drive mutagenesis experiments for the discovery of novel functional areas. If the low frequency peaks used for pKa calculations represent such areas, and this can be verified experimentally, then similar methodologies could be applied to other classes of proteins with structural transmembrane domains.

1.5 References

Alexov, E., Mehler, E.L., Baker, N., Baptista, A.M., Huang, Y., Miller, F., Nielsen, J.E., Farrell, D., Carstensen, T., Olsson, M.H., Shen, J.K., Warwicker, J., Williams, S., Word, J.M. (2011). Progress in the prediction of pKa values in proteins. Proteins. 79(12), 3260-3275

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol. 215(3), 403-410

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389-3402

Attwood, T.K. (2002). The PRINTS database: a resource for identification of protein families. Brief Bioinform. 3(3), 252-263

Attwood, T.K. and Findlay, J.B. (1994). Fingerprinting G-protein-coupled receptors. Protein Eng. 7(2), 195-203

Bashford, D. and Karplus, M. (1990). pKa’s of ionizable groups in proteins: atomic detail from a continuum electrostatic model. Biochemistry 29(44), 10219-10225

Cole, C. and Warwicker, J. (2002). Side-chain conformational entropy at protein-protein interfaces. Protein Sci. 11(12), 2860-2870

Doolittle, R.F. (1986). Of URFs and ORFs: A primer on how to Analyze Derived Amino Acid Sequence. Mill Valley, CA.

Dyson, J.H. and Wright, P.E. (2005). Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 6(3), 197-208

Flower, D.R. (1999). Modelling G-protein-coupled receptors for drug design. Biochim. Biophys. Acta. 1422(3), 207-234

Fredriksson, R. and Schiöth, H.B. (2006). Ligand Design for G Protein-coupled Receptors. Weinheim.

Fredriksson, R., Lagerström, M.C., Lundin, L., Schiöth, H.B. (2003). The G-Protein-Coupled Receptors in the Human Genome Form Five Families. Phylogenetic Analysis, Paralogon Groups, and Fingerprints. Mol Pharmacol 63(6), 1256-1272

Ginalski, K. (2006). Comparative modeling for protein structure prediction. Curr Opin Struct Biol. 16(2), 172-177.

Greaves, R. and Warwicker, J. (2005). Active site identification through geometry-based and sequence profile-based calculations: burial of catalytic clefts. J Mol Biol. 349(3), 547-557

Gu, J. and Bourne, P.E. (2009). Structural Bioinformatics. Hoboken, NJ.

Guex, N. and Peitsch, M.C. (1997). SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modelling. Electrophoresis 18, 2714-2723.

Hawtin, S.R., Simms, J., Conner, M., Lawson, Z., Parslow, R.A., Rim, J., Sheppard, A., Wheatley, M. (2006). Charged extracellular residues, conserved throughout a G-protein-coupled receptor family, are required for ligand binding, receptor activation, and cell-surface expression. J Biol Chem. 281(50), 38478-38488

Iakoucheva, L.M., Brown, C.J., Lawson, J.D., Obradović, Z., Dunker, A.K. (2002). Intrinsic Disorder in Cell-signaling and Cancer-associated Proteins. J Mol Biol. 323(3), 573-584

Iakoucheva, L.M., Radivojac, P., Brown, C.J., O'Connor, T.R., Sikes, J.G., Obradović, Z., Dunker, A.K. (2004). The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 32(3), 1037-1049

Jaakola, V.P., Prilusky, J., Sussman, J.L., Goldman, A. (2005). G protein-coupled receptors show unusual patterns of intrinsic unfolding. Protein Eng Des Sel. 18(2), 103-110

Josefsson, L.G. (1999). Evidence for kinship between diverse G-protein coupled receptors. Gene. 239(2), 333-340

Junier, T. and Pagni, M. (2000). Dotlet: diagonal plots in a web browser. Bioinformatics. 16(2), 178-179.

Karlin, S. (1995). Statistical significance of sequence patterns in proteins. Current opinion in structural biology 5(3), 360-371

Katritch, V., Cherezov, V., Stevens, R.C. (2012). Diversity and modularity of G protein-coupled receptor structures. Trends Pharmacol Sci. 33(1), 17-27

Kim, C., Schmidt, T., Cho, E.G., Ye, F., Ulmer, T.S., Ginsberg, M.H. (2011). Basic amino-acid side chains regulate transmembrane integrin signalling. Nature. 481(7380), 209-213.

Kim, C., Lee, B.K., Naider, F., Becker, J.M. (2009). Identification of specific transmembrane residues and ligand-induced interface changes involved in homo-dimer formation of a yeast G protein-coupled receptor. Biochemistry. 48(46), 10976-10987

Klapper, I., Hagstrom, R., Fine, R., Sharp, K., Honig, B. (1986). Focusing of electric fields in the active site of Cu-Zn superoxide dismutase: effects of ionic strength and amino-acid modification. Proteins 1(1), 47-59

Koehl, P. and Delarue, M. (1994). Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J Mol Biol. 239(2), 249-275.

Kolakowski, L.F. Jr. (1994). GCRDb: a G-protein-coupled receptor database. Receptors Channels 2(1), 1-7

Lagerström, M.C. and Schiöth, H.B. (2008). Structural diversity of G protein-coupled receptors and significance for drug discovery. Nat Rev Drug Discov. 7(4), 339-357

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., Fitzhugh, W. (2001). The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409(6822), 860–921

Lie, D.C., Colamarino, S.A., Song, H.J., Désiré, L., Mira, H., Lein, E.S., Jessberger, S., Lansford, H., Dearie, A.R., Gage, F.H. (2005). Wnt signaling regulates adult hippocampal neurogenesis. Nature 437(7063), 1370-1375

Liu, G., Liu, T., Mal, S.S., Kortz, U. (2006). Wheel-shaped polyoxotungstate [Cu20Cl(OH)24(H2O)12(P8W48O184)]25- macroanions form supramolecular “blackberry” structure in aqueous solution. J Am Chem Soc. 128(31), 10103-10110.

Logan, C.Y. and Nusse, R. (2004). The Wnt signaling pathway in development and disease. Annual Rev Cell Dev Biol. 20, 781-810

Magee J. and Warwicker J. (2005). Simulation of non-specific protein-mRNA interactions. Nucleic Acids Res. 33(21), 6694-6699

Marchese, A., Sawzdargo, M., Nguyen, T., Cheng, R., Heng, H.H., Nowak, T., Im, D.S., Lynch, K.R., George, S.R., O’Dowd, B.F. (1999a). Discovery of three novel orphan G-protein-coupled receptors. Genomics 56(1), 12-21

Meldrum, B.S. (2000). Glutamate as a neurotransmitter in the brain: review of physiology and pathology. J Nutr 130(4S Suppl), 1007S-1015S

Neves-Petersen, M.T. and Petersen, S.B. (2003). Protein electrostatics: A review of the equations and methods used to model electrostatic equations in biomolecules – Applications in biotechnology. Biotechnol Annu Rev. 9, 315-395

Nielsen, J.E. (2007). Analyzing the pH-dependent properties of proteins using pKa calculations. Journal of molecular graphics & modelling 25(5), 691-699

Nugent, T. and Jones, D.T. (2012). Membrane protein structural bioinformatics. J Struct Biol. 179(3), 327-337

Nygaard, R., Frimurer, T., Holst, B., Rosenkilde, M.M., Schwartz, T.W. (2009). Ligand binding and micro-switches in 7TM receptor structures. Trends Pharmacol Sci. 30(5), 249-259

Pace, C.N., Grimsley, G.R., Scholtz, J.M. (2009). Protein ionizable groups: pK values and their contribution to protein stability and solubility. J Biol Chem. 284(20), 13285-13289

Pennacchio, L.A., Olivier, M., Hubacek, J.A., Cohen, J.C., Cox, D.R., Fruchart, J.C., Krauss, R.M., Rubin, E.M. (2001). An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing. Science 294(5), 169-173.

Pierce, K.L., Premont, R.T., Lefkowitz, R.J. (2002). Seven-transmembrane receptors. Nature Rev Mol Cell Biol 3(9), 639-650

Rhee, M.H., Nevo, I., Levy, R., Vogel, Z. (2000). Role of the highly conserved Asp-Arg-Tyr motif in signal transduction of the CB2 . FEBS Lett. 466(2-3), 300-304.

Robas, N., O’Reilly, M., Katugampola, S., Fidock, M. (2003). Maximizing serendipity: strategies for identifying ligands for orphan G-protein-coupled receptors. Curr Opin Pharmacol. 3(2), 121-126

Schallreuter, K.U., Wei, Y., Pittelkow, M.R., Swanson, N.N., Gibbons, N.C., Wood, J.M. (2007). Structural and functional alterations in the beta2-adrenoceptor are caused by a point mutation in patients with atopic eczema. Exp Dermatol. 16(10), 807-813.

Scheerer, P., Park, J.H., Hildebrand, P.W., Kim, Y.J., Krauss, N., Choe, H.W., Hofmann, K.P., Ernst, O.P. (2008). Crystal structure of opsin in its G-protein-interacting conformation. Nature 455(7212), 497-502.

Schreiber, G. and Keating, A.E. (2011). Protein binding specificity versus promiscuity. Current opinion in structural biology 21(1), 50-61

Sinha, N., and Smith-Gill, S.J. (2002). Electrostatics in protein binding and function. Current protein & peptide science 3(6), 601-614

Strader, C.D., Sigal, I.S., Candelore, M.R., Rands, E., Hill, W.S., Dixon, R.A. (1988). Conserved aspartic acid residues 79 and 113 of the beta-adrenergic receptors have different roles in receptor functions. J Biol Chem. 263(21), 10267-10271.

Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22), 4673-4680.

Vanni, S., Neri, M., Tavernelli, I., Rothlisberger, U. (2010). A conserved protonation-induced switch can trigger “ionic-lock” formation in adrenergic receptors. J Mol Biol. 397(5), 1339-1349.

Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. (2001). The sequence of the human genome. Science 291(5507), 1304-1351

Vroling, B., Sanders, M., Baakman, C., Borrmann, A., Verhoeven, S., Klomp, J., Oliveira, L., de Vlieg, J., Vriend, G. (2011). GPCRDB: information system for G protein-coupled receptors. Nucleic Acids Res. 39(Database issue), D309-D319.

Wacker, D., Fenalti, G., Brown, M.A., Katritch, V., Abagyan, R., Cherezov, V., Stevens, R.C. (2010). Conserved binding mode of human beta2 adrenergic receptor inverse agonists and antagonist revealed by X-ray crystallography. J Am Chem Soc. 132(33), 11443-11445.

Warwicker, J. and Watson, H. (1982). Calculation of the electric potential in the active site cleft due to alpha-helix dipoles. J Mol Biol. 157(4), 671-679

Warwicker, J. (2004). Improved pKa calculation through flexibility based sampling of a water-dominated interaction scheme. Protein Sci. 13(10), 2793-2805

Warwicker, J. (2011). pKa predictions with a coupled finite difference Poisson-Boltzmann and Debye-Hückel method. Proteins. 79(12), 3374-3380

Zhang, Y., DeVries, M.E., Skolnick, J. (2006). Structure Modeling of All Identified G Protein-Coupled Receptors in the Human Genome. PLoS Comput Biol 2(2), e13

Zhang, J. and Zhang, Y. (2010). GPCRRD: G protein-coupled receptor spatial restraint database for 3D structure modeling and function annotation. Bioinformatics. 26(23), 3004-3005.

Zozulya, S., Echeverri, F., Nguyen, T. (2001). The human repertoire. Genome Biol 2(6), research0018.1-0018.12

Appendix A. GPCR Dataset Sequence Names

1. D-1 [Homo sapiens] 2. unnamed protein product [Homo sapiens] 3. unnamed protein product [Homo sapiens] 4. serotonin receptor, partial [Homo sapiens] 5. alpha-1A-adrenergic receptor [Homo sapiens] 6. alpha-2-adrenergic receptor, partial [Homo sapiens] 7. adenosine receptor [Homo sapiens] 8. alpha-2-adrenergic receptor old gene name 'ADRA2R' [Homo sapiens] 9. alpha-2-adrenergic receptor [Homo sapiens] 10. alpha-2 adrenergic receptor old gene name 'ADRA2R' [Homo sapiens] 11. alpha-2-adrenergic receptor (alpha-2 C2) old gene name 'ADRA2RL1' [Homo sapiens] 12. beta-3-adrenergic receptor [Homo sapiens] 13. EBV induced G-protein coupled receptor [Homo sapiens] 14. [Homo sapiens] 15. receptor [Homo sapiens] 16. [Homo sapiens] 17. neurokinin-2 receptor [Homo sapiens] 18. neurokinin 1 receptor [Homo sapiens] 19. [Homo sapiens] 20. putative [Homo sapiens] 21. HM145 [Homo sapiens] 22. G protein coupled receptor 23. 24. dopamine D3 receptor 25. dopamine D4 receptor 26. neurokinin-2 receptor, NK-2 receptor [human, Peptide, 398 aa] 27. substance K receptor, SK receptor [human, Peptide, 398 aa] 28. dopamine D5 pseudogene [human, Peptide, 154 aa] 29. neurokinin-3 receptor [Homo sapiens] 30. adenosine A2 receptor [Homo sapiens] 31. somatostatin receptor [Homo sapiens] 32. melanocortin-3 receptor [human, Peptide, 360 aa] 33. dopamine D3 receptor [Homo sapiens] 34. somatostatin receptor [Homo sapiens] 35. A3 adenosine receptor [Homo sapiens] 36. D2 dopamine receptor [Homo sapiens] 37. alpha1C adrenergic receptor [Homo sapiens] 38. somatostatin receptor [Homo sapiens] 39. alpha1C adrenergic receptor [Homo sapiens] 40. adenosine receptor A3 [Homo sapiens] 41. Mu opiate receptor [Homo sapiens] 42. angiotensin II type 1b receptor [Homo sapiens] 43. alpha adrenergic receptor subtype alpha 1b [human, heart, Peptide, 516 aa] 44. alpha adrenergic receptor subtype alpha 1c [human, heart, Peptide, 466 aa] 45. dopamine D2 receptor, partial [Homo sapiens] 46. 
serotonin receptor [Homo sapiens] 47. [Homo sapiens]

90

48. mu opioid receptor variant [Homo sapiens] 49. alpha 1c-adrenoceptor, alpha 1c-AR [human, prostate, Peptide Partial, 349 aa] 50. alpha-1B adrenergic receptor [Homo sapiens] 51. alpha-1C-adrenergic receptor [Homo sapiens] 52. opioid receptor-like protein, partial [Homo sapiens] 53. angiotensin receptor 54. galanin receptor [Homo sapiens] 55. dopamine D3 receptor [Homo sapiens] 56. neuromedin K receptor [Homo sapiens] 57. alpha 1C adrenergic receptor isoform 2 [Homo sapiens] 58. alpha 1C adrenergic receptor isoform 3 [Homo sapiens] 59. G protein-coupled receptor [Homo sapiens] 60. G protein-coupled receptor [Homo sapiens] 61. serotonin 4 receptor (5-HT4) [Homo sapiens] 62. type 2 neuropeptide Y receptor [Homo sapiens] 63. GPR10 [Homo sapiens] 64. beta-3-adrenergic receptor, splice form 2 - human 65. RecName: Full=D(3) dopamine receptor; AltName: Full=Dopamine D3 receptor 66. putative G protein-coupled receptor; similar to human C5A anaphylatoxin chemotactic receptor, Swiss-Prot Accession Number P21730 [Homo sapiens] 67. cholecystokinin B receptor [Homo sapiens] 68. receptor [Homo sapiens] 69. neuropeptide y/peptide YY receptor type 2 [Homo sapiens] 70. melatonin-related receptor [Homo sapiens] 71. RecName: Full=D(4) dopamine receptor; AltName: Full=D(2C) dopamine receptor; AltName: Full=Dopamine D4 receptor 72. RecName: Full=Alpha-2A adrenergic receptor; AltName: Full=Alpha-2 adrenergic receptor subtype C10; AltName: Full=Alpha-2A adrenoreceptor; Short=Alpha-2A adrenoceptor; Short=Alpha-2AAR 73. [Homo sapiens] 74. Y5 receptor [Homo sapiens] 75. cholecystokinin A receptor 76. adenosine receptor subtype A2a [Homo sapiens] 77. G PROTEIN-COUPLED RECEPTOR CKR-L3 [Homo sapiens] 78. somatostatin receptor-like protein [Homo sapiens] 79. G protein-coupled receptor [Homo sapiens] 80. 5-HT2c receptor - human 81. P2Y5 [Homo sapiens] 82. CCR5 receptor [Homo sapiens] 83. putative neurotransmitter receptor [Homo sapiens] 84. beta2-adrenergic receptor [Homo sapiens] 85. 
beta2-adrenergic receptor [Homo sapiens] 86. beta2-adrenergic receptor [Homo sapiens] 87. mu opioid receptor [Homo sapiens] 88. high-affinity lysophosphatidic acid receptor homolog [Homo sapiens] 89. -1 [Homo sapiens] 90. orexin receptor-2 [Homo sapiens] 91. alpha 1A adrenergic receptor isoform 4 [Homo sapiens] 92. BC62940_1 [Homo sapiens] 93. 5-HT4 receptor [Homo sapiens]

91

94. /cholecystokinin brain receptor [Homo sapiens] 95. G-protein coupled receptor RE2 [Homo sapiens] 96. dopamine D2 receptor [Homo sapiens] 97. neuropeptide Y receptor type 2 [Homo sapiens] 98. thyrotropin-releasing [Homo sapiens] 99. alpha 1c-adrenoceptor subtype, partial [Homo sapiens] 100. orphan G protein-coupled receptor GPR45 [Homo sapiens] 101. adenosine receptor A1 [Homo sapiens] 102. adenosine receptor A2b [Homo sapiens] 103. adenosine receptor A3 isoform 2 [Homo sapiens] 104. alpha-1D adrenergic receptor [Homo sapiens] 105. alpha-1B adrenergic receptor [Homo sapiens] 106. beta-2 adrenergic receptor [Homo sapiens] 107. type-1 angiotensin II receptor isoform 1 [Homo sapiens] 108. subtype-3 [Homo sapiens] 109. type A [Homo sapiens] 110. C-C chemokine receptor type 1 [Homo sapiens] 111. C-C chemokine receptor type 7 precursor [Homo sapiens] 112. muscarinic receptor M2 [Homo sapiens] 113. muscarinic M3 [Homo sapiens] 114. cannabinoid receptor 2 [Homo sapiens] 115. D(1A) dopamine receptor [Homo sapiens] 116. D(2) dopamine receptor isoform long [Homo sapiens] 117. D(1B) dopamine receptor [Homo sapiens] 118. galanin receptor type 2 [Homo sapiens] 119. galanin receptor type 3 [Homo sapiens] 120. receptor [Homo sapiens] 121. [Homo sapiens] 122. 5-hydroxytryptamine receptor 1B [Homo sapiens] 123. 5-hydroxytryptamine receptor 1D [Homo sapiens] 124. 5-hydroxytryptamine receptor 1E [Homo sapiens] 125. 5-hydroxytryptamine receptor 2C isoform a precursor [Homo sapiens] 126. 5-hydroxytryptamine receptor 6 [Homo sapiens] 127. 5-hydroxytryptamine receptor 7 isoform a [Homo sapiens] 128. neuropeptide Y receptor type 1 [Homo sapiens] 129. neuropeptide Y receptor type 2 [Homo sapiens] 130. [Homo sapiens] 131. substance-P receptor isoform long [Homo sapiens] 132. thyrotropin-releasing hormone receptor [Homo sapiens] 133. G-protein-coupled receptor [Homo sapiens] 134. G protein-coupled receptor [Homo sapiens] 135. beta-1 adrenergic receptor [Homo sapiens] 136. 
beta-3 adrenergic receptor [Homo sapiens] 137. somatostatin receptor type 1 [Homo sapiens] 138. somatostatin receptor type 2 [Homo sapiens] 139. somatostatin receptor type 3 [Homo sapiens] 140. somatostatin receptor type 5 [Homo sapiens] 141. growth hormone secretagogue receptor type 1 isoform 1b [Homo sapiens] 142. -releasing peptide receptor [Homo sapiens] 143. [Homo sapiens]

92

144. probable G-protein coupled receptor 21 [Homo sapiens] 145. chemokine XC receptor 1 [Homo sapiens] 146. histamine H3 receptor [Homo sapiens] 147. type 1A [Homo sapiens] 148. melatonin receptor type 1B [Homo sapiens] 149. probable G-protein coupled receptor 19 [Homo sapiens] 150. neuropeptide Y receptor type 5 [Homo sapiens] 151. 5-hydroxytryptamine receptor, partial [Homo sapiens] 152. beta-2-adrenergic receptor [Homo sapiens] 153. visual pigment-like receptor peropsin [Homo sapiens] 154. beta-1-adrenergic receptor [Homo sapiens] 155. adenosine receptor A2a [Homo sapiens] 156. beta-2 andrenergic receptor [Homo sapiens] 157. beta-2 adrenergic receptor [Homo sapiens] 158. G protein-coupled receptor 58 [Homo sapiens] 159. 5-hydroxytryptamine (serotonin) receptor 2A [Homo sapiens] 160. 5-hydroxytryptamine (serotonin) receptor 1F [Homo sapiens] 161. 5-HT4 receptor [Homo sapiens] 162. edited 5-hydroxytryptamine receptor 2C [Homo sapiens] 163. muscarinic acetylcholine receptor M5 [Homo sapiens] 164. somatostatin receptor 2B [Homo sapiens] 165. orphan G-protein coupled receptor GPR72 [Homo sapiens] 166. dopamine receptor D2longer [Homo sapiens] 167. 5-hydroxytryptamine (serotonin) receptor 1E [Homo sapiens] 168. muscarinic acetylcholine receptor m2 [Homo sapiens] 169. substance-P receptor isoform short [Homo sapiens] 170. neuromedin-K receptor [Homo sapiens] 171. CCK-B/gastrin receptor [Homo sapiens] 172. cholecystokinin B receptor [Homo sapiens] 173. alpha adrenergic receptor subtype alpha 1a [Homo sapiens] 174. KIAA1540 protein [Homo sapiens] 175. receptor 2 [Homo sapiens] 176. -type 2 [Homo sapiens] 177. G-protein coupled receptor 84 [Homo sapiens] 178. [Homo sapiens] 179. 5-hydroxytryptamine receptor 2A isoform 1 [Homo sapiens] 180. 5-hydroxytryptamine receptor 1F [Homo sapiens] 181. 5-hydroxytryptamine receptor 7 isoform d [Homo sapiens] 182. 5-hydroxytryptamine receptor 7 isoform b [Homo sapiens] 183. G-protein coupled receptor [Homo sapiens] 184. 
alpha 2C adrenergic receptor variant [Homo sapiens] 185. m3 muscarinic acetylcholine receptor [Homo sapiens] 186. 5-hydroxytryptamine receptor 4 isoform b [Homo sapiens] 187. neuropeptide FF receptor 1 [Homo sapiens] 188. 5-hydroxytryptamine4 receptor [Homo sapiens] 189. alpha 2A adrenergic receptor [Homo sapiens] 190. alpha 2B adrenergic receptor [Homo sapiens] 191. beta-2 adrenergic receptor, partial [Homo sapiens] 192. 5-hydroxytryptamine receptor 5A [Homo sapiens] 193. histamine H2 receptor isoform 2 [Homo sapiens]

93

194. probable G-protein coupled receptor 63 [Homo sapiens] 195. olfactory receptor 7C2 [Homo sapiens] 196. H3S [Homo sapiens] 197. neurokinin-2 receptor [Homo sapiens] 198. histamine H4 receptor isoform 1 [Homo sapiens] 199. histamine H3 receptor [Homo sapiens] 200. G-protein-coupled receptor 74 [Homo sapiens] 201. somatostatin receptor type 5 [Homo sapiens] 202. Unknown (protein for IMAGE:3354783), partial [Homo sapiens] 203. m1 muscarinic receptor [Homo sapiens] 204. m2 muscarinic cholinergic receptor [Homo sapiens] 205. m3 muscarinic cholinergic receptor [Homo sapiens] 206. m4 muscarinic cholinergic receptor [Homo sapiens] 207. m5 muscarinic cholinergic receptor [Homo sapiens] 208. adrenergic receptor alpha-1a [Homo sapiens] 209. G-protein-coupled receptor GPR54 [Homo sapiens] 210. melanopsin isoform 1 [Homo sapiens] 211. neuropeptide NPVF receptor [Homo sapiens] 212. histamine receptor H4 [Homo sapiens] 213. G protein-coupled receptor [Homo sapiens] 214. neuropeptide FF receptor 2 isoform 2 [Homo sapiens] 215. trace amine-associated receptor 8 [Homo sapiens] 216. probable G-protein coupled receptor 101 [Homo sapiens] 217. [Homo sapiens] 218. D(2) dopamine receptor isoform short [Homo sapiens] 219. KOR-3D [Homo sapiens] 220. Dopamine D4 receptor [Homo sapiens] 221. histamine H3 receptor isoform 3 [Homo sapiens] 222. histamine H3 receptor isoform 4 [Homo sapiens] 223. Angiotensin II receptor, type 1 [Homo sapiens] 224. Adenosine A1 receptor [Homo sapiens] 225. 1 [Homo sapiens] 226. opioid receptor kappa [Homo sapiens] 227. histamine H3 receptor isoform 5 [Homo sapiens] 228. histamine H3 receptor isoform 6 [Homo sapiens] 229. trace amine-associated receptor 1 [Homo sapiens] 230. [Homo sapiens] 231. [Homo sapiens] 232. seven transmembrane helix receptor [Homo sapiens] 233. seven transmembrane helix receptor [Homo sapiens] 234. seven transmembrane helix receptor [Homo sapiens] 235. seven transmembrane helix receptor [Homo sapiens] 236. 
seven transmembrane helix receptor [Homo sapiens] 237. seven transmembrane helix receptor [Homo sapiens] 238. seven transmembrane helix receptor [Homo sapiens] 239. seven transmembrane helix receptor [Homo sapiens] 240. beta 2 adrenergic receptor [Homo sapiens] 241. [Homo sapiens] 242. cholecystokinin-C receptor [Homo sapiens] 243. G-protein coupled receptor 161 isoform 2 [Homo sapiens]

94

244. 5-hydroxytryptamine 4 receptor subunit b [Homo sapiens]
245. seven transmembrane helix receptor [Homo sapiens]
246. DRG kappa 1 splice variant KOR 1A [Homo sapiens]
247. KOR-3A splice variant [Homo sapiens]
248. mu opioid receptor variant MOR-1R [Homo sapiens]
249. Cholinergic receptor, muscarinic 5 [Homo sapiens]
250. delta opioid receptor [Homo sapiens]
251. Similar to , partial [Homo sapiens]
252. mu3 opiate receptor [Homo sapiens]
253. trace amine-associated receptor 6 [Homo sapiens]
254. gastrin/cholecystokinin type B receptor [Homo sapiens]
255. unknown [Homo sapiens]
256. glucose-dependent insulinotropic receptor [Homo sapiens]
257. B/W receptor type 2 [Homo sapiens]
258. probable G-protein coupled receptor 45 [Homo sapiens]
259. hypothetical protein [Homo sapiens]
260. hypothetical protein [Homo sapiens]
261. D(4) dopamine receptor [Homo sapiens]
262. G protein-coupled receptor 50 variant b [Homo sapiens]
263. alpha-2B adrenergic receptor [Homo sapiens]
264. lysophosphatidic acid receptor 6 [Homo sapiens]
265. mu opioid receptor splice variant hMOR-1B4 [Homo sapiens]
266. mu opioid receptor splice variant hMOR-1B5 [Homo sapiens]
267. mu opioid receptor splice variant hMOR-1Y [Homo sapiens]
268. C-C chemokine receptor type 6 [Homo sapiens]
269. muscarinic acetylcholine receptor M1 [Homo sapiens]
270. QRFP receptor [Homo sapiens]
271. mu opioid receptor variant MOR1W [Homo sapiens]
272. Histamine receptor H1 [Homo sapiens]
273. RecName: Full=Putative trace amine-associated receptor 3; Short=TaR-3; Short=Trace amine receptor 3; AltName: Full=G-protein coupled receptor 57
274. growth hormone secretagogue receptor type 1 isoform 1a [Homo sapiens]
275. 5-hydroxytryptamine (serotonin) receptor 2B [Homo sapiens]
276. kappa-type opioid receptor [Homo sapiens]
277. alpha 1A adrenoceptor isoform 2b [Homo sapiens]
278. alpha 1A adrenoceptor isoform 2c [Homo sapiens]
279. alpha 1A adrenoceptor isoform 3c [Homo sapiens]
280. alpha 1A adrenoceptor isoform 5a [Homo sapiens]
281. alpha 1A adrenoceptor isoform 6 [Homo sapiens]
282. 5-hydroxytryptamine receptor 4 isoform g [Homo sapiens]
283. unknown [Homo sapiens]
284. TPA_inf: olfactory receptor OR19-22 [Homo sapiens]
285. G protein-coupled receptor 45 [Homo sapiens]
286. G protein-coupled receptor 63 [Homo sapiens]
287. G protein-coupled receptor 83 [Homo sapiens]
288. TAAR2 protein [Homo sapiens]
289. TAAR2 protein [Homo sapiens]
290. G protein-coupled receptor 63 [Homo sapiens]
291. G protein-coupled receptor 63 [Homo sapiens]
292. Neuropeptides B/W receptor 2 [Homo sapiens]

293. Angiotensin II receptor, type 1 [Homo sapiens]
294. CHRM5 protein [Homo sapiens]
295. adenosine receptor A3 isoform 1 [Homo sapiens]
296. Trace amine associated receptor 5 [Homo sapiens]
297. Cannabinoid receptor 2 (macrophage) [Homo sapiens]
298. RecName: Full=Beta-1 adrenergic receptor; AltName: Full=Beta-1 adrenoreceptor; Short=Beta-1 adrenoceptor
299. G protein-coupled receptor 21 [Homo sapiens]
300. DRD1 [Homo sapiens]
301. PPYR1 [Homo sapiens]
302. OPRL1 [Homo sapiens]
303. Cannabinoid receptor 2 (macrophage) [Homo sapiens]
304. muscarinic acetylcholine receptor M4 [Homo sapiens]
305. olfactory receptor 7D2 [Homo sapiens]
306. neuropeptides B/W receptor type 1 [Homo sapiens]
307. olfactory receptor 14J1 [Homo sapiens]
308. 5-hydroxytryptamine receptor 1A [Homo sapiens]
309. neuropeptide Y receptor Y1 variant [Homo sapiens]
310. G protein-coupled receptor 63 variant [Homo sapiens]
311. isoform long variant [Homo sapiens]
312. unknown [Homo sapiens]
313. kiSS-1 receptor [Homo sapiens]
314. Adrenergic, beta-2-, receptor, surface variant [Homo sapiens]
315. G-protein coupled purinergic receptor P2Y5 variant [Homo sapiens]
316. chemokine (C-C motif) receptor 6 variant [Homo sapiens]
317. Chemokine (C motif) receptor 1 [Homo sapiens]
318. delta-type opioid receptor [Homo sapiens]
319. Adrenergic, alpha-1A-, receptor [Homo sapiens]
320. TACR3 protein, partial [Homo sapiens]
321. CNR2 protein, partial [Homo sapiens]
322. TACR1 protein, partial [Homo sapiens]
323. [Homo sapiens]
324. G protein-coupled receptor 52 [Homo sapiens]
325. Trace amine associated receptor 5 [Homo sapiens]
326. Neuromedin B receptor [Homo sapiens]
327. adrenergic, alpha-2B-, receptor isoform [Homo sapiens]
328. 2 [Homo sapiens]
329. Pancreatic polypeptide receptor 1 [Homo sapiens]
330. G protein-coupled receptor 119 [Homo sapiens]
331. alpha-2C adrenergic receptor [Homo sapiens]
332. melanopsin isoform 2 [Homo sapiens]
333. GPR50 protein [Homo sapiens]
334. trace amine-associated receptor 2 isoform 2 [Homo sapiens]
335. trace amine-associated receptor 2 isoform 1 [Homo sapiens]
336. opsin 4 (melanopsin) [Homo sapiens]
337. D(3) dopamine receptor isoform a [Homo sapiens]
338. D(3) dopamine receptor isoform e [Homo sapiens]
339. CCR5 chemokine receptor T143N variant [Homo sapiens]
340. somatostatin receptor subtype 5B [Homo sapiens]
341. somatostatin receptor subtype 5C [Homo sapiens]

342. 5-hydroxytryptamine receptor 4 isoform a [Homo sapiens]
343. 5-hydroxytryptamine receptor 4 isoform i [Homo sapiens]
344. 5-hydroxytryptamine (serotonin) receptor 5A [Homo sapiens]
345. Opsin 4 [Homo sapiens]
346. trace amine-associated receptor 9 [Homo sapiens]
347. alpha-1A adrenergic receptor isoform 2 [Homo sapiens]
348. alpha-1A adrenergic receptor isoform 3 [Homo sapiens]
349. alpha-1A adrenergic receptor isoform 4 [Homo sapiens]
350. alpha-1A adrenergic receptor isoform 1 [Homo sapiens]
351. CHRM4 protein [Homo sapiens]
352. G protein-coupled receptor 50 [Homo sapiens]
353. [Homo sapiens]
354. mu-type opioid receptor isoform MOR-1X [Homo sapiens]
355. mu-type opioid receptor isoform MOR-1O [Homo sapiens]
356. mu-type opioid receptor isoform MOR-1A [Homo sapiens]
357. mu-type opioid receptor isoform MOR-1 [Homo sapiens]
358. QRFPR protein [Homo sapiens]
359. DRD3 protein [Homo sapiens]
360. opioid receptor, mu 1, isoform CRA_a [Homo sapiens]
361. hCG1786552, isoform CRA_a [Homo sapiens]
362. adrenergic, beta-1-, receptor [Homo sapiens]
363. , isoform CRA_b [Homo sapiens]
364. neuropeptide FF receptor 1 [Homo sapiens]
365. dopamine receptor D1, isoform CRA_a [Homo sapiens]
366. adrenergic, alpha-1B-, receptor [Homo sapiens]
367. 5-hydroxytryptamine (serotonin) receptor 4, isoform CRA_a [Homo sapiens]
368. 5-hydroxytryptamine (serotonin) receptor 4, isoform CRA_d [Homo sapiens]
369. adrenergic, alpha-1A-, receptor, isoform CRA_b [Homo sapiens]
370. adrenergic, alpha-1A-, receptor, isoform CRA_d [Homo sapiens]
371. dopamine receptor D2, isoform CRA_a [Homo sapiens]
372. dopamine receptor D2, isoform CRA_d [Homo sapiens]
373. dopamine receptor D2, isoform CRA_e [Homo sapiens]
374. cholecystokinin B receptor, isoform CRA_a [Homo sapiens]
375. cholecystokinin B receptor, isoform CRA_b [Homo sapiens]
376. opiate receptor-like 1, isoform CRA_a [Homo sapiens]
377. opiate receptor-like 1, isoform CRA_b [Homo sapiens]
378. histamine receptor H3, isoform CRA_a [Homo sapiens]
379. histamine receptor H3, isoform CRA_b [Homo sapiens]
380. histamine receptor H3, isoform CRA_e [Homo sapiens]
381. angiotensin II receptor, type 1, isoform CRA_a [Homo sapiens]
382. angiotensin II receptor, type 1, isoform CRA_c [Homo sapiens]
383. dopamine receptor D3, isoform CRA_a [Homo sapiens]
384. dopamine receptor D3, isoform CRA_b [Homo sapiens]
385. dopamine receptor D3, isoform CRA_c [Homo sapiens]
386. dopamine receptor D3, isoform CRA_d [Homo sapiens]
387. dopamine receptor D3, isoform CRA_e [Homo sapiens]
388. dopamine receptor D3, isoform CRA_g [Homo sapiens]
389. dopamine receptor D3, isoform CRA_h [Homo sapiens]
390. opsin 4 (melanopsin), isoform CRA_b [Homo sapiens]
391. hCG1981539 [Homo sapiens]

392. olfactory receptor, family 7, subfamily C, member 2 [Homo sapiens]
393. G protein-coupled receptor 161, isoform CRA_b [Homo sapiens]
394. hCG2005742, isoform CRA_a [Homo sapiens]
395. hCG2039474, isoform CRA_b [Homo sapiens]
396. G protein-coupled receptor 50 [Homo sapiens]
397. 5-hydroxytryptamine (serotonin) receptor 2C, isoform CRA_a [Homo sapiens]
398. olfactory receptor, family 5, subfamily U, member 1 [Homo sapiens]
399. , isoform CRA_b [Homo sapiens]
400. G protein-coupled receptor 103, isoform CRA_a [Homo sapiens]
401. , isoform CRA_b [Homo sapiens]
402. prokineticin receptor 2 [Homo sapiens]
403. delta opioid receptor 1 variant [Homo sapiens]
404. Adrenergic, beta-2-, receptor, surface [Homo sapiens]
405. prokineticin receptor 2 [Homo sapiens]
406. somatostatin receptor type 4 [Homo sapiens]
407. melatonin-related receptor [Homo sapiens]
408. neuropeptide Y receptor type 4 [Homo sapiens]
409. substance-K receptor [Homo sapiens]
410. pyroglutamylated RFamide peptide receptor [Homo sapiens]
411. unnamed protein product [Homo sapiens]
412. unnamed protein product [Homo sapiens]
413. unnamed protein product [Homo sapiens]
414. unnamed protein product [Homo sapiens]
415. unnamed protein product [Homo sapiens]
416. unnamed protein product [Homo sapiens]
417. unnamed protein product [Homo sapiens]
418. unnamed protein product [Homo sapiens]
419. unnamed protein product [Homo sapiens]
420. unnamed protein product [Homo sapiens]
421. unnamed protein product [Homo sapiens]
422. Chain A, Crystal Structure Of The Human Beta2 Adrenoceptor
423. Chain A, Crystal Structure Of The Human Beta2 Adrenoceptor
424. neuromedin-B receptor [Homo sapiens]
425. mu opioid receptor splice variant hMOR-1K2 [Homo sapiens]
426. G protein-coupled receptor 83 [Homo sapiens]
427. galanin receptor type 1 [Homo sapiens]
428. adrenergic, alpha-1A-, receptor variant 2 [Homo sapiens]
429. adrenergic, alpha-1A-, receptor variant 4 [Homo sapiens]
430. melanocortin receptor 3 [Homo sapiens]
431. neuromedin-U receptor 2 [Homo sapiens]
432. C-C chemokine receptor type 2 isoform A [Homo sapiens]
433. C-C chemokine receptor type 2 isoform B [Homo sapiens]
434. probable G-protein coupled receptor 52 [Homo sapiens]
435. unnamed protein product [Homo sapiens]
436. unnamed protein product [Homo sapiens]
437. unnamed protein product [Homo sapiens]
438. unnamed protein product [Homo sapiens]
439. unnamed protein product [Homo sapiens]
440. unnamed protein product [Homo sapiens]
441. unnamed protein product [Homo sapiens]

442. unnamed protein product [Homo sapiens]
443. 5-hydroxytryptamine (serotonin) receptor 2C [Homo sapiens]
444. 5-hydroxytryptamine (serotonin) receptor 2C [Homo sapiens]
445. 5-hydroxytryptamine (serotonin) receptor 2B [Homo sapiens]
446. 5-hydroxytryptamine (serotonin) receptor 2A [Homo sapiens]
447. unnamed protein product [Homo sapiens]
448. unnamed protein product [Homo sapiens]
449. unnamed protein product [Homo sapiens]
450. histamine H3 receptor [Homo sapiens]
451. alpha-2A adrenergic receptor [Homo sapiens]
452. unnamed protein product [Homo sapiens]
453. unnamed protein product [Homo sapiens]
454. unnamed protein product [Homo sapiens]
455. unnamed protein product [Homo sapiens]
456. histamine H2 receptor isoform 1 [Homo sapiens]
457. dopamine receptor D2 isoform short [Homo sapiens]
458. beta3 adrenoceptor [Homo sapiens]
459. beta3 adrenoceptor [Homo sapiens]
460. Opsin 4 [Homo sapiens]
461. unnamed protein product [Homo sapiens]
462. unnamed protein product [Homo sapiens]
463. neuropeptide FF receptor 2 isoform 3 [Homo sapiens]
464. neuropeptide FF receptor 2 isoform 1 [Homo sapiens]
465. 5-hydroxytryptamine receptor 2B [Homo sapiens]
466. orexin receptor type 1 [Homo sapiens]
467. orexin receptor type 2 [Homo sapiens]
468. G protein-coupled receptor 19 [Homo sapiens]
469. probable G-protein coupled receptor 83 precursor [Homo sapiens]
470. trace amine-associated receptor 5 [Homo sapiens]
471. mu-type opioid receptor isoform MOR-1B4 [Homo sapiens]
472. mu-type opioid receptor isoform MOR-1H [Homo sapiens]
473. mu-type opioid receptor isoform MOR-1G1 [Homo sapiens]
474. mu-type opioid receptor isoform MOR-1G2 [Homo sapiens]
475. mu-type opioid receptor isoform MOR-1B3 [Homo sapiens]
476. mu-type opioid receptor isoform MOR-1B5 [Homo sapiens]
477. mu-type opioid receptor isoform MOR-1B1 [Homo sapiens]
478. mu-type opioid receptor isoform MOR-1B2 [Homo sapiens]
479. 5-hydroxytryptamine (serotonin) receptor 4 [Homo sapiens]
480. opioid receptor mu 1 transcript variant hMOR-1JL [Homo sapiens]
481. 5-hydroxytryptamine receptor 2A isoform 2 [Homo sapiens]
482. Chain A, Crystal Structure Of A Methylated Beta2 Adrenergic Receptor-Fab Complex
483. RecName: Full=Beta-2 adrenergic receptor; AltName: Full=Beta-2 adrenoreceptor; Short=Beta-2 adrenoceptor
484. 5-hydroxytryptamine receptor 4 isoform d [Homo sapiens]
485. Chain A, Solution Conformation Of In Water Complexed With Nk1r
486. RecName: Full=Prolactin-releasing peptide receptor; Short=PrRP receptor; Short=PrRPR; AltName: Full=G-protein coupled receptor 10; AltName: Full=hGR3

487. RecName: Full=Probable G-protein coupled receptor 19; AltName: Full=GPR-NGA
488. Chain A, Thermostabilised Human A2a Receptor With Adenosine Bound
489. Chain A, Thermostabilised
490. pyroglutamylated RFamide peptide receptor [Homo sapiens]
491. Chain A, Crystal Structure Of Human Adenosine A2a Receptor With An Allosteric Inverse-Agonist Antibody At 2.7 A Resolution
492. 5-hydroxytryptamine receptor 2C isoform b precursor [Homo sapiens]
493. adenosine receptor A3, partial [Homo sapiens]
494. 5-hydroxytryptamine (serotonin) receptor 7 (adenylate cyclase-coupled) [Homo sapiens]
495. opioid receptor, mu 1 [Homo sapiens]
496. neuropeptide FF receptor 1 [Homo sapiens]
497. olfactory receptor family 5 subfamily U member 1 [Homo sapiens]
498. Melanocortin receptor 3; Short=MC3-R [Homo sapiens]

Appendix B. Unix text filtering commands

# $2 : atom name/identifier of amino acid
# $3 : amino acid three-letter code
# $5 : amino acid residue number
# $8 : z coordinate value

1 - Query dataset by histogram bin for all residues
## Specify only the bin to be searched**
for i in ~/GPCR_Dataset/*.pdb
do
    awk ' /^ATOM/ && $8 ~ /^50\./ {print $2" "$3" "$5" "$8} ' "$i" >> results.out
done
**All other bin searches employ the same syntax, with the pattern matched against $8 specifying the bin to be searched. The dot is escaped so that bin 50 matches only z values of the form 50.xxx.

2 - Query dataset by histogram bin for ionizable residues
## Specify the amino acid code and the ionizable group of its side chain, then count the matches

# bin 20*
awk ' /^ATOM/ && $8 ~ /^20\./ && $3 == "LYS" && $2 == "NZ" ' ./* | wc -l
awk ' /^ATOM/ && $8 ~ /^20\./ && $3 == "ARG" && $2 == "CZ" ' ./* | wc -l
awk ' /^ATOM/ && $8 ~ /^20\./ && $3 == "HIS" && $2 == "CE1" ' ./* | wc -l
awk ' /^ATOM/ && $8 ~ /^20\./ && $3 == "ASP" && $2 == "CG" ' ./* | wc -l
awk ' /^ATOM/ && $8 ~ /^20\./ && $3 == "GLU" && $2 == "CD" ' ./* | wc -l
*All other bin searches employ the same syntax, with the pattern matched against $8 specifying the bin to be searched
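The five per-residue commands above could also be combined into a single awk pass per bin. The following is a minimal sketch, not the procedure used in this work; it assumes the field layout listed at the top of this appendix, and the function name count_ionizable is hypothetical.

```shell
#!/bin/sh
# Sketch only: count the ionizable side-chain atoms in one z-coordinate bin
# using a single awk pass over all files given as arguments.
# Assumed field layout (from this appendix): $2 atom name, $3 residue code,
# $8 z coordinate. count_ionizable is a hypothetical helper name.
count_ionizable() {
    bin="$1"; shift
    awk -v bin="$bin" '
        BEGIN {
            # Residue code -> atom name used to represent the ionizable group
            atom["LYS"] = "NZ"; atom["ARG"] = "CZ"; atom["HIS"] = "CE1"
            atom["ASP"] = "CG"; atom["GLU"] = "CD"
        }
        # Keep the atom only if its z value starts with "<bin>." (e.g. "20.")
        /^ATOM/ && ($3 in atom) && $2 == atom[$3] && index($8, bin ".") == 1 {
            count[$3]++
        }
        END {
            n = split("LYS ARG HIS ASP GLU", order, " ")
            for (k = 1; k <= n; k++)
                printf "%s %d\n", order[k], count[order[k]]
        }
    ' "$@"
}
```

For example, `count_ionizable 20 ./*` prints one line per residue type for bin 20, so a loop over bin values yields the whole histogram without editing the pattern by hand.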

3 - Query dataset by conserved residue
## Specify amino acid code and residue number
awk ' /^ATOM/ && $3 == "ASP" && $2 == "CG" && $5 == "113" ' ./* | wc -l
grep ^ATOM ./* | grep " ASP " | grep " CG " | grep " 113 "
awk ' /^ATOM/ && $3 == "ASP" && $2 == "CG" && $5 == "130" ' ./* | wc -l
grep ^ATOM ./* | grep " ASP " | grep " CG " | grep " 130 "
awk ' /^ATOM/ && $3 == "ASP" && $2 == "CG" && $5 == "79" ' ./* | wc -l
grep ^ATOM ./* | grep " ASP " | grep " CG " | grep " 79 "
awk ' /^ATOM/ && $3 == "ARG" && $2 == "CZ" && $5 == "131" ' ./* | wc -l
grep ^ATOM ./* | grep " ARG " | grep " CZ " | grep " 131 "
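The counts above are totals over the whole dataset. Where a per-structure view is wanted, a report of which files contain a given residue at a given position could be produced as follows. This is an illustrative sketch under the same assumed field layout ($3 residue code, $5 residue number); report_residue is a hypothetical name and was not part of the original analysis.

```shell
#!/bin/sh
# Sketch only: for each file, report whether residue type RES occurs at
# residue number NUM. Assumed field layout: $3 residue code, $5 residue number.
# report_residue is a hypothetical helper name.
report_residue() {
    res="$1"; num="$2"; shift 2
    for f in "$@"; do
        # awk exits with status 0 if at least one matching ATOM line exists
        if awk -v res="$res" -v num="$num" '
               /^ATOM/ && $3 == res && $5 == num { found = 1; exit }
               END { exit !found }
           ' "$f"; then
            printf '%s\t%s%s\tpresent\n' "$f" "$res" "$num"
        else
            printf '%s\t%s%s\tabsent\n' "$f" "$res" "$num"
        fi
    done
}
```

For example, `report_residue ASP 113 ./*.pdb` lists every structure with a tab-separated present/absent flag, which makes it easy to see which entries account for the dataset-wide count.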
