New approaches to data-driven analysis and enhanced sampling simulations of G protein-coupled receptors

Doctoral Thesis in Biological Physics

OLIVER FLEETWOOD


Academic Dissertation which, with due permission of the KTH Royal Institute of Technology, is submitted for public defence for the Degree of Doctor of Philosophy on June 2nd 2021, at 1.00 p.m. in room F3, Lindstedtsvägen 26, Sing-Sing, floor 2, KTH Campus, Stockholm

Doctoral Thesis in Biological Physics
KTH Royal Institute of Technology
Stockholm, Sweden 2021
© Oliver Fleetwood

ISBN: 978-91-7873-849-6 TRITA-SCI-FOU 2021:015

Printed by: Universitetsservice US-AB, Sweden 2021

Abstract

Proteins are large biomolecules that carry out specific functions within living organisms. Understanding how proteins function is a massive scientific challenge with a wide range of applications. In particular, by controlling protein function we may develop therapies for many diseases. To understand a protein's function, we need to consider its full conformational ensemble, and not only a single snapshot of a structure. Allosteric signaling is a factor often driving protein conformational change, where the binding of a molecule to one site triggers a response in another part of the protein. G protein-coupled receptors (GPCRs) are transmembrane proteins that bind molecules outside the membrane, which enables coupling to a G protein in their intracellular domain. Understanding the complex allosteric process governing this mechanism could have a significant impact on the development of novel drugs.

Molecular dynamics (MD) is a computational method that can capture protein conformational change at an atomistic level. However, MD is a computationally expensive approach to simulating proteins, and is thus infeasible for many applications. Enhanced sampling techniques have emerged to reduce the computational cost of standard MD. Another challenge with MD is to extract useful information and distinguish signal from noise in an MD trajectory. Data-driven methods can streamline analysis of protein simulations and improve our understanding of biomolecular systems.

Papers 1 and 2 contain methodological developments to analyze the results of MD in a data-driven manner. We provide methods that create interpretable maps of important molecular features from protein simulations (Paper 1) and identify allosteric communication pathways in biological systems (Paper 2). As a result, more insights can be extracted from MD trajectories. Our approach is generalizable and can become useful for analyzing complex simulations of various biomolecular systems.

In Papers 3 and 4, we combine the aforementioned methodological advancements with enhanced sampling techniques to study a prototypical GPCR, the β2 adrenergic receptor. First, we make improvements to the string method with swarms of trajectories and derive the conformational change and free energy along the receptor's activation pathway. Next, we identify key molecular microswitches directly or allosterically controlled by orthosteric ligands and show how these couple to a shift in the probability of the receptor's active state. In Paper 4, we also find that ligands induce ligand-specific states and characterize the molecular basis governing these states.

These new approaches generate insights compatible with previous simulation and experimental studies at a relatively low computational cost. Our work also provides new insights into the molecular basis of allosteric communication in membrane proteins, and might become a useful tool in the design of novel GPCR drugs.

Sammanfattning

Proteins are large biomolecules that carry out specific functions within living organisms. Understanding how proteins function is an enormous scientific challenge with many areas of application. In particular, by controlling protein function we may develop cures for many diseases. To understand a protein's function, we must consider its full conformational ensemble and not only a single snapshot of a structure. Allosteric regulation is a factor that can make proteins change state: a molecule binds to one site and thereby causes a change in another part of the protein. G protein-coupled receptors (GPCRs) are transmembrane proteins that bind molecules outside the membrane, which enables coupling to a G protein in their intracellular domain. Understanding the complex allosteric regulation that governs this mechanism can have a significant impact on the development of new drugs.

Molecular dynamics (MD) simulations are a computational method that can be used to study changes in protein structure and state at the atomic level. MD is a costly simulation tool and is therefore impractical in many contexts. So-called enhanced sampling methods have been developed to reduce the computational cost of standard MD. A further challenge is to extract useful information and distinguish signal from noise in an MD simulation. Data-driven methods can streamline the analysis of protein simulations and improve our understanding of biomolecular systems in general.

Papers 1 and 2 describe the development of data-driven methods for analyzing MD simulations. The methods map important molecular features in protein simulations (Paper 1) and identify allosteric communication pathways in biological systems (Paper 2), and present the results in an accessible way. As a result, MD can lead to more and better insights. Our approach is generalizable and can be used to analyze complex simulations of many biomolecular systems.

In Papers 3 and 4, we combine the aforementioned methodological advances with enhanced sampling methods to study a prototypical GPCR, the β2 adrenergic receptor. As a first step, we improve the so-called string method, a type of enhanced sampling technique, and use it to derive the functional states and the free energy along the receptor's activation pathway. We then identify important molecular interactions controlled by ligands and show how these couple to a change in the probability of the receptor's active state. In Paper 4, we also show that individual ligands induce specific states, as well as the molecular properties of these states.

The methods generate results that agree with previous simulation and experimental studies at a relatively low computational cost. Our work also leads to new insights into the molecular basis of allosteric communication in membrane proteins, and may become a useful tool in the development of future GPCR drugs.

List of publications

Paper 1. Fleetwood, O., Kasimova, M.A., Westerlund, A.M. and Delemotte, L. 2020. Molecular Insights from Conformational Ensembles via Machine Learning. Biophysical Journal 118(3), pp. 765–780.

Paper 2. Westerlund, A.M., Fleetwood, O., Pérez-Conesa, S. and Delemotte, L. 2020. Network analysis reveals how lipids and other cofactors influence membrane protein allostery. The Journal of Chemical Physics 153(14), p. 141103.

Paper 3. Fleetwood, O., Matricon, P., Carlsson, J. and Delemotte, L. 2020. Energy Landscapes Reveal Agonist Control of G Protein-Coupled Receptor Activation via Microswitches. Biochemistry 59(7), pp. 880–891.

Paper 4. Fleetwood, O., Carlsson, J. and Delemotte, L. 2021. Identification of ligand-specific G-protein coupled receptor states and prediction of downstream efficacy via data-driven modeling. eLife 10.

All papers are distributed under the terms of the Creative Commons CC-BY license (https://creativecommons.org/licenses/by/4.0/).

Author contributions

Paper 1. O.F., M.A.K., and L.D. conceptualized the project. O.F., M.A.K., and A.M.W. wrote the code and performed different parts of data curation and formal analysis: A.M.W. for CaM, O.F. for β2AR, and M.A.K. for the VSD. O.F. and M.A.K. designed and analyzed the toy model. L.D. supervised the project. O.F., M.A.K., and A.M.W. wrote the first draft of the manuscript. All authors reviewed and edited the manuscript.

Paper 2. L.D. initiated and coordinated the project. L.D. and A.M.W. defined the project scope. A.M.W. developed and implemented the framework and analyzed the datasets.

O.F., S.P.C., and A.M.W. performed simulations of the β2 adrenergic receptor, KcsA, and KCNQ1, respectively. A.M.W. prepared the figures and the first manuscript draft. All authors participated in interpreting the results and writing the paper. O.F. and S.P.C. contributed equally to this work.

Paper 3. L.D. and J.C. initiated and supervised the project. O.F. developed the computational methods and wrote the software implementation. O.F. and P.M. performed the simulations and data curation. All authors participated in the formal analysis. O.F. wrote the first draft of the paper. All authors participated in interpreting the results, creating visualizations and editing the paper.

Paper 4. O.F. and J.C. conceptualized the project. L.D. and J.C. supervised the project. O.F. developed the computational methods and wrote the software implementation. O.F. and L.D. performed the formal analysis. O.F. wrote the first draft of the paper. All authors participated in interpreting the results, creating visualizations and editing the paper.

Table of contents

1) Introduction
2) Proteins
  2.1) Structures
  2.2) Dynamics
  2.3) Membrane proteins
  2.4) G protein-coupled receptors
3) Molecular dynamics (MD) simulations
  3.1) Principles of MD
  3.2) Conformational state populations and free energy landscapes
  3.3) Limitations and challenges of MD
  3.4) Simulations of membrane proteins
4) Extending and improving enhanced sampling protocols
  4.1) Types of enhanced sampling
  4.2) The string method with swarms of trajectories
  4.3) Kinetically trapped state sampling
5) Data-driven analysis of MD trajectories
  5.1) Dimensionality reduction methods for visualization
  5.2) Finding important molecular features with machine learning
  5.3) Allosteric network analysis
6) Summary of papers
  6.1) Paper 1—Molecular Insights from Conformational Ensembles via Machine Learning
  6.2) Paper 2—Network analysis reveals how lipids and other cofactors influence membrane protein allostery
  6.3) Paper 3—Energy Landscapes Reveal Agonist Control of G Protein-Coupled Receptor Activation via Microswitches
  6.4) Paper 4—Identification of ligand-specific G-protein coupled receptor states and prediction of downstream efficacy via data-driven modeling
7) Outlook

Acknowledgements

Bibliography

1) Introduction

Proteins are large biomolecules consisting of chains of amino acids that carry out specific functions within living organisms. Understanding how proteins function is a massive scientific challenge with a wide range of applications. In particular, by controlling protein function, we may develop therapies for many diseases.

Proteins are not static structures, but flexible molecules (1). To understand a protein’s function, we need to consider its full conformational ensemble, and not only a single snapshot of a structure. Allosteric signaling is an important factor often driving conformational change, where the binding of a molecule to one site triggers a response, such as enzyme catalysis or large-scale conformational change, in another protein domain. This flow of information through the system is not a completely random process, but follows certain allosteric pathways of interacting residues.

Membrane proteins, as their name suggests, are located in the cell membrane. Since the membrane lipid bilayer plays a key role in protecting the cell, membrane proteins have evolved to transmit signals between the interior of the cell and its environment. G protein-coupled receptors (GPCRs) constitute a large group of such proteins, with over 800 members in the human genome. All GPCRs share structural features: they have seven transmembrane helices, an extracellular N-terminus and an intracellular C-terminus. Molecules outside the cell, such as hormones or neurotransmitters, bind to the receptor, which allows it to couple to an intracellular G protein, triggering an intracellular response. Understanding how ligands control receptor signaling can have a significant impact on the development of novel drugs with specific signaling properties.

Molecular dynamics (MD) is a computational method that simulates the movement of interacting particles. Despite the relatively simple equations of motion governing an MD simulation, in silico studies of a biological system require a careful setup. An important prerequisite to capture protein conformational change in an MD trajectory is a starting structure of the protein embedded in its native environment. Given a properly configured system, MD can make it evolve over time. Given enough time, the protein will transition between its different functional states. However, because of the timescales involved in protein conformational change, enormous computational resources are typically required to study many biologically interesting questions. Thus, the use of brute-force MD simulations is impractical for many real-world applications.

Enhanced sampling methods have emerged to reduce the computational cost of standard MD. These methods introduce a bias into the system, which accelerates exploration of the conformational ensemble (2). A variety of enhanced sampling methods have emerged for different purposes. For example, the string method with swarms of trajectories (3) finds the most probable transition path (the string) between two end states.

As the output from MD simulations grows, efficient post-processing methods capable of handling large amounts of data become a critical part of simulation studies. A system with tens to hundreds of thousands of atoms is very high dimensional. The number of interactions grows quadratically with the number of particles, and these interactions occur at every snapshot in time. Thus, extracting useful information and distinguishing signal from noise in an MD trajectory is like finding a needle in a haystack. Data-driven methods, and machine learning in particular, have emerged to leverage big data in nearly every scientific domain (4–7). The field of biomolecular simulations is no exception. Data-driven methods can, for example, be used for dimensionality reduction (8, 9), enhanced sampling (10–14), visualization and identification of protein allostery (15–18).

The major methodological development of this thesis—primarily the work published in Paper 1 (8) and 2 (15)—provides new methods and a protocol for systematic and efficient analysis of biomolecular simulations. We advocate for data-driven analysis, and for machine learning methods in particular, to aid human interpretation of biomolecular simulations. Our approach is generalizable and can become useful for complex simulation studies of various biomolecular systems. The last two papers also feature new methodological developments. Paper 3 (10) introduces improvements to the string method with swarms of trajectories. In Paper 4 (11), we present an enhanced sampling method which thoroughly samples a stabilized state, while avoiding the cost of exploring the entire conformational ensemble.

In Papers 3 and 4, we further utilize the methodological advancements to study the β2 adrenergic receptor—a prototypical GPCR and drug target for asthma medication. We derive the conformational ensemble and free energy along the receptor's activation pathway, and identify key microswitches that are directly or allosterically controlled by orthosteric ligands. These new approaches lead to results that are compatible with previous simulation and experimental studies, obtained at a comparatively low computational cost. Our work also provides new insights into the molecular basis of allosteric communication in membrane proteins, and may aid the design of novel GPCR drugs.

2) Proteins

2.1) Structures

A protein’s structure governs its function. At the most basic level, amino acids form the fundamental building blocks of any protein. All amino acids share a similar structure with a common backbone and a unique side chain. The backbone comprises an amino (NH2) and a carboxyl (COOH) functional group. The side chains are polar, hydrophobic or charged, and vary in size. Some amino acids contain a titratable group and change their charge at different pH. Aspartic acid and glutamic acid are negatively charged at neutral pH but can become neutrally charged by gaining a proton at lower pH. Lysine, arginine and histidine have basic side chains and are positively charged at neutral pH.

An organism's genome ultimately determines a protein's amino acid sequence. Although only twenty-one amino acids are present in the human body, the human proteome contains many thousands of proteins (19). For a protein to function, the chain of amino acid residues must fold into a three-dimensional structure. There are different levels of structure, where the primary structure is the previously mentioned sequence of amino acids (Fig. 1a). The secondary structure can be an α-helix or a β-sheet (Fig. 1b). Side chain interactions between non-neighboring residues stabilize the three-dimensional structure referred to as the tertiary structure. Finally, several protein subunits can interact to form a quaternary structure (Fig. 1b).

Structures can be determined with various experimental methods, two of the most common being X-ray crystallography and cryogenic electron microscopy (cryo-EM). In X-ray crystallography, proteins are first crystallized into a three-dimensional lattice—an extremely challenging process for many proteins. By capturing the diffraction pattern of a high-intensity X-ray beam against the resulting crystal, it is possible to reconstruct the entire protein structure at high resolution (20). With cryo-EM, protein samples are rapidly cooled to low, cryogenic temperatures and embedded in vitreous ice. The protein samples are sprayed onto a grid and photographs are taken with an electron microscope. Finally, a three-dimensional structure can be reconstructed from the many projections of proteins in random orientations captured in the photographs (20). Advances in cryo-EM have over the last decade led to a steep increase in the number of solved protein structures. The method has an advantage over X-ray crystallography since it does not require the samples to crystallize and allows proteins to maintain their near-native states. On the downside, the cryo-EM structures published to date are typically of lower resolution.

Figure 1: The four levels of protein structure using the β2 adrenergic receptor as an example. (a) The polypeptide chain of amino acids shown by their one-letter code (21). (b) Two proteins, the receptor (red) bound to the agonist BI-167107 and a G protein mimicking nanobody (orange), form a quaternary structure. At the secondary level, the receptor consists of seven transmembrane α-helices, while the nanobody comprises β-sheets and disordered regions. The surrounding POPC lipid bilayer is shown in dark gray and the solvent in light gray.

Computational methods for structure prediction become decent alternatives when no experimental structure is available. For example, a protein homology model can be inferred from another existing experimental structure with a similar genetic sequence (22). Recent methodological breakthroughs in computational structure prediction from mere genetic sequence (23) also show promise as a future alternative source of protein structures. Despite methodological advances, however, many structures are yet to be determined, and protein folding remains a partially unsolved problem.

2.2) Dynamics

Proteins are not static, but flexible molecules (1). To understand a protein's function, we need to take into account its full conformational ensemble, not just a single snapshot of its structure. At the atomistic level, atomic bonds vibrate at a femtosecond timescale due to thermal fluctuations. Larger conformational rearrangements, such as a protein's transitions between different functional states, are slower processes, ranging from the millisecond timescale and upwards (24). An important factor often driving protein conformational change is the mechanism of allosteric signaling. A perturbation at one site, such as the binding of a molecule, can trigger an allosteric response, such as enzyme catalysis or a change in protein structure at the tertiary or quaternary level. The residues that propagate the information between the allosteric sites form an allosteric pathway.

Spectroscopy techniques are commonly used to study protein dynamics. All spectroscopy methods build on the interaction between molecules and electromagnetic radiation to infer molecular behavior as a function of changes in radiation wavelength. In general, spectroscopy methods either measure the absorption or emission of radiation, the luminescence or fluorescence of matter, or scattering of photons against a sample (20). Notable techniques that have provided useful insights for the biomolecular systems studied in this thesis include: Nuclear magnetic resonance (NMR) spectroscopy (25–27), double electron-electron resonance (DEER) spectroscopy (28), and single-molecule fluorescence resonance energy transfer (29).

Experimental techniques generally resolve timescales much longer than the fastest molecular rearrangements, and are limited to studying the dynamics of only a few probes, not the entire protein structure. A complementary approach that can overcome these limitations is molecular simulations, which we will cover in Chapter 3.

2.3) Membrane proteins

For a protein to function, it must reside in its native environment, for example, in a solvent at a certain pH. Membrane proteins, as their name suggests, are located in the cell membrane. Historically, they have been inherently challenging targets for structure determination, primarily because they need to be crystallized in a hydrophobic environment to prevent protein denaturation. Since the membrane lipid bilayer plays a key role in protecting the cell, membrane proteins have evolved to govern the transport of molecules and signaling between cells. Much of the communication through the cell membrane is allosteric and mediated via proteins. Transmembrane proteins can have, in addition to a hydrophobic domain embedded in the bilayer, one extracellular and one intracellular domain. Many of them exemplify how large-scale conformational changes are important for function. For example, ion channels in the cell membrane, which play an important role in the nervous system, switch between closed and open states, allowing ions to pass through the channel down a voltage gradient. Since transmembrane proteins play a key role in cellular regulation and are accessible directly from the extracellular medium, they are interesting drug targets.

2.4) G protein-coupled receptors

With over 800 members in the human genome, G protein-coupled receptors (GPCRs) constitute a large group of proteins. All GPCRs have seven transmembrane helices, an extracellular N-terminus and an intracellular C-terminus (Fig. 1 and 2a). These proteins are involved in many important signaling pathways and are hence important drug targets. More than 30% of all FDA approved drugs act on GPCRs, with more drugs designed for previously untargeted GPCRs currently undergoing clinical trials (30). Breakthroughs in GPCR structure determination over the last decades have paved the way for understanding receptor signaling at a molecular level (31–36). Spectroscopy (25–29, 37, 38) and simulation studies (17, 39–48) have further shed light on the receptor dynamics.

Figure 2: Structure and microswitches of the β2 adrenergic receptor (β2AR). (a) A molecular dynamics (MD) snapshot of the β2 adrenergic receptor in complex with adrenaline in an active-like state (simulation starting from PDB 3P0G). The vignettes show the conformations of residue pairs reflecting important microswitches in the active and inactive structures 3P0G (color) and 2RH1 (white): the transmembrane helix 5 (TM5) bulge (red), measured as the closest heavy atom distance between S207(5.46) and G315(7.41); the connector region’s conformational change (pink), measured as the difference in root-mean-square deviation (RMSD) between the active and inactive state structure of the residues I121(3.40) and F282(6.44); the Y-Y motif (black), measured as the C-ζ distance between Y219(5.58) and Y326(7.53) of the NPxxY motif; and the Ionic lock displacement (orange), measured as the closest heavy atom distance between E268(6.30) and R131(3.50). (b) β2AR ligands examined in Paper 4: agonists BI-167107 and adrenaline; biased partial agonist salmeterol; antagonists timolol and alprenolol; and the inverse agonist carazolol. Atoms are colored according to their partial charge. Copied from Paper 4.

Molecules, or ligands, outside the cell, such as hormones or neurotransmitters, bind to the receptor's orthosteric site, which influences conformational change in the transmembrane and intracellular region (Fig. 2). GPCRs are sometimes described as two-state switches, where an agonist ligand flicks the receptor into its active state, enabling G protein coupling and triggering an intracellular response. In the ligand-free apo state, a GPCR can still access active-like conformations, and thus exhibits a small degree of signaling, a process termed basal activity. Antagonists occupy the orthosteric site without affecting basal activity, preventing agonists from binding. An inverse agonist has the opposite effect of an agonist and reduces receptor activity.

The aforementioned model fails to capture several important features of receptor signaling and ligand efficacy. Primarily, it does not account for the existence of multiple intracellular partners associated with different cellular responses. Full agonists activate all signaling pathways, whereas partial agonists activate G protein coupling to a lesser extent. Some agonists favor only one type of intracellular binding partner, a phenomenon termed biased signaling. Understanding how ligands control receptor signaling, and how molecular interactions in the orthosteric site couple to the G protein binding site can have a significant impact on the development of novel drugs. For example, compared to full agonists, drugs with biased signaling properties may have reduced side effects associated with alternative pathways (49).

We focus on the β2 adrenergic receptor (β2AR), a prototypical GPCR with adrenaline (epinephrine) as its cognate ligand. The β2AR can couple to both Gs proteins, which induces a downstream intracellular cyclic adenosine monophosphate (cAMP) response, and arrestins, which regulate kinase activation and endocytosis (50). Since the β2AR is a common drug target that has been structurally resolved at high resolution, it is an excellent candidate to study with computational methods.

A mixture of methods, ranging from the molecular level to clinical studies, is important to improve our understanding of receptor activation. The biochemical and pharmaceutical scope of this thesis mainly revolves around this problem, and attempts to shed light on how ligands modulate the β2AR's conformational ensemble. We do so by analyzing small molecular rearrangements, termed microswitches, throughout the receptor. Examples of microswitches include the displacement of transmembrane helix 5 (TM5) near the orthosteric site, the rotameric switch of residues I121(3.40) and F282(6.44) (the numbers in parentheses refer to the Ballesteros–Weinstein residue numbering scheme) (51) in the transmembrane domain, and the side chain distance between the two tyrosines Y219(5.58) and Y326(7.53) (Fig. 2a). Series of microswitches form allosteric pathways and constitute the basis for ligand-specific signaling. Because of its structural similarity to recently resolved class A receptor structures (10), many insights derived from the β2AR are likely transferable to other GPCRs.

3) Molecular dynamics (MD) simulations

3.1) Principles of MD

Molecular dynamics (MD) is a computational method that simulates the movement of particles. Given an initial configuration of particle positions and velocities, the particles will follow trajectories defined by the system's equations of motion. For systems with many particles, often hundreds of thousands, the equations of motion cannot be solved analytically. MD finds a numerical solution by computing the interactions between particles at a specific point in time, then performing numerical integration, often referred to as taking a timestep, to update the particle positions. The algorithm is performed iteratively to reach the timescales of interest (Fig. 3a). A long timestep will allow the system to evolve using less computational power, but at the cost of a higher numerical error. To prevent the error from growing uncontrollably, the timestep must be shorter than the period of the fastest oscillation in the system (52).
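To make the loop in Fig. 3a concrete, the sketch below implements the compute-forces/integrate/repeat cycle using the velocity Verlet integrator and a generic force function. It is a minimal illustration under simplified assumptions (no thermostat, no periodic boundaries), not the algorithm of a production engine such as GROMACS; the force function, units and parameters are placeholders.

```python
import numpy as np

def velocity_verlet(positions, velocities, masses, forces_fn, dt, n_steps):
    """Minimal MD integration loop (velocity Verlet).

    positions, velocities: (N, 3) arrays; masses: (N,) array;
    forces_fn: callable returning an (N, 3) array of forces.
    """
    forces = forces_fn(positions)
    for _ in range(n_steps):
        # Half-step velocity update followed by a full position update
        velocities += 0.5 * dt * forces / masses[:, None]
        positions += dt * velocities
        # Recompute forces at the new positions and complete the velocity update
        forces = forces_fn(positions)
        velocities += 0.5 * dt * forces / masses[:, None]
    return positions, velocities

# Toy usage: two particles coupled by a harmonic restoring force
spring = lambda x: -1.0 * (x - x[::-1])
pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
vel = np.zeros_like(pos)
pos, vel = velocity_verlet(pos, vel, np.array([1.0, 1.0]), spring, dt=0.01, n_steps=1000)
```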

In classical atomistic MD simulations, the method used throughout this thesis, atoms are represented as particles with a fixed mass and charge. The interaction forces between particles are provided by a so-called force field. The force field specifies the close-range bonded interactions, such as covalent bonds within molecules, and non-bonded forces, including electrostatic and van der Waals interactions. For an MD simulation to model the evolution of a real physical system, the forces must be accurate, yet they must be fast to evaluate numerically. The most renowned model for short-range pairwise intermolecular forces is arguably the Lennard-Jones potential,

$$U(r) = 4\epsilon\left[(\sigma/r)^{12} - (\sigma/r)^{6}\right],$$

where $r$ is the interparticle distance and $\sigma$ and $\epsilon$ are particle-specific parameters roughly corresponding to the size of the particles and the strength of the interaction. An important term in the long-range interaction between charged particles is the Coulomb potential, $U(r) \sim q_1 q_2 / r$, where $q_1$ and $q_2$ are the magnitudes of the particle charges. After decades of development, force fields such as CHARMM (used for the work in this thesis) (53), AMBER (54) and OPLS-AA (55) approximate the energetics of many biological systems with decent accuracy. Since force fields define much of the physics in a simulation, they are an important factor for the success of MD.
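As a small numerical illustration of these two terms, the snippet below evaluates the Lennard-Jones and Coulomb energies for a single particle pair. The sigma, epsilon and charge values are made up for the example; the Coulomb prefactor is the electric conversion factor in kJ mol^-1 nm e^-2, matching GROMACS-style units.

```python
def lennard_jones(r, sigma, epsilon):
    # U(r) = 4*eps*[(sigma/r)^12 - (sigma/r)^6]
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def coulomb(r, q1, q2, ke=138.935458):
    # U(r) = ke * q1 * q2 / r, with ke = 1/(4*pi*eps0) in kJ mol^-1 nm e^-2
    return ke * q1 * q2 / r

r = 0.35  # interparticle distance in nm (hypothetical)
energy = lennard_jones(r, sigma=0.32, epsilon=0.65) + coulomb(r, q1=-0.5, q2=0.25)
print(f"Pair interaction energy: {energy:.2f} kJ/mol")
```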

The basic algorithm outlined in Fig. 3a simulates an ensemble of particles at fixed energy. The MD algorithm is often modified to mimic more realistic physical conditions, such as the isothermal–isobaric NpT ensemble with constant pressure and temperature, or the canonical NVT ensemble in a simulation box of fixed volume. For example, a thermostat can modulate the particle velocities and ensure that the distribution of velocities follows the Maxwell–Boltzmann distribution, thus introducing thermal fluctuations into the system. Thermostats are often stochastic, and as a result individual particle trajectories are not deterministic.

Figure 3: Principles of Molecular Dynamics (MD). (a) A simplified flow chart of the MD algorithm. (b) Cartoon illustrating the distance between two particles in a system with periodic boundary conditions. The distance used to compute the interparticle force is not necessarily the distance within a box (red dashed arrow). The black solid arrow represents the shortest distance across periodic boundaries.

Despite its relatively simple underlying principles, developing an accurate and scalable implementation of MD is extremely challenging. The open-source software GROMACS (56) is one of the fastest and most popular MD packages, and the simulation engine for all projects in this thesis.

3.2) Conformational state populations and free energy landscapes

As mentioned in the previous section, a single MD trajectory is often stochastic in nature and depends on the initial conditions. The purpose of MD is, however, usually not to generate perfect atom trajectories, but to sample configurations from the correct physical ensemble. A system at zero temperature will relax into a conformational state that corresponds to a local minimum of the potential energy surface, $U$. At higher temperatures, where entropy, $S$, becomes a driving factor, the system will escape its initial state due to thermal fluctuations. Given a sufficiently long trajectory, the system will visit all accessible states at random, regardless of the initial conditions. We say that the system is ergodic, and the outcome of the simulation is governed by the free energy landscape, the thermal equivalent of the potential energy surface (see Fig. 4b and 10 for examples of free energy surfaces). The free energy at temperature $T$ is defined as

$$G = U - TS.$$

To obtain the reversible work to transition from a state $s_i$ to a state $s_j$, one computes the corresponding difference in free energy,

$$\Delta G = G(s_j) - G(s_i).$$

The relative probability to visit a state, $p(s)$, is given by the Boltzmann relationship,

$$p(s_i)/p(s_j) = \exp(\Delta G / k_B T),$$

where $k_B$ is the Boltzmann constant.
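As a worked example of the Boltzmann relationship above, the snippet below converts a hypothetical free energy difference between two states into a relative population at 300 K.

```python
import numpy as np

kB = 0.0083144621   # Boltzmann (gas) constant in kJ/(mol K)
T = 300.0           # temperature in K

G_i, G_j = 0.0, 5.0          # hypothetical free energies of s_i and s_j in kJ/mol
delta_G = G_j - G_i          # reversible work to go from s_i to s_j
ratio = np.exp(delta_G / (kB * T))   # p(s_i) / p(s_j)
print(f"State s_i is populated {ratio:.1f} times more than s_j")
```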

A correct estimate of $p(s)$, or equivalently of $\Delta G(s)$, requires an ergodic trajectory from which we can compute the population of state $s$ relative to all other states. The timescale necessary to reach ergodic sampling is system-dependent and ultimately governed by the kinetics of the system: how fast it diffuses on the free energy surface and the height of the free energy barriers separating different states. Note that the systems considered in this thesis are governed by the laws of classical physics and hence sample a continuous free energy landscape. The definition of a state is not necessarily straightforward, and data-driven methods may be necessary to identify state boundaries (Chapter 5 and Papers 3 and 4).

Assuming that the states are well-defined, there are multiple ways to compute $p(s)$ from an MD trajectory. The most straightforward option is to count the number of simulation snapshots in one state, divided by the total number of simulation snapshots. Another approach is to count the transitions between all states and construct a transition probability matrix $T$. The entries $T_{ij}$ are estimated from the normalized number of transitions, $N_{ij}$, between the states $i$ and $j$:

$$T_{ij} = N_{ij} / \sum_k N_{ik}.$$

The first eigenvector of the transition matrix is the stationary probability distribution, $\rho$, and its entries provide the occupation probability of the individual states, i.e. $\rho_i = p(s_i)$. The first eigenvalue of $T$ is always unity, whereas the higher eigenvalues provide information about the kinetics of the system, such as the mean first passage time between states. For a physical system at equilibrium, the transition matrix obeys detailed balance so that:


$$\rho_i T_{ij} = \rho_j T_{ji}.$$

In practice, the detailed balance condition may be violated by insufficient sampling. To circumvent this issue, the count matrix, $N$, can be replaced by its symmetrized version (57),

$$N' = \tfrac{1}{2}\left(N + N^T\right).$$

The two approaches to estimate $p(s)$ are equivalent when applied to an ergodic trajectory with multiple transitions between states. The transition-based methods, however, also work for datasets of multiple trajectories starting from different states.
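A minimal sketch of the transition-counting approach could look as follows, assuming the trajectory has already been discretized into integer state labels. It builds the count matrix, symmetrizes it to enforce detailed balance, row-normalizes it into a transition matrix, and extracts the stationary distribution as the eigenvector with eigenvalue one. Dedicated MSM packages perform this estimation more robustly.

```python
import numpy as np

def stationary_distribution(state_traj, n_states):
    """Estimate state populations p(s) from a discrete state trajectory."""
    N = np.zeros((n_states, n_states))
    for a, b in zip(state_traj[:-1], state_traj[1:]):
        N[a, b] += 1                          # count observed transitions
    N = 0.5 * (N + N.T)                       # symmetrize to enforce detailed balance
    T = N / N.sum(axis=1, keepdims=True)      # row-normalize into a transition matrix
    evals, evecs = np.linalg.eig(T.T)         # left eigenvectors of T
    rho = np.real(evecs[:, np.argmax(np.real(evals))])
    return rho / rho.sum()                    # stationary distribution, rho_i = p(s_i)

# Toy trajectory hopping between three states
traj = [0, 0, 1, 1, 1, 2, 1, 0, 0, 1, 2, 2, 1]
print(stationary_distribution(traj, n_states=3))
```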

3.3) Limitations and challenges of MD

Without an accurate force field, the output of an MD simulation is of little value. All classical force fields are, however, simplifications of reality (58). For example, they neglect quantum effects, and, with a few exceptions such as the Drude force field (59), the polarizability of molecules. Force fields based on more complex physical models require more computational resources and are thus non-viable options for practical applications. An off-the-shelf force field does not provide parameters for all types of molecules. To include, for example, a novel ligand in a simulation, the molecule has to be parameterized (60), which in itself may be a hurdle. In short, there is no universal force field, and researchers have to select the best approximation for their systems that also allows them to reach the timescales of interest.

Atomistic protein simulations typically take timesteps on the femtosecond timescale. It thus requires one trillion timesteps to reach the millisecond timescale, at which many biological processes occur. Despite software improvements and the use of Graphical Processing Units (GPUs) (56) to increase the throughput of MD, these timescales are hardly attainable using commodity hardware, even for small systems. High performance computing, where the computations run in parallel on several nodes, and the development of special-purpose hardware (61) have further pushed the limits of MD. State-of-the-art supercomputers are not accessible to everyone, and even with special-purpose hardware at hand, simulations are rarely long enough for an accurate estimate of the free energy landscape. Even if an unforeseen dramatic increase in performance were to occur, it is always possible to expand the size and complexity of the system, thereby increasing the computational cost per simulation timestep. Thus, the timescale issue will probably always remain.

As the output from MD simulations grows, an arguably overlooked aspect of many simulation protocols is the post-processing needed to analyze such amounts of data. A system with tens to hundreds of thousands of atoms is extremely high dimensional. The number of pairwise interactions grows quadratically with the number of particles, and these interactions occur at every snapshot in time. Thus, extracting useful information and distinguishing signal from noise in an MD trajectory is no easy feat.

The methodological development of this thesis, primarily the work published in Paper 1 and 2, aims to derive results more efficiently than brute-force MD, and to analyze the results in a data-driven manner. As a result, less computational power is needed to perform simulation studies of biomolecular systems, and more information can be extracted from the resulting MD trajectories.

3.4) Simulations of membrane proteins

Simulations of biological systems require a careful setup to mimic in vivo conditions. A prerequisite to capture protein conformational change is a starting structure (Chapter 2.1). Many structures have an artifact of some sort that needs to be adjusted for. Examples include reverting a mutation, modeling a flexible domain not resolved in the experimental structure, or setting the protonation state of titratable groups.

A protein must also be embedded in a proper environment. The solvent is typically modeled by a relatively large number of water molecules and ions at physiological concentrations. Modeling the membrane bilayer may be more or less complicated. In Papers 3 and 4 we used a homogeneous POPC (1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine) bilayer (Fig. 1b) (62), which arguably does not reflect a real cell membrane. Lipid molecules have been shown to modulate protein function (63, 64), and hence the membrane composition should not be overlooked. In Paper 2, we provide a method that further demonstrates the role of the membrane as a link in transmembrane allosteric communication.

MD simulations usually run within a box of finite size. Periodic boundary conditions make an atom that leaves the box reappear on the opposite side (Fig. 3b). For performance reasons, the system should contain as few atoms as possible, yet it has to be large enough to avoid artifacts induced by the artificial boundary conditions. In practice, the size of the system grows with the size of the protein. Our simulations of the β2 adrenergic receptor contained approximately 70 000 atoms, whereas systems with larger proteins, such as transporters and ion channels, usually amount to hundreds of thousands of atoms.
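The minimum-image convention illustrated in Fig. 3b can be expressed in a few lines. The sketch below computes the shortest distance between two particles across the periodic boundaries of an orthorhombic box; the coordinates and box size are hypothetical.

```python
import numpy as np

def minimum_image_distance(x1, x2, box):
    """Shortest interparticle distance under periodic boundary conditions
    (orthorhombic box; box holds the edge lengths)."""
    d = x2 - x1
    d -= box * np.round(d / box)   # wrap each component to the nearest periodic image
    return np.linalg.norm(d)

box = np.array([8.0, 8.0, 8.0])   # box edge lengths in nm (hypothetical)
print(minimum_image_distance(np.array([0.5, 0.5, 0.5]),
                             np.array([7.8, 0.4, 0.6]), box))   # ~0.71, not ~7.3
```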

Given a properly configured system, MD evolves it over time. Provided enough time, the protein will transition between its different functional states. However, because of the slow diffusivity of the membrane and the timescales involved in transmembrane communication, the limitations of MD outlined in the previous section apply in particular to membrane proteins. Thus, a simple plug-and-play approach to simulating the molecular dynamics of membrane proteins is infeasible for most applications.

4) Extending and improving enhanced sampling protocols

4.1) Types of enhanced sampling

Enhanced sampling techniques have emerged to reduce the computational cost of standard MD. With enhanced sampling, a bias is introduced into the system to accelerate exploration of the conformational ensemble (2). Theoretically correct results, corresponding to those that would be derived by a sufficiently long standard MD trajectory, are typically obtained during post-processing.

There are numerous enhanced sampling methods serving different purposes, using a variety of approaches to bias the simulations. Many popular methods modulate the interatomic forces with an artificial bias potential. If the bias is adaptive and incremental, the system remains approximately at equilibrium throughout the simulation, and the height of the bias potential corresponds to free energy barriers. The accelerated weight histogram method (AWH) (65, 66), accessible in GROMACS, is one method using this approach to enhance sampling. Metadynamics (67, 68) and adiabatic bias MD (69), both part of the PLUMED (70) software package, are others. Out-of-equilibrium techniques drive the system toward a certain target state, and can provide free energy estimates via Jarzynski's equation (52, 71). Examples include targeted and steered MD, where a relatively strong external force either pushes a protein toward a target structure or along a certain degree of freedom.

Many proteins transition between two known states, for example, the active and inactive states of a GPCR. Certain enhanced sampling methods are specifically designed to study such transitions. Examples include transition path sampling (72, 73) and the string method with swarms of trajectories (3, 10, 74), which will be described in the next section. These methods are typically computationally inexpensive, since they only derive a one-dimensional path in a high-dimensional conformational landscape. A limitation is that they only find one transition path, whereas proteins may proceed via several physiologically relevant pathways.

The methods discussed so far require a bias, or a pathway, to be tracked along one or many collective variables (CVs), also referred to as reaction coordinates. Because of the restraining interatomic forces and steric hindrance, the evolution of an MD trajectory occurs on a manifold of much lower dimensionality than all the degrees of freedom in the system. The ideal CVs span this manifold, describe the position of the system on it, and reflect the system's slowest degrees of freedom. With a proper choice of CVs, enhanced sampling will efficiently aid the system to overcome free energy barriers and accelerate exploration of the conformational ensemble. An improper choice of CVs leads to poor convergence or incorrect results, since the omitted slow degrees of freedom are not sampled properly. Many enhanced sampling techniques need to explore all CVs to reach convergence. Because the computational cost increases with every biased degree of freedom, such methods typically do not converge for more than three CVs unless the system is small. Even a good choice of CVs may thus need to be compressed into a smaller set to function in practice.

Computing the correct kinetic rates for transitions between states, as well as the state populations, typically requires an extensive set of simulation data. Markov state models (MSMs) (75, 76) have been the most popular and successful framework for this purpose. Usually, MSMs are constructed from shorter independent simulations, which allows for massively parallelized computations (17). The final model contains all conformational states linked by their transition rates.

Note that the methods also vary in their complexity. Some, such as steered MD, are relatively simple. Others are more complicated but show great promise, such as the machine learning-based Boltzmann generators (77). Many enhanced sampling techniques do not require a CV, for example, temperature replica exchange methods (52, 78), where a system exchanges conformations with a replica at higher temperature that has an increased probability of crossing free energy barriers. Just as for CV-based enhanced sampling, limitations apply to replica exchange methods as well (79). For example, every replica incurs a computational cost, while only the room temperature replica samples the correct ensemble. A thorough review of all enhanced sampling methods is out of scope for this thesis. Instead, we dedicate the rest of this chapter to the methods used in Papers 3 and 4.

4.2) The string method with swarms of trajectories

The string method with swarms of trajectories, originally developed by Pan and Roux in 2008 (3), finds the most probable transition path (the string) between two endpoints in a space spanned by a set of CVs. Provided an initial pathway, several short trajectories are launched from points distributed along the string. A set of short trajectories starting from the same point is called a swarm (hence the name of the method), and their average drift will follow the gradient of the free energy landscape. In one iteration of the method, the points are displaced according to the drift, giving a new string closer to the most probable transition path. To prevent all points from drifting toward the closest basin of the free energy landscape, the string must be reparameterized to keep the points evenly distributed along the string. Finally, the configurations are equilibrated at the reparameterized points' positions before new swarms are launched. This procedure is carried out iteratively (Fig. 4a and b) until convergence, at which point the string does not drift between subsequent iterations.
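A schematic outline of one such iteration, with the MD swarms replaced by a placeholder function returning trajectory endpoints in CV space, is sketched below. It only illustrates the drift and reparameterization steps; the actual protocol additionally equilibrates the configurations with restraints and runs the swarms with GROMACS.

```python
import numpy as np

def string_iteration(points, run_swarm, n_trajs=20):
    """One iteration of the string method with swarms of trajectories (sketch).

    points: (n_points, n_cvs) array of CV coordinates along the string.
    run_swarm: callable taking a point and returning the endpoint of one short
               trajectory in CV space (placeholder for an actual MD swarm member).
    """
    points = np.asarray(points, dtype=float)
    # 1. Launch a swarm from every point; the mean endpoint estimates the drift
    drifted = np.array([np.mean([run_swarm(p) for _ in range(n_trajs)], axis=0)
                        for p in points])
    drifted[0], drifted[-1] = points[0], points[-1]   # 2. keep the endpoints fixed
    # 3. Reparameterize: redistribute the points evenly along the new curve
    seg = np.linalg.norm(np.diff(drifted, axis=0), axis=1)
    arc = np.insert(np.cumsum(seg), 0, 0.0)
    target = np.linspace(0.0, arc[-1], len(points))
    return np.column_stack([np.interp(target, arc, drifted[:, dim])
                            for dim in range(points.shape[1])])

# Toy usage: short "trajectories" drift toward a minimum at the origin
rng = np.random.default_rng(0)
toy_swarm = lambda p: 0.9 * p + 0.05 * rng.normal(size=p.shape)
string = np.linspace([-1.0, -1.0], [1.0, 1.0], 10)
for _ in range(50):
    string = string_iteration(string, toy_swarm)
```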

Figure 4. (a) Reparameterization of the string. From the swarm trajectories (black curves in i) originating from their corresponding starting points (black circles in i) on the string (dashed pink line), a drift can be computed (gray arrow in ii) and the points displaced to their starting positions for the next iteration (black circles in ii). (b) The string method with swarms of trajectories finds the most probable transition path between two fixed endpoints, as illustrated here for a simple two-well model. To increase the resolution of the minimum free energy path at a minimal computational cost, the number of points on the string, from which the swarms are launched, is gradually incremented, and the lengths of the arcs on the string are set to maximize the transitions between neighboring points. Light gray areas are of lower free energy, whereas darker shades represent a higher free energy. The black area has not been explored by the string and is thus of unknown free energy. Subfigures (b) and (c) have been adapted from Paper 3.

In general, trying to estimate long-timescale dynamics from short trajectories leads to limitations (80). One shortcoming of the string method is that it can only identify one out of multiple activation pathways. The short trajectories cannot drift across high free energy barriers, and thus the converged string depends on the initial guess. To remedy this, the initial pathway should be chosen with care, for example by targeting intermediate states identified experimentally. Ideally, one should also converge several pathways from different starting conditions. The swarms of trajectories method also shares many limitations with other CV-based methods. However, unlike many other enhanced sampling techniques, the string is inherently one-dimensional, and its convergence does not require restricting the description to only a few CVs.

Since the swarm trajectories capture the local gradient of the free energy landscape, it is possible to estimate the full free energy landscape within the string's vicinity. By discretizing the system along some variable, one can construct a transition matrix from the swarm trajectories' start- and endpoints, and derive the stationary probability distribution along the pathway. Only swarms from the final iterations, when the string samples an equilibrium distribution, should be included, while trajectories sampled during the first iterations must be discarded (74). Although the transition matrix's stationary probability distribution is the same at any timescale, the trajectories are too short to estimate the kinetics of the system. To capture the timescales between metastable states, the points can be used as starting configurations for longer simulations analyzed with an MSM framework (81). By itself, the swarms of trajectories method provides insights into both the molecular mechanism and the energetics governing the transitions between states.

4.3) Kinetically trapped state sampling

In certain cases, we wish to only sample a stabilized state accessible starting from a protein structure, while avoiding exploration of the entire conformational ensemble. In Paper 4, we have developed a CV-based method for this purpose. Despite its long name, kinetically trapped state sampling (or simply single state sampling) is a simple method. It is a close relative to the swarms of trajectories method, but with just one point on the string and with longer swarm trajectories. Single state sampling can also be seen as a sophisticated equilibration method.

A single swarm is launched from a starting structure and allowed to evolve for a fixed period of time. First, the center point of the swarm's endpoint coordinates, $c$, and the distance, $r_i$, from every endpoint, $i$, to the center, are computed in CV space. Next, a weight,

$$w_i = \exp\left(-\frac{|r_i - c|^2}{d^2}\right),$$

is assigned to every replica, $i$. The replicas are extended with $w_i / \sum_j w_j$ copies each. The total number of trajectories is kept fixed, and thus trajectories near the center point will be multiplied while outlier trajectories will be terminated. This process is carried out iteratively. Convergence is reached when the trajectories diffuse around a stationary center point, thoroughly sampling a single state.
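The resampling step can be sketched as below; the Gaussian width d and the use of multinomial resampling to keep the number of trajectories fixed are illustrative choices, not necessarily the exact scheme used in Paper 4.

```python
import numpy as np

def resample_swarm(endpoints, d, rng=None):
    """One resampling step of single state (kinetically trapped state) sampling.

    endpoints: (n_trajs, n_cvs) array of trajectory endpoints in CV space.
    d: width of the Gaussian weight.
    Returns the number of continuations assigned to each trajectory.
    """
    if rng is None:
        rng = np.random.default_rng()
    center = endpoints.mean(axis=0)                      # swarm center, c
    dist2 = np.sum((endpoints - center) ** 2, axis=1)    # |r_i - c|^2
    weights = np.exp(-dist2 / d ** 2)
    weights /= weights.sum()
    # Trajectories near the center are multiplied, outliers are terminated,
    # while the total number of trajectories stays constant
    return rng.multinomial(len(endpoints), weights)
```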

Single state sampling is not designed to explore the entire conformational landscape of a protein, nor to estimate the overall stability of states. Unlike a single MD simulation, the outcome is robust to the stochasticity of MD. It is an excellent choice for analyzing the effect of perturbations on a starting structure, such as the binding of different ligands to a receptor. Like the swarms of trajectories method, however, it requires good CVs to function, a problem we will inquire into in the next chapter.

5) Data-driven analysis of MD trajectories

As MD simulations reach longer timescales, and researchers study systems of ever-growing size, the output trajectories increase in size too. Ordinary simulations nowadays generate terabytes of data that need to be analyzed. Seeing that the human mind is poor at interpreting high-dimensional data, mere visual inspection of the atom coordinates is not a viable standalone post-processing strategy. In geology, epidemiology and many other areas, a map is a powerful tool for displaying data in an easily interpretable image (Fig. 5a). Unfortunately, within the field of biomolecular simulations, the maps are not well-defined. Projecting the results along some intuitively chosen variable, such as the root-mean-square deviation (RMSD) with respect to an experimental structure or a scientist's favorite interresidue distance, may lead to other important degrees of freedom being overlooked and potentially introduce a human bias when interpreting data.

This is far from the only field struggling with processing large amounts of data—it is a challenge across many technological and scientific areas. Data-driven methods, and machine learning in particular, have emerged as a popular approach to leverage big data in nearly every scientific domain (4–7). In the following sections, we describe how data-driven analysis can be applied to molecular simulations. Although machine learning has come to prominence for its predictive power, it can also enhance our understanding of high-dimensional datasets.

5.1) Dimensionality reduction methods for visualization

A system with $N$ particles can form $N^2$ pairwise interactions and has $3N$ degrees of freedom. However, as described in Chapter 4, the evolution of a biomolecular trajectory occurs on a manifold of much lower dimensionality.

There are various dimensionality reduction techniques for projecting high-dimensional data onto a space spanned by a few compressed variables. Established methods rely on different assumptions about hidden structures in the data, and their success depends to a large extent on the problem at hand. One popular method is Principal Component Analysis (PCA; Fig. 5b) (82), which during training identifies an orthogonal basis of principal components (PCs). The PCs, which are linear transformations of the input variables (also referred to as the input features, or simply features, in this context), explain the maximum amount of variance in the data (Fig. 6a). To perform dimensionality reduction, the input features are projected onto the first PCs. Another example is multidimensional scaling (MDS; Fig. 5c) (83), which aims to preserve the pairwise distances between points of the original input space in the compressed space. Unlike PCA, MDS is a nonlinear method, and the resulting transformation from the input variables to the compressed coordinates is a complex mathematical expression. More details and applications of the previously mentioned methods can be found in Papers 1 and 4. A thorough review of all existing dimensionality reduction methods is unfortunately out of scope for this thesis.
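As an illustration, both projections can be obtained with a few lines of scikit-learn. The feature matrix below is random placeholder data standing in for, e.g., inverse inter-residue distances computed from simulation frames.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

X = np.random.rand(300, 200)     # placeholder: (n_frames, n_features)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)                 # projection onto the first two PCs
print(pca.explained_variance_ratio_)         # variance explained per component

mds = MDS(n_components=2, n_init=1, max_iter=100)
X_mds = mds.fit_transform(X)                 # nonlinear, distance-preserving embedding
```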

Figure 5: (a) Maps are real-world examples of dimensionality reduction visualization techniques. The color of a country represents the number of confirmed COVID-19 cases as of March 17th 2020 (84). (b) and (c) Dimensionality reduction techniques applied to β2AR's conformational ensembles. Each point represents a simulation snapshot, colored according to the ligand bound to the receptor. Red dashed lines highlight regions where agonist ligands cluster. The features are computed as the inverse closest heavy atom distances between residues. (b) Principal component analysis (PCA) projection onto the first two principal components (PCs), and (c) multidimensional scaling (MDS). Subfigures (b) and (c) have been adapted from Paper 4.

Dimensionality reduction techniques are powerful visualization tools for MD trajectories. In Fig. 5b and c, we see how the states automatically emerge by dimensionality reduction of β2AR simulations. However, the plots provide little molecular detail. Even though states emerge, dimensionality reduction alone reveals little information about their molecular properties. For this purpose, we need to identify which features are considered important by the dimensionality reduction algorithms.

5.2) Finding important molecular features with machine learning

Machine Learning methods can identify important molecular details in a system. Exactly which features are considered important depends on the choice of method, and the choice of methods is governed by the problem statement at hand. Supervised learning methods are relevant when every simulation frame can be assigned a class label. The labels could, for example, represent protein states, or the class of ligand bound to a receptor (Fig. 5b, c and 6c). Supervised learning models learn to predict these labels from the input features. Similar to the dimensionality reduction techniques introduced in the previous section, unsupervised learning methods (Fig. 6a) find ensemble properties based on an assumption of some hidden structure in the data, and are thus useful if the simulation frames cannot be labelled. Unsupervised learning can, for example, describe protein conformational change in an MD trajectory.

The emerging important features also depend on the choice of machine learning model, and on what technique we apply to make it interpretable. A simple, yet powerful, supervised method for estimating the importance of a feature $s$ is to compute the Kullback-Leibler (KL) divergence (85), also referred to as the relative entropy,

$$D(P \,\|\, Q) = \sum_{s \in S} P(s) \log \frac{P(s)}{Q(s)},$$

between two distributions $P(s)$ and $Q(s)$ (Fig. 6b). A high KL divergence means that $s$ discriminates between the classes and is thus considered important. Another supervised method that is more complex than the KL divergence, yet has a simple foundation, is the Random Forest (RF) classifier (86), which makes a prediction based on the consensus of many randomly instantiated decision trees. The features most often used to split the underlying decision trees are considered important (87, 88). PCA, introduced in the previous section, is an example of unsupervised learning. From the PCs and the eigenvalues, it is straightforward to compute each feature's contribution to the variance.
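A minimal sketch of two of these supervised importance estimates is given below, assuming a feature matrix X with one row per frame and binary class labels y; the histogram bin count, the regularization constant, and the toy data are arbitrary choices for illustration, not the settings used in Paper 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def kl_importance(X, y, bins=30, eps=1e-9):
    """Per-feature KL divergence between the two labelled ensembles."""
    importances = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        lo, hi = X[:, i].min(), X[:, i].max()
        p, _ = np.histogram(X[y == 0, i], bins=bins, range=(lo, hi), density=True)
        q, _ = np.histogram(X[y == 1, i], bins=bins, range=(lo, hi), density=True)
        p, q = p + eps, q + eps              # avoid log(0) and division by zero
        p, q = p / p.sum(), q / q.sum()      # renormalize to bin probabilities
        importances[i] = np.sum(p * np.log(p / q))
    return importances

# Illustrative data: two classes that differ only in feature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = np.repeat([0, 1], 200)
X[y == 1, 0] += 2.0

kl = kl_importance(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("Top feature by KL:", np.argmax(kl), " by RF:", np.argmax(rf.feature_importances_))
```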

Artificial neural networks can be used for both supervised and unsupervised learning. A neural network model comprises a series of connected layers (Fig. 6c). Every node in the network performs a linear transformation of nodes in the previous layer and feeds the result through a non-linear activation function. The shape of the layers and the choice of activation function are static hyperparameters of the network's topology, whereas the weights of the linear transformations are tuned during training.

Figure 6: Examples of machine learning models. (a) Principal component analysis (PCA) converts the input features to an orthogonal set of linearly uncorrelated variables called principal components (PCs) through an orthogonal transformation. The first component covers the largest variance possible. The feature importance is taken to be the components' coefficients of a feature times the variance covered by the components (i.e. the eigenvalues). (b) The Kullback-Leibler (KL) divergence (also called relative entropy) computes the difference between two distributions P(x) and Q(x) along every individual feature. We compute the KL divergence between one class and all other classes as an indication of the importance of a specific feature. (c) A multilayer perceptron (MLP) is a feedforward artificial neural network with fully connected layers. Important features can be derived using layer-wise relevance propagation (LRP). (d) Example of LRP applied to an image of a black box. The colored areas in the rightmost figure highlight the pixels identified as important by a neural network. Subfigures (a)-(c) have been adapted from Paper 1.

A classical example of a supervised neural network is the fully-connected multilayer perceptron (MLP) classifier (Fig. 6c) (85). Two examples of unsupervised networks are the popular autoencoder (89), which learns to compress and reconstruct the input features via a low-dimensional hidden layer, and the Restricted Boltzmann machine (RBM) (90), a generative single-layer network based on a Boltzmann distribution over its visible and hidden units. Despite the last decade's rising interest in deep learning, neural networks have a notorious reputation for functioning like black boxes (4–6, 91). As a result, much research is dedicated to making these black boxes transparent (92, 93). Layer-wise relevance propagation (LRP) (93) is one such technique, originally designed to identify pixels used by neural networks in image classification problems (Fig. 6c and d). In short, the method tracks the values passing through nodes during a prediction, then propagates these values backwards through the network to the input features, giving an estimate of their importance.
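To make the backward redistribution concrete, here is a minimal numpy sketch of the LRP epsilon rule for fully connected layers; the tiny randomly initialized network and the particular rule variant are assumptions chosen for illustration, not the exact implementation used in our papers.

```python
import numpy as np

def lrp_dense(a, W, b, R_out, eps=1e-6):
    """Epsilon-rule LRP for one fully connected layer.

    a     : activations entering the layer, shape (n_in,)
    W, b  : layer weights, shape (n_in, n_out), and biases, shape (n_out,)
    R_out : relevance assigned to the layer's outputs, shape (n_out,)
    Returns the relevance redistributed onto the layer's inputs, shape (n_in,).
    """
    z = a @ W + b                                      # pre-activations
    denom = z + eps * np.where(z >= 0.0, 1.0, -1.0)    # stabilized denominator
    s = R_out / denom
    return a * (W @ s)                                 # R_i = a_i * sum_j W_ij * s_j

# Tiny hand-initialized network: 4 input features -> 3 hidden units -> 2 classes.
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

x = rng.normal(size=4)                                 # features of one frame
h = np.maximum(0.0, x @ W1 + b1)                       # ReLU hidden layer
logits = h @ W2 + b2                                   # class scores

# Start from the score of the predicted class and propagate relevance backwards.
R_logits = np.zeros_like(logits)
R_logits[np.argmax(logits)] = logits[np.argmax(logits)]
R_hidden = lrp_dense(h, W2, b2, R_logits)
R_input = lrp_dense(x, W1, b1, R_hidden)
print("Per-feature relevance:", R_input)
```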

Given a set of MD simulation snapshots to analyze, the raw data needs to be converted to a suitable format before being fed to a machine learning model. This preprocessing step is commonly known as feature extraction, a critical aspect of any machine learning problem. A straightforward approach to extract features from an MD simulation is to append all atoms' Cartesian coordinates into one data point per frame. Hence, if a trajectory comprises N atoms and L frames, the input data will contain L samples and 3N features. For practical applications, the interresidue contact map is often a better choice of features, since the Cartesian coordinates are sensitive to system translations and rotations. The dataset can be reduced by, for example, discarding the bilayer and the solvent from a membrane protein simulation, or skipping simulation snapshots to decrease the time correlation between samples.
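As a sketch of this preprocessing step, the snippet below flattens Cartesian coordinates into an (L, 3N) matrix and, alternatively, computes inverse pairwise distances between residue reference atoms; using one reference atom per residue is a simplification of the closest-heavy-atom distances used in Paper 1 and 4, and the shapes are placeholders.

```python
import numpy as np

# Illustrative trajectory: L frames, N residue reference atoms (e.g. C-alphas),
# 3 Cartesian coordinates each. Shapes are assumptions for this sketch.
L, N = 100, 50
rng = np.random.default_rng(3)
coords = rng.normal(size=(L, N, 3))

# Option 1: raw Cartesian coordinates, one (3N,) vector per frame.
cartesian_features = coords.reshape(L, 3 * N)

# Option 2: inverse pairwise distances, which are invariant to global
# translations and rotations of the system.
i_idx, j_idx = np.triu_indices(N, k=1)
diffs = coords[:, i_idx, :] - coords[:, j_idx, :]
distances = np.linalg.norm(diffs, axis=-1)          # shape (L, N*(N-1)/2)
inverse_distance_features = 1.0 / distances
```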

The features derived from an MD trajectory may also be an ensemble property, such as the time average of some variable or a property of the free energy landscape. This approach is briefly explored in Paper 4, where we performed linear regression on the expectation values of microswitches from different simulations. Many models, including neural networks, are sensitive to the scaling of the features. The input data is thus often transformed by removing the mean and scaling to unit variance, or by rescaling every feature by its minimum and maximum values to the interval between zero and one (94). Another common limitation is that some machine learning models, such as the RBM used in Paper 1, handle discrete features better than continuous ones. Continuous features may thus be discretized during preprocessing (94). In short, the feature extraction procedure is application- and model-dependent, yet some approaches are generally more fruitful than others, which is something we explore in Paper 1.
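These preprocessing transforms map directly onto standard scikit-learn utilities; the sketch below shows standardization, min-max scaling, and a simple binning-based discretization, with the bin count and toy data chosen arbitrarily for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, KBinsDiscretizer

rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 8))   # illustrative feature matrix

# Zero mean and unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# Rescale every feature to the interval [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Discretize continuous features into a small number of bins, e.g. before
# feeding them to a model that prefers discrete inputs.
X_binned = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform").fit_transform(X)
```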

5.3) Allosteric network analysis

A protein is in constant motion at the atomistic level, and its residues may easily be perturbed from their local equilibrium positions. Changes in an allosteric site can cause a cascade of perturbations through the system, resulting in an allosteric effect in a different protein domain. This flow of information from one site to another is not a completely random process, but follows certain allosteric pathways (95). An MD trajectory may contain useful information about protein allostery, even if a related allostery-driven conformational change, such as GPCR activation triggered by agonist binding, is not observed within the simulation itself (95, 96).

Figure 7: A conceptual network where every node represents a protein residue. Once the network is built, we can extract allosteric pathways (blue) between source (orange) and sink (red) nodes by employing network analysis. Adapted from Paper 2.

It is often helpful to model a protein structure as a network, or a graph, of connected residues. A node in the network corresponds to a residue, and the edges between nodes represent the connections between residues. The adjacency matrix A, where every entry (i, j) denotes the edge from node i to node j, encodes all the information about the network. If two residues are unconnected, the corresponding entry in the adjacency matrix is zero, whereas a high value signifies a strong interaction.

There are different methods for constructing the adjacency matrix from an MD trajectory, and the choice of method can have a significant effect on the network's properties. An edge is often based on two residues' spatial proximity or their correlation in terms of movement. A residue interaction network combines both these aspects by defining the entries of the adjacency matrix as

$$A_{ij} = C_{ij} M_{ij},$$

where $C_{ij}$ is the contact map and $M_{ij}$ is the positional mutual information between two residues. A straightforward choice of $C_{ij}$ is a sparse binary matrix, with $C_{ij}$ set to unity only for residues within a certain cutoff (often 4.5 Å for heavy atoms (97)) of each other. Although a sparse binary matrix is easy to handle computationally, especially for large networks, it leaves out much information. Another approach is to make $C_{ij}$ proportional to the inverse interresidue distance, $C_{ij} \sim r_{ij}^{-1}$, as was done for feature selection in Paper 1, 3 and 4. A more elegant solution, which was the choice of method in Paper 2, is to let $C_{ij}$ decline exponentially above a certain cutoff distance, $C_{ij} = \exp\!\left((c^2 - r_{ij}^2)/(2\sigma^2)\right)$, where $c$ is the cutoff and $\sigma$ governs the rate of decline.
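A short numpy sketch of the three contact-map choices is given below; the 4.5 Å cutoff follows (97), whereas the σ value, the toy coordinates, and the assumption that the exponential form equals unity within the cutoff are illustrative choices rather than the exact settings of Paper 2.

```python
import numpy as np

def contact_maps(r, cutoff=4.5, sigma=1.5):
    """Three choices of contact map C from an interresidue distance matrix r (in Å)."""
    # 1) Sparse binary contacts: 1 within the cutoff, 0 otherwise.
    c_binary = (r <= cutoff).astype(float)

    # 2) Inverse-distance weighting, as used for feature selection in Paper 1, 3 and 4.
    c_inverse = 1.0 / r

    # 3) Exponential decline beyond the cutoff (Paper 2's choice); assumed to be
    #    unity within the cutoff for this sketch.
    c_exp = np.where(r <= cutoff, 1.0, np.exp((cutoff**2 - r**2) / (2.0 * sigma**2)))
    return c_binary, c_inverse, c_exp

# Illustrative symmetric distance matrix for 5 residues.
rng = np.random.default_rng(5)
points = rng.uniform(0.0, 15.0, size=(5, 3))
r = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
np.fill_diagonal(r, np.inf)       # avoid division by zero on the diagonal
c_bin, c_inv, c_exp = contact_maps(r)
```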

The mutual information estimates the amount of information that can be inferred about the position of one residue given the position of another residue (98, 99)². First, we compute the Shannon entropy,

$$H_i = -\int \rho_i(r) \ln \rho_i(r) \, dr,$$

where $\rho_i$ is a residue centroid's fluctuation density around its equilibrium position, and the joint entropy $H_{ij}$ is defined analogously from the joint density $\rho_{ij}$. The mutual information is then given by the expression $M_{ij} = H_i + H_j - H_{ij}$. For more details, including a discussion on how to estimate $\rho$, we refer the reader to Paper 2 or related publications (98–101).
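As a rough illustration of the entropy-based estimate, the sketch below computes the mutual information between two one-dimensional displacement series from histogrammed densities; real applications use three-dimensional centroid fluctuations and more careful density estimators (see Paper 2 and refs. 98–101), so this is only a conceptual sketch.

```python
import numpy as np

def mutual_information_1d(x, y, bins=20):
    """Histogram-based estimate of M = H(x) + H(y) - H(x, y) for 1D series."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]                      # 0 * log(0) is treated as 0
        return -np.sum(p * np.log(p))

    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Illustrative correlated displacements of two residue centroids (1D projections).
rng = np.random.default_rng(6)
x = rng.normal(size=5000)
y = 0.8 * x + 0.6 * rng.normal(size=5000)   # partially coupled to x
print("Estimated mutual information (nats):", mutual_information_1d(x, y))
```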

To perform allosteric network analysis, we infer the information flow³ between a source and an allosterically coupled sink site. For this purpose, it is useful to construct the network Laplacian,

$$L = D - A,$$

where $D$ is a diagonal matrix constructed by summing over the rows of the adjacency matrix. Thus, the diagonal entries of the Laplacian contain the total amount of information flowing into a node, while the off-diagonal entries measure the information flowing from one node toward another. Many useful properties can be extracted from the Laplacian or variants of it. Since the Laplacian encodes all the information flowing in a graph, we can analyze it to find allosteric pathways conducting signals from the source to the sink nodes (Fig. 7). To learn more about the details of this algorithm, we refer the reader to Paper 2 and previous publications on this topic (18, 102, 103). Combined with the methods outlined in the previous chapters, allosteric network analysis can thus shed light on the molecular mechanisms governing protein function.
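The electrical-network analogy suggests a simple way to compute such a flow: treat the adjacency entries as conductances, solve for node "potentials" with a unit injection at the source and extraction at the sink, and read off the flow on each edge. The numpy sketch below implements this generic current-flow calculation under those assumptions; it is not the exact pathway-extraction algorithm of Paper 2 and refs. (18, 102, 103).

```python
import numpy as np

def edge_information_flow(A, source, sink):
    """Current-flow-style analysis of a weighted residue network.

    A : symmetric adjacency (information) matrix, shape (n, n).
    Returns the absolute flow on every edge for a unit injection at `source`
    and extraction at `sink`.
    """
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))          # diagonal matrix of row sums
    L = D - A                           # graph Laplacian

    b = np.zeros(n)
    b[source], b[sink] = 1.0, -1.0      # unit in-flow at source, out-flow at sink
    v = np.linalg.pinv(L) @ b           # node potentials (L is singular, use pseudoinverse)

    # Flow on edge (i, j) is the conductance times the potential difference.
    return np.abs(A * (v[:, None] - v[None, :]))

# Small illustrative network of 5 residues.
A = np.array([[0, 2, 1, 0, 0],
              [2, 0, 2, 1, 0],
              [1, 2, 0, 2, 1],
              [0, 1, 2, 0, 2],
              [0, 0, 1, 2, 0]], dtype=float)
flows = edge_information_flow(A, source=0, sink=4)
print("Edge with the largest flow:", np.unravel_index(np.argmax(flows), flows.shape))
```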

² Early analysis of residue interaction networks considered correlations instead of mutual information (101).
³ Also called current flow, since the governing equations are analogous to those of the current flow in electrical networks, although such naming easily leads to confusion when used in the context of conducting ion channels.

6) Summary of papers

6.1) Paper 1—Molecular Insights from Conformational Ensembles via Machine Learning

The aim of this paper was to investigate how machine learning (ML) models of varying complexity can aid human analysis and interpretation of biomolecular simulations. We evaluated the unsupervised methods (PCA, the autoencoder and the RBM) and the supervised methods (the KL divergence, the Random Forest and the MLP classifier) introduced in Chapter 5.2. Key results include a protocol for data-driven analysis (Box 1), which we followed to study three different biological systems: Calmodulin, the β2 adrenergic receptor, and the Kv1.2 ion channel's voltage-sensor.

To construct a protocol for choosing the best ML model, we needed to quantify the different methods' performance, which demanded a system with a known ground truth. We constructed a toy model that mimicked conformational changes typical of macromolecular systems. Given an initial configuration of atoms, a trajectory was generated by adding weak random perturbations to all atomic coordinates. Functional states were designed by displacing certain atoms from their equilibrium positions. An ML model's performance was evaluated from two aspects: its capability to ignore irrelevant atoms, i.e. suppressing noise and providing a strong signal-to-noise ratio, and its ability to find all important features in the data, not just a subset of them (Fig. 8a and b).

As expected, since the machine learning tools are based on different algorithms and assumptions about the data, the computed importance profiles differed too. Nevertheless, the methods gave remarkably similar results at their best performance. We found that the choice of input features heavily affected the results, especially for linear models (Box 1). Methods such as the KL divergence and PCA could identify the conformational displacement in the toy system when using internal interatomic distances as features (Fig. 8b). When given only Cartesian coordinates, these methods performed poorly and failed to identify several important features (Fig. 8a), while a supervised neural network identified all important atoms in both cases. In a sense, the neural network learned the internal distances from the raw xyz-coordinates. These results demonstrate the power of neural networks, yet the most important finding was probably that the choice of input features matters. When different models point to the same features in your data, the odds are that they have learned something truly significant (Box 1).

Figure 8: (a–b) Box plots of the model performance using either Cartesian coordinates (a) or the full set of inverse interatomic distances (b) as input features, sampled over different instances of the toy model. A high accuracy at finding all atoms signifies that every displaced atom has been identified as important and that other atoms have low importance. A high accuracy at ignoring irrelevant atoms signifies that only displaced atoms, although not necessarily all of them, have been marked important. The best performing set of hyperparameters found after benchmarking every method has been used. RAND stands for random guessing. The boxplots show the median (orange horizontal line), the interquartile range (box), the upper and lower whiskers (vertical lines), as well as the outliers (circles). (c) Active state β2AR structure with important residues identified by PCA. The G-protein binding site undergoes the most significant conformational change upon activation, especially TM6 and the NPxxY motif at the bottom of TM7. The figure has been adapted from Paper 1.

Real biomolecular simulations are poor candidates for a quantitative evaluation of the models' performance, since there is no known ground truth about the residues' actual importance. Instead, in the second half of the paper, we tested our protocol qualitatively on systems that we were highly acquainted with, and verified that we can use ML to identify domains important for protein function.

First, we tested all six models on a dataset containing conformational states of the calcium-bound protein Calmodulin’s C-terminal domain. All methods successfully identified physiologically relevant residues, including conserved sites and those involved in binding to target proteins. The supervised methods gave consistent and easily interpretable results, while the results from unsupervised learning were slightly noisy.

Next, we studied the effect of ligand binding and the activation mechanism of the β2 adrenergic receptor (Chapter 2.4 and Fig. 2)⁴. The dataset originated from enhanced sampling simulations of the receptor's activation path and will be further analyzed in Paper 3 and 4. We used unsupervised learning (PCA) to identify two hallmarks of GPCR activation (Fig. 8c): the displacement of TM6 and the twist of the NPxxY motif. Next, we trained a Random Forest classifier (86, 87) to distinguish agonist-bound from ligand-free receptor conformations, using simulation snapshots from the receptor's entire activation path. The classifier primarily pinpointed residues known to interact with ligands, but also revealed allosteric effects in the G protein-binding site. The GPCR analysis shows how unsupervised and supervised learning can provide different insights, and that the proper choice of method depends on the question at hand.

Finally, we used supervised learning to track the response of the Kv1.2 ion channel's voltage-sensor domain (VSD) to a change in transmembrane potential. The dataset consisted of five previously reported states: one resting, one activated and three intermediate states. By training a Random Forest classifier to discriminate between these states, we found residues that couple to an electric potential. As expected, the top-ranked features were positively charged residues in the S4 segment. Interestingly, these residues were not equally important in every state. The most important residues were always localized near the so-called hydrophobic plug, the region of the VSD in which the electric field is strongest.

We did not explore the importance of interactions between the protein and its environment, although features such as residue hydration numbers could easily be included in our framework. Instead, the methods captured environmental effects not directly included in the dataset. For the β2 adrenergic receptor, the supervised learning model clearly picked up changes in the orthosteric site, without incorporating the ligand molecule in the input features. In the VSD analysis, we captured charged residues sensitive to an electric field, although neither atomic charges nor any information about the field was included in the dataset. These findings show that the methods actually 'learn' the truly important features of a system and are not just producing results out of a magical black box. We can leverage this powerful approach prospectively to demystify complex MD simulations.

⁴ In the second part of the paper, the authors each analyzed one system they were particularly familiar with. My contribution was that of the β2 adrenergic receptor.

1. Identify the problem to investigate
2. Decide if you should use supervised or unsupervised machine learning (or both)
   a. The best choice depends on what data is available and the problem at hand
   b. If you choose unsupervised learning, consider also clustering the simulation frames to label them and perform supervised learning
3. Select a set of features and scale them
   a. For many processes, protein internal coordinates are adequate. To reduce the number of features, consider filtering distances with a cutoff
   b. Consider other features that can be expressed as a function of internal coordinates you suspect to be important for the process of interest (dihedral angles, cavity or pore hydration, ion or ligand binding, etc.)
4. Choose a set of ML methods to derive feature importance
   a. To quickly get a clear importance profile with little noise, consider RF or KL for supervised learning. RF may perform better for noisy data
   b. For unsupervised learning, consider PCA, which is relatively robust when conducted on internal coordinates
   c. To find all important features, including those requiring nonlinear transformations of input features, also use neural network based approaches such as MLP. This may come at the cost of more peaks in the importance distribution
   d. Decide if you seek the average importance across the entire dataset (all methods), the importance per state (KL, a set of binary RF classifiers, or MLP), or the importance per single configuration (MLP, RBM, AE)
   e. Choose a set of hyperparameters which gives a reasonable trade-off between performance and model prediction accuracy
5. Ensure that the selected methods and hyperparameter choice perform well under cross-validation
6. Average the importance per feature over many iterations
7. Check that the distribution of importance has distinguishable peaks
8. To select low-dimensional, interpretable CVs for plotting and enhanced sampling, inspect the top-ranked features
9. For a holistic view, average the importance per residue or atom and visualize the projection on the 3D system
10. If necessary, iterate over steps 3–9 with different features, ML methods and hyperparameters

Box 1: Checklist for interpreting molecular simulations with machine learning. Copied from Paper 1.

6.2) Paper 2—Network analysis reveals how lipids and other cofactors influence membrane protein allostery

It is relatively straightforward to follow the steps outlined in Chapter 5.3 to construct a residue interaction network of an isolated protein and analyze intraprotein allostery. However, membrane proteins interact with, and are modulated by, molecules in the surrounding solvent and membrane. Together with ligands and other cofactors, these molecules affect protein allostery, yet it is challenging to incorporate them in the analysis. Paper 2 presents an extended version of the network analysis framework that takes cofactors into account. We evaluated the method on three membrane proteins: the voltage-gated ion channel KCNQ1, which is critical for repolarization of the heart's action potential; the β2 adrenergic receptor introduced in Chapter 2.4; and the bacterial pH-gated ion channel KcsA, which is frequently used as a model for more complex eukaryotic potassium channels.

A cofactor, for example a retained water molecule that is part of an allosteric pathway, might be swapped for another identical cofactor without disrupting the protein's function. To account for this high redundancy of cofactors, we solved a combinatorial optimization problem in each frame of a simulation trajectory, reflecting that any cofactor of the same type can play the same role. Larger molecules such as lipids also needed to be split into several network nodes to better account for their orientation. For membrane proteins, the resulting pathways typically revealed a high flow of information within the protein, and a weaker, yet substantial, flow through the nearest layers of embedding lipid molecules (Fig. 9c).

The methods introduced in Paper 1 and 2 act on the same kind of datasets, but have several notable differences. In Paper 1, the transformation of atomic coordinates to features enabled us to feed MD data into machine learning models. In Paper 2, preprocessing involved constructing an allosteric network. The allosteric network finds a communication pathway between two specific endpoints, whereas in Paper 1 we sought important features with a high signal-to-noise ratio. Although the two approaches may highlight similar residues (11, 15), they reveal residues that may be important for different reasons and, as such, provide complementary insights.

Similar to the outline in Paper 1, we applied the method to three biological systems of interest. We saw that the voltage-gated ion channel KCNQ1 partially communicated voltage sensitivity to the channel gate via the membrane. By applying information flow analysis to simulations of the pH-gated ion channel KcsA, we found that binding of 1,2-Dioleoyl-sn-glycero-3-phospho-rac-(1-glycerol) (DOPG) lipids affected allosteric signaling, in particular in the protein’s open state. Additionally, in line with expectations, we found that residues known to be important for the channels’ functions partake in the allosteric pathways.

Figure 9: (a) Structure of active/Nb-bound β2AR showing important structural features, and sources and sinks. The yellow arrows show cartoon pathways, highlighting the direction between the source and sink residues. (b) State- and ligand-dependent allosteric efficiency (blue: protein, yellow: protein + agonist, black: protein + agonist + full membrane). Left column: inactive state, and right: active/Nb-bound state. Upper row: apo β2AR, and bottom row: agonist-bound β2AR. (c) Average information flow through each protein residue and lipid interactors (information flow >0.012) projected onto the first trajectory frame. Red signifies larger information flow values. (d) Difference in allosteric efficiency of networks built by including different components of the β2AR systems. Dark red: comparing protein/agonist/membrane networks to protein networks, and red: comparing protein/agonist networks to protein networks. Averages and standard error margins are obtained by splitting the simulations into three parts. Copied from Paper 2.

For the β2 adrenergic receptor, we compared four single state sampling simulations (Chapter 4.3): with or without an agonist in the orthosteric site, and with or without a G-protein-mimicking nanobody in the G protein binding site (Fig. 9). The two nanobody-bound simulations sampled an active receptor state, whereas the nanobody-free simulations sampled an inactive, intermediate-like state. Agonist-interacting residues in the orthosteric site were assigned as source, and residues in contact with a G protein acted as sink (Fig. 9a).

We found that the closest layer of the membrane played a prominent part in the allosteric pathway between the two sites, especially for the apo receptor (Fig. 9b-c). The agonist-bound, nanobody-free receptor showed a relatively strong information flow within the protein, and a lesser involvement of allosteric lipids. This indicates that agonist binding focuses the allosteric signal transduction within the protein core, while its absence lets the signal transit via the environment. The nanobody-bound receptor coupled to the membrane regardless of whether the agonist was bound to the protein. Hence, nanobody binding dampens the relative allosteric effect of ligand binding (Fig. 9c and d), possibly because the conformational rearrangements triggered by intracellular binding dominate over those induced by agonist binding. A more thorough investigation of agonist binding to the nanobody-free receptor follows in Paper 3 and 4.

6.3) Paper 3—Energy Landscapes Reveal Agonist Control of G Protein-Coupled Receptor Activation via Microswitches

In this paper we studied the transition between the β2 adrenergic receptor’s active and inactive states with atomistic MD. The primary results of the paper were obtained using the string method with swarms of trajectories (Chapter 4.2). Starting from the active state structure 3P0G (104), we initiated the string from inactivation events observed in long timescale MD (44), and derived the conformational ensemble along the receptor’s most probable transition path. Three systems were considered: 1) the receptor in its apo state, 2) the agonist-bound receptor with the highly conserved residue D79(2.50) in its deprotonated (ionized) form, and 3) the agonist-bound receptor with D79(2.50) in its protonated state.

Although the paper focuses on biological insights, we also developed simulation and analysis methods. For example, we mitigated some of the string method’s most common drawbacks. We presented an optimization that adaptively chooses the number of trajectories in a swarm, and a reparametrization algorithm that enables exchange of configurations between string points. Relatively fast orthogonal degrees of freedom were further accounted for by assessing the convergence of the average string over many iterations. The string diffused for several iterations to assess convergence and to sample the activation path thoroughly. Every string ran for 300 iterations, corresponding to a few microseconds of accumulated simulation time. The short swarm trajectories, which had no biased force applied to them, were used to construct a transition matrix (Chapter 3.2), and hence derive the free energy landscape along microswitches.
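As a conceptual sketch of this last step, the snippet below estimates a free-energy profile from the stationary distribution of a row-stochastic transition matrix over discretized states; the toy matrix, the discretization, and the temperature value are illustrative assumptions rather than the exact procedure of Paper 3.

```python
import numpy as np

def free_energy_from_transition_matrix(T, kT=2.48):
    """Free energy (kJ/mol at roughly 300 K) from the stationary distribution of T.

    T is assumed to be row-stochastic over a set of discretized states.
    """
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])   # eigenvector for eigenvalue ~1
    pi = np.abs(pi) / np.abs(pi).sum()                   # normalize to a probability
    return -kT * np.log(pi)

# Toy transition matrix over four states along some microswitch coordinate.
T = np.array([[0.90, 0.10, 0.00, 0.00],
              [0.05, 0.85, 0.10, 0.00],
              [0.00, 0.10, 0.80, 0.10],
              [0.00, 0.00, 0.05, 0.95]])
F = free_energy_from_transition_matrix(T)
print("Relative free energies:", F - F.min())
```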

As with all CV-based enhanced sampling methods, the success of a simulation depends on the choice of CVs. For this reason, we analyzed a long-timescale MD simulation containing spontaneous receptor deactivation events (44). Spectral clustering analysis of this trajectory identified three distinct receptor states: an active, an intermediate, and an inactive state. Remarkably, the states were almost identical to those observed by Dror et al. (44) using a set of variables common to GPCR activation studies. By training an MLP classifier to learn which state a simulation frame belonged to, we could use layer-wise relevance propagation (Chapter 5.2) to find features that discriminate well between receptor states. The top five distances were selected as CVs in the swarms of trajectories method. By smoothing the deactivation trajectory frames along these CVs, we derived a reasonable approximation of the activation path, which we then used as an initial string. Two shorter substrings that better sampled the active and inactive sub-states were further included in the analysis (Fig. 10).

Figure 10: Free energy landscapes projected onto the first two CVs used to optimize the minimum free energy path. The minimum free energy pathways were optimized (a) in the absence and (b) in the presence of the bound agonist BI-167107. Characteristic events occurring during activation are marked on the string. The active and inactive labels mark the regions close to the active and inactive structures (PDB entries 3P0G (104) and 2RH1 (31), respectively). Copied from Paper 3.

From the swarms of trajectories simulations, we derived the mechanistic order of events and free energy along the receptor’s activation pathway (Fig. 10). Next, we projected the free energy landscapes along CVs describing common GPCR microswitches, which allowed us to identify parts of the protein that were strongly or allosterically coupled to the orthosteric site. In general, agonist binding was loosely coupled to large-scale conformational changes near the G protein binding site, such as the outward movement of TM6, but strongly coupled to closer microswitches in the transmembrane region. We pinpointed the transmembrane domain around the Connector motif as being the inflection point below which the strong coupling to agonist binding was broken.

In the apo simulations, we observed a sodium ion in the cavity surrounding D79(2.50) when the receptor was in its inactive state. Following binding to the orthosteric site residue D113(3.32) in the intermediate state, the sodium ion exited the receptor to the extracellular domain. This finding suggests that sodium binding may play a role in receptor modulation, in agreement with previous studies (105–107). Protonation of D79(2.50) shifted the free energy landscapes toward the active state, which indicated that this residue may be protonated in the active state, but not in the inactive state, when water molecules entered the surrounding cavity.

Finally, we explored the transferability of our method and results to other class A GPCRs. A protonated D79(2.50) further induced an active NPxxY motif, similar to conformations observed in other class A GPCR structures. Analysis of every class A receptor with resolved active and inactive structures revealed that our CVs were able to discriminate between all active and inactive GPCR states. Our approach may thus enable efficient sampling of other class A GPCRs' activation pathways.

6.4) Paper 4—Identification of ligand-specific G-protein coupled receptor states and prediction of downstream efficacy via data-driven modeling

In this paper, we studied activation of the β2AR bound to a set of six ligands found in experimental receptor structures (Fig. 2b), including several clinically approved drugs (108). This publication, too, involved methodological improvements, primarily the development of the single state sampling method (Chapter 4.3) and the associated dimensionality reduction analysis (Fig. 5b and c). Paper 4 also followed the protocol of Paper 1 (Box 1) to identify molecular features important for ligand-induced allosteric effects. The results otherwise build on those of Paper 3, and, although they are two stand-alone publications, the reader is recommended to read the papers in order.

We used the swarms of trajectories method to compute free energy landscapes along the ligand-receptor complexes' most probable activation pathways. In line with expectations, the agonists (BI-167107, Adrenaline and the partial agonist Salmeterol) shifted the free energy profiles toward active-like states, whereas the antagonists and inverse agonists (Alprenolol, Timolol and Carazolol) favored inactive-like states. The expectation values of orthosteric and transmembrane microswitches were strongly correlated with experimental values of receptor activity, measured as the cellular cAMP response (109) (Fig. 11a and b).

Microswitches near the G protein binding site showed little predictive power for downstream signaling. Instead, analysis of the ligands' respective single state sampling simulations (Chapter 4.3) revealed a great heterogeneity of states in this region. Dimensionality reduction methods generated low-dimensional visualizations (Fig. 5b and c) in which every ligand-receptor complex formed a distinct β2AR state. The agonists also clustered together in these plots. Backed by these findings, we identified interresidue distances that could discriminate between agonist and non-agonist receptor states. Two notable hotspots emerged (Fig. 11c): one at the extracellular end of TM5 near the orthosteric site, and another around the NPxxY motif, which showed a high heterogeneity of states, even among agonists.

Combining the insights from the activation pathways with the conformational differences between states, we arrived at a conceptual model for ligand control of receptor activation (Fig. 11d). Ligands interact with microswitches near the orthosteric site. Local interactions propagate to microswitches near the G protein-binding site, thus forming a distinct receptor signaling state. Our approach to study receptor activation synthesizes much of the work performed during the writing of this thesis. The study can act as a template to analyze other ligands and proteins, and might be used in the development of novel drugs.

Figure 11: (a)-(b) Correlation between experimental values of downstream cAMP signaling and the expectation value of different microswitches for the receptor bound to different ligands, (a) for the TM5 bulge, and (b) for the Connector. Red dashed lines highlight the clustering of the agonist-bound structures. (c) Residues identified to be important for classification of the equilibrated active-like states, comparing agonists to non-agonists. (d) Conceptual model describing allosteric communication between the hotspots. Adapted from Paper 4.

7) Outlook

The scientific disciplines that this work builds on have undergone continuous improvements and achieved a series of breakthroughs during the years leading up to this thesis. Multiple GPCR structures have been solved, MD simulation software has become faster, and exascale supercomputers will soon be a reality. In the future, it is reasonable to expect software and hardware to continue to push the limits of what can be achieved with MD simulations. Because of the high dimensionality of MD data, machine learning-based methods are likely to establish themselves as standard tools for in silico studies of biomolecular systems. Recent work in machine learning, and deep learning in particular (77, 110), shows great promise for mining a protein's conformational landscape. However, it is unclear today whether such methods can reach their full potential when applied to larger biomolecular systems. Regardless, as more structures are solved and the ever-growing output of MD simulations is made publicly available (48), the challenges outlined in the previous sections will remain.

In a nutshell, this thesis introduces methodological improvements and guidelines, and applies them to analyze the β2 adrenergic receptor's activation mechanism. All methods can be further improved, however. First, the associated software could be better documented and made available via common packaging systems. Other examples of improvements follow: the CVs used in the enhanced sampling methods may be automatically updated between iterations by deriving the most important features of the newly generated trajectories. The discretization technique we used to construct the transition matrix in Paper 3 and 4 is quite primitive; an alternative approach would be to discretize the system into microstates by clustering the swarm trajectories. In Paper 2, we incorporated the membrane in an allosteric pathway analysis. Bound cofactors that do not exchange could be included in the analysis protocol outlined in Paper 1 without difficulty, but further work is required to account for protein interactions with the mobile bulk solvent.

Several aspects of the β2AR’s system configuration should also be improved. For example, the membrane can be better modeled to mimic the β2AR’s native environment (Chapter 3.4). Constant-pH simulations may also shed light on the proper protonation state of titratable residues and verify if the protonation of D79(2.50) is indeed state-dependent.

To confirm the expectations of our approach to studying GPCR activation, future simulations should be carried out in conjunction with experimental studies. Hopefully, future prospective studies following the work in Paper 4 will contribute to the development of novel drug candidates with biased signaling properties. The entire field of GPCR molecular biology will benefit if we can determine which learnings from the prototypical receptors are transferable to other, lesser studied, systems. We expect our methods to be useful in studies of other receptors and even other classes of membrane proteins. Our approach plays well together with the increasing computational capacities and the rising availability of protein structures, and can be used to improve our understanding of many biomolecular systems.

Acknowledgements

First and foremost, I wish to thank my advisors, Lucie Delemotte and Erik Lindahl, for the opportunity to pursue a PhD at Science for Life Laboratory. Lucie, you host a fantastic work environment. Besides being an excellent scientist, you are a very considerate person. I am deeply grateful for your counseling these past years. Erik, thanks for making Molecular Biophysics Stockholm what it is. Your energy and multi-tasking ability is incredibly inspiring.

Thanks to past and current members of Delemotte’s army for all the good times we shared in the lab and on special occasions such as conferences and Christmas parties. Annie, Tyler, Sammy, Marina, Sarah, Lea, Koushik, Sergio, Ahmad, Yue, and all the students that joined the group for shorter periods—it was a pleasure to be around you every day. A special credit to Annie for paving the way for me all these years, and to Sergio for taking care of the army’s code.

Jens Carlsson, your knowledge and high scientific standard made this work what it is. I learned a lot by collaborating with you, Pierre and Nicolas. And thank you for taking me with you to some really fun conferences. I’m excited to see what your collaboration with Lucie will lead to in the future.

Berk, Catherine and Michele, I enjoyed teaching with you a lot. Also, thank you, Berk, for encouraging me to work on the GROMACS code.

Reba, many thanks for your engagement during journal clubs, and thanks for organizing so many nice social events. You make our community a very inclusive environment.

Other members of Molecular Biophysics Stockholm, including Dari, Yuxuan, Urska, Marie, Björn, Stefan, Linnea, Artem, Paul, Szilard, Joe, Viveca, Stephanie, Petter, Alessandra, Christan, Christian, Ali and Magnus—I really appreciated our discussions. You’re all amazing.

I wish to thank our neighbors on γ3, and other members of KTH’s biophysics group, for constituting a great interdisciplinary group. Andreas, I’m so glad that our paths crossed again.

Thanks to the team at Alex Therapeutics for what we achieved so far, and for the journey we have ahead of us. A special shout-out goes to John for being supportive of my PhD studies and for sharing this wild ride with me.

Last but not least, I’m grateful for my family. Mum and dad, for bringing me up and being excited about my research, even the technical details. Filippa, I’m so happy I got to work in the same place as you. I hope we resume our habit of lunching together once the pandemic has passed. Anders, thanks for keeping me in shape with our innebandy session. I can’t wait to let you, August and Ville kick my ass the next time we play together.

Mathilde, thank you for being a parental role model to us rookie twin parents. It means more than you know. I hope we get to see you, Joaquin (the world's best vice-captain), Lill-Jocke, Babba, Ana and Nico soon. Ivar and Monica, I'm thankful for your support in my endeavors. Your everyday help and consideration does not go by unnoticed.

No one deserves my acknowledgements more than my extraordinary wife, Anna. Your love and support is more than anyone can possibly deserve. The best thing that happened to me during the years carrying out this thesis—by light-years of margin—was the day Alice and Ellie were born. The twins are lucky to have the world’s best mother. I myself am extremely grateful for having such wonderful girls in my life.

Oliver Fleetwood

Småland

Midnight, 11 April 2021

Bibliography

1. Chu, X., L. Gan, E. Wang, and J. Wang. 2013. Quantifying the topography of the intrinsic energy landscape of flexible biomolecular recognition. Proc Natl Acad Sci USA. 110: E2342-51. 2. Harpole, T.J., and L. Delemotte. 2018. Conformational landscapes of membrane proteins delineated by enhanced sampling molecular dynamics simulations. Biochim. Biophys. Acta Biomembr. 1860: 909–926. 3. Pan, A.C., D. Sezer, and B. Roux. 2008. Finding transition pathways using the string method with swarms of trajectories. J. Phys. Chem. B. 112: 3432–3440. 4. Ching, T., D.S. Himmelstein, B.K. Beaulieu-Jones, A.A. Kalinin, B.T. Do, G.P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, M.M. Hoffman, W. Xie, G.L. Rosen, B.J. Lengerich, J. Israeli, J. Lanchantin, S. Woloszynek, A.E. Carpenter, A. Shrikumar, J. Xu, E.M. Cofer, and C.S. Greene. 2018. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface. 15. 5. Zitnik, M., F. Nguyen, B. Wang, J. Leskovec, A. Goldenberg, and M.M. Hoffman. 2019. Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Inf. Fusion. 50: 71–91. 6. Akay, A., and H. Hess. 2019. Deep learning: current and emerging applications in medicine and technology. IEEE J. Biomed. Health Inform. 23: 906–920. 7. Chen, H., O. Engkvist, Y. Wang, M. Olivecrona, and T. Blaschke. 2018. The rise of deep learning in drug discovery. Drug Discov. Today. 23: 1241–1250. 8. Fleetwood, O., M.A. Kasimova, A.M. Westerlund, and L. Delemotte. 2020. Molecular Insights from Conformational Ensembles via Machine Learning. Biophys. J. 118: 765–780. 9. Westerlund, A.M., and L. Delemotte. 2018. Effect of Ca2+ on the promiscuous target-protein binding of calmodulin. PLoS Comput. Biol. 14: e1006072. 10. Fleetwood, O., P. Matricon, J. Carlsson, and L. Delemotte. 2020. Energy Landscapes Reveal Agonist Control of G Protein-Coupled Receptor Activation via Microswitches. Biochemistry. 59: 880–891.

11. Fleetwood, O., J. Carlsson, and L. Delemotte. 2021. Identification of ligand-specific G-protein coupled receptor states and prediction of downstream efficacy via data-driven modeling. elife. 10. 12. Wehmeyer, C., and F. Noé. 2018. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J. Chem. Phys. 148: 241703. 13. Sittel, F., and G. Stock. 2018. Perspective: Identification of collective variables and metastable states of protein dynamics. J. Chem. Phys. 149: 150901.

47 14. Sultan, M.M., G. Kiss, D. Shukla, and V.S. Pande. 2014. Automatic selection of order parameters in the analysis of large scale molecular dynamics simulations. J. Chem. Theory Comput. 10: 5217–5223. 15. Westerlund, A.M., O. Fleetwood, S. Pérez-Conesa, and L. Delemotte. 2020. Network analysis reveals how lipids and other cofactors influence membrane protein allostery. J. Chem. Phys. 153: 141103. 16. Fernández-Mariño, A.I., T.J. Harpole, K. Oelstrom, L. Delemotte, and B. Chanda. 2018. Gating interaction maps reveal a noncanonical electromechanical coupling mode in the Shaker K+ channel. Nat. Struct. Mol. Biol. 25: 320–326. 17. Kohlhoff, K.J., D. Shukla, M. Lawrenz, G.R. Bowman, D.E. Konerding, D. Belov, R.B. Altman, and V.S. Pande. 2014. Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nat. Chem. 6: 15–21. 18. Botello-Smith, W.M., and Y. Luo. 2019. Robust determination of protein allosteric signaling pathways. J. Chem. Theory Comput. 15: 2116–2126. 19. Salzberg, S.L. 2018. Open questions: How many genes do we have? BMC Biol. 16: 94. 20. Zaccai, N.R., I.N. Serdyuk, and J. Zaccai. 2017. Methods in molecular biophysics: structure, dynamics, function for biology and medicine. Cambridge University Press. 21. Isberg, V., B. Vroling, R. van der Kant, K. Li, G. Vriend, and D. Gloriam. 2014. GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res. 42: D422-5. 22. Lesk, A. 2014. Introduction to Bioinformatics. 4th ed. Oxford, United Kingdom: Oxford University Press. 23. Senior, A.W., R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A.W.R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D.T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis. 2020. Improved protein structure prediction using potentials from deep learning. Nature. 577: 706–710. 24. Ishikawa, H., K. Kwak, J.K. Chung, S. Kim, and M.D. Fayer. 2008. Direct observation of fast protein conformational switching. Proc Natl Acad Sci USA. 105: 8619–8624.

25. Manglik, A., and B. Kobilka. 2014. The role of protein dynamics in GPCR function: insights from the β2AR and rhodopsin. Curr. Opin. Cell Biol. 27: 136–143. 26. Kofuku, Y., T. Ueda, J. Okude, Y. Shiraishi, K. Kondo, M. Maeda, H. Tsujishita, and I. Shimada. 2012. Efficacy of the β₂-adrenergic receptor is determined by conformational equilibrium in the transmembrane region. Nat. Commun. 3: 1045. 27. Liu, J.J., R. Horst, V. Katritch, R.C. Stevens, and K. Wüthrich. 2012. Biased signaling pathways in β2-adrenergic receptor characterized by 19F-NMR. Science. 335: 1106–1110.

48 28. Manglik, A., T.H. Kim, M. Masureel, C. Altenbach, Z. Yang, D. Hilger, M.T. Lerch, T.S. Kobilka, F.S. Thian, W.L. Hubbell, R.S. Prosser, and B.K. Kobilka. 2015. Structural Insights into the Dynamic Process of β2-Adrenergic Receptor Signaling. Cell. 161: 1101–1111. 29. Gregorio, G.G., M. Masureel, D. Hilger, D.S. Terry, M. Juette, H. Zhao, Z. Zhou, J.M. Perez-Aguilar, M. Hauge, S. Mathiasen, J.A. Javitch, H. Weinstein, B.K. Kobilka, and S.C. Blanchard. 2017. Single-molecule analysis of ligand efficacy in β2AR-G-protein activation. Nature. 547: 68–73. 30. Hauser, A.S., M.M. Attwood, M. Rask-Andersen, H.B. Schiöth, and D.E. Gloriam. 2017. Trends in GPCR drug discovery: new agents, targets and indications. Nat. Rev. Drug Discov. 16: 829–842. 31. Cherezov, V., D.M. Rosenbaum, M.A. Hanson, S.G.F. Rasmussen, F.S. Thian, T.S. Kobilka, H.-J. Choi, P. Kuhn, W.I. Weis, B.K. Kobilka, and R.C. Stevens. 2007. High-resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor. Science. 318: 1258–1265. 32. Hanson, M.A., V. Cherezov, M.T. Griffith, C.B. Roth, V.-P. Jaakola, E.Y.T. Chien, J. Velasquez, P. Kuhn, and R.C. Stevens. 2008. A specific cholesterol binding site is established by the 2.8 A structure of the human beta2-adrenergic receptor. Structure. 16: 897–905. 33. Masureel, M., Y. Zou, L.-P. Picard, E. van der Westhuizen, J.P. Mahoney, J.P.G.L.M. Rodrigues, T.J. Mildorf, R.O. Dror, D.E. Shaw, M. Bouvier, E. Pardon, J. Steyaert, R.K. Sunahara, W.I. Weis, C. Zhang, and B.K. Kobilka. 2018. Structural insights into binding specificity, efficacy and bias of a β2AR partial agonist. Nat. Chem. Biol. 14: 1059–1066. 34. Rasmussen, S.G.F., B.T. DeVree, Y. Zou, A.C. Kruse, K.Y. Chung, T.S. Kobilka, F.S. Thian, P.S. Chae, E. Pardon, D. Calinski, J.M. Mathiesen, S.T.A. Shah, J.A. Lyons, M. Caffrey, S.H. Gellman, J. Steyaert, G. Skiniotis, W.I. Weis, R.K. Sunahara, and B.K. Kobilka. 2011. Crystal structure of the β2 adrenergic receptor-Gs protein complex. Nature. 477: 549–555. 35. Ring, A.M., A. Manglik, A.C. Kruse, M.D. Enos, W.I. Weis, K.C. Garcia, and B.K. Kobilka. 2013. Adrenaline-activated structure of β2-adrenoceptor stabilized by an engineered nanobody. Nature. 502: 575–579.

36. Wacker, D., G. Fenalti, M.A. Brown, V. Katritch, R. Abagyan, V. Cherezov, and R.C. Stevens. 2010. Conserved binding mode of human beta2 adrenergic receptor inverse agonists and antagonist revealed by X-ray crystallography. J. Am. Chem. Soc. 132: 11443–11445. 37. Imai, S., T. Yokomizo, Y. Kofuku, Y. Shiraishi, T. Ueda, and I. Shimada. 2020. Structural equilibrium underlying ligand-dependent activation of β2-adrenoreceptor. Nat. Chem. Biol. 16: 430–439.

49 38. Ueda, T., Y. Kofuku, J. Okude, S. Imai, Y. Shiraishi, and I. Shimada. 2019. Function-related conformational dynamics of G protein-coupled receptors revealed by NMR. Biophys. Rev. 11: 409–418. 39. Miao, Y., and J.A. McCammon. 2016. Graded activation and free energy landscapes of a muscarinic G-protein-coupled receptor. Proc Natl Acad Sci USA. 113: 12162–12167. 40. Niesen, M.J.M., S. Bhattacharya, and N. Vaidehi. 2011. The role of conformational ensembles in ligand recognition in G-protein coupled receptors. J. Am. Chem. Soc. 133: 13197–13204. 41. Shan, J., G. Khelashvili, S. Mondal, E.L. Mehler, and H. Weinstein. 2012. Ligand-dependent conformations and dynamics of the serotonin 5-HT(2A) receptor determine its activation and membrane-driven oligomerization properties. PLoS Comput. Biol. 8: e1002473. 42. Bhattacharya, S., and N. Vaidehi. 2010. Computational mapping of the conformational transitions in agonist selective pathways of a G-protein coupled receptor. J. Am. Chem. Soc. 132: 5205–5214. 43. Li, J., A.L. Jonsson, T. Beuming, J.C. Shelley, and G.A. Voth. 2013. Ligand-dependent activation and deactivation of the human adenosine A(2A) receptor. J. Am. Chem. Soc. 135: 8749–8759. 44. Dror, R.O., D.H. Arlow, P. Maragakis, T.J. Mildorf, A.C. Pan, H. Xu, D.W. Borhani, and D.E. Shaw. 2011. Activation mechanism of the β2-adrenergic receptor. Proc Natl Acad Sci USA. 108: 18684–18689. 45. Provasi, D., M.C. Artacho, A. Negri, J.C. Mobarec, and M. Filizola. 2011. Ligand-induced modulation of the free-energy landscape of G protein-coupled receptors explored by adaptive biasing techniques. PLoS Comput. Biol. 7: e1002193. 46. Hu, X., Y. Wang, A. Hunkele, D. Provasi, G.W. Pasternak, and M. Filizola. 2019. Kinetic and thermodynamic insights into sodium ion translocation through the μ-opioid receptor from molecular dynamics and machine learning analysis. PLoS Comput. Biol. 15: e1006689. 47. Tikhonova, I.G., B. Selvam, A. Ivetac, J. Wereszczynski, and J.A. McCammon. 2013. Simulations of biased agonists in the β(2) adrenergic receptor with accelerated molecular dynamics. Biochemistry. 52: 5593–5603. 48. Rodríguez-Espigares, I., M. Torrens-Fontanals, J.K.S. Tiemann, D. Aranda-García, J.M. Ramírez-Anguita, T.M. Stepniewski, N. Worp, A. Varela-Rial, A. Morales-Pastor, B. Medel-Lacruz, G. Pándy-Szekeres, E. Mayol, T. Giorgino, J. Carlsson, X. Deupi, S. Filipek, M. Filizola, J.C. Gómez-Tamayo, A. Gonzalez, H. Gutiérrez-de-Terán, and J. Selent. 2020. GPCRmd uncovers the dynamics of the 3D-GPCRome. Nat. Methods. 17: 777–787. 49. Smith, J.S., R.J. Lefkowitz, and S. Rajagopal. 2018. Biased signalling: from simple

50 switches to allosteric microprocessors. Nat. Rev. Drug Discov. 17: 243–260. 50. Jean-Charles, P.-Y., S. Kaur, and S.K. Shenoy. 2017. G Protein-Coupled Receptor Signaling Through β-Arrestin-Dependent Mechanisms. J. Cardiovasc. Pharmacol. 70: 142–158. 51. Ballesteros, J.A., and H. Weinstein. 1995. [19] Integrated methods for the construction of three-dimensional models and computational probing of structure-function relations in G protein-coupled receptors. In: Receptor Molecular Biology. Elsevier. pp. 366–428. 52. Lindahl, Hess, and V.D. Spoel. 2021. GROMACS 2021 Manual. Zenodo. . 53. Huang, J., S. Rauscher, G. Nawrocki, T. Ran, M. Feig, B.L. de Groot, H. Grubmüller, and A.D. MacKerell. 2017. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat. Methods. 14: 71–73. 54. DA Case, K. Belfon, I. Ben-Shalom, S.R. Brozell, and D. Cerutti. 2020. Amber 2020. . 55. Robertson, M.J., J. Tirado-Rives, and W.L. Jorgensen. 2015. Improved Peptide and Protein Torsional Energetics with the OPLSAA Force Field. J. Chem. Theory Comput. 11: 3499–3509. 56. Abraham, M.J., T. Murtola, R. Schulz, S. Páll, J.C. Smith, B. Hess, and E. Lindahl. 2015. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 1–2: 19–25. 57. Beauchamp, K.A., G.R. Bowman, T.J. Lane, L. Maibaum, I.S. Haque, and V.S. Pande. 2011. Msmbuilder2: modeling conformational dynamics at the picosecond to millisecond scale. J. Chem. Theory Comput. 7: 3412–3419. 58. Guvench, O., and A.D. MacKerell. 2008. Comparison of protein force fields for molecular dynamics simulations. Methods Mol. Biol. 443: 63–88. 59. Lamoureux, G., and B. Roux. 2003. Modeling induced polarization with classical Drude oscillators: Theory and molecular dynamics simulation algorithm. J. Chem. Phys. 119: 3025–3039. 60. Lee, J., X. Cheng, J.M. Swails, M.S. Yeom, P.K. Eastman, J.A. Lemkul, S. Wei, J. Buckner, J.C. Jeong, Y. Qi, S. Jo, V.S. Pande, D.A. Case, C.L. Brooks, A.D. MacKerell, J.B. Klauda, and W. Im. 2016. CHARMM-GUI Input Generator for NAMD, GROMACS, AMBER, OpenMM, and CHARMM/OpenMM Simulations Using the CHARMM36 Additive Force Field. J. Chem. Theory Comput. 12: 405–413. 61. Shaw, D.E., J.C. Chao, M.P. Eastwood, J. Gagliardo, J.P. Grossman, C.R. Ho, D.J. Lerardi, I. Kolossváry, J.L. Klepeis, T. Layman, C. McLeavey, M.M. Deneroff, M.A. Moraes, R. Mueller, E.C. Priest, Y. Shan, J. Spengler, M. Theobald, B. Towles, S.C. Wang, and K.J. Bowers. 2008. Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM. 51: 91.

51 62. Klauda, J.B., R.M. Venable, J.A. Freites, J.W. O’Connor, D.J. Tobias, C. Mondragon-Ramirez, I. Vorobyov, A.D. MacKerell, and R.W. Pastor. 2010. Update of the CHARMM all-atom additive force field for lipids: validation on six lipid types. J. Phys. Chem. B. 114: 7830–7843. 63. Chavent, M., A.L. Duncan, and M.S. Sansom. 2016. Molecular dynamics simulations of membrane proteins and their interactions: from nanoscale to mesoscale. Curr. Opin. Struct. Biol. 40: 8–16. 64. Guixà-González, R., M. Javanainen, M. Gómez-Soler, B. Cordobilla, J.C. Domingo, F. Sanz, M. Pastor, F. Ciruela, H. Martinez-Seara, and J. Selent. 2016. Membrane omega-3 fatty acids modulate the oligomerisation kinetics of adenosine A2A and dopamine D2 receptors. Sci. Rep. 6: 19839. 65. Lindahl, V., J. Lidmar, and B. Hess. 2014. Accelerated weight histogram method for exploring free energy landscapes. J. Chem. Phys. 141: 044110. 66. Lidmar, J. 2012. Improving the efficiency of extended ensemble simulations: the accelerated weight histogram method. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 85: 056708. 67. Barducci, A., G. Bussi, and M. Parrinello. 2008. Well-tempered metadynamics: a smoothly converging and tunable free-energy method. Phys. Rev. Lett. 100: 020603. 68. Sutto, L., S. Marsili, and F.L. Gervasio. 2012. New advances in metadynamics. WIREs Comput Mol Sci. 2: 771–779. 69. Marchi, M., and P. Ballone. 1999. Adiabatic bias molecular dynamics: A method to navigate the conformational space of complex molecular systems. J. Chem. Phys. 110: 3697. 70. Tribello, G.A., M. Bonomi, D. Branduardi, C. Camilloni, and G. Bussi. 2014. PLUMED 2: New feathers for an old bird. Comput. Phys. Commun. 185: 604–613. 71. Jarzynski, C. 1997. Nonequilibrium Equality for Free Energy Differences. Phys. Rev. Lett. 78: 2690–2693. 72. Bolhuis, P.G., D. Chandler, C. Dellago, and P.L. Geissler. 2002. Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu. Rev. Phys. Chem. 53: 291–318. 73. Dellago, C., P.G. Bolhuis, F.S. Csajka, and D. Chandler. 1998. Transition path sampling and the calculation of rate constants. J. Chem. Phys. 108: 1964. 74. Lev, B., S. Murail, F. Poitevin, B.A. Cromer, M. Baaden, M. Delarue, and T.W. Allen. 2017. String method solution of the gating pathways for a pentameric ligand-gated ion channel. Proc Natl Acad Sci USA. 114: E4158–E4167. 75. Prinz, J.-H., H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J.D. Chodera, C.

52 Schütte, and F. Noé. 2011. Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 134: 174105. 76. Harrigan, M.P., M.M. Sultan, C.X. Hernández, B.E. Husic, P. Eastman, C.R. Schwantes, K.A. Beauchamp, R.T. McGibbon, and V.S. Pande. 2017. Msmbuilder: statistical models for biomolecular dynamics. Biophys. J. 112: 10–15. 77. Noé, F., S. Olsson, J. Köhler, and H. Wu. 2019. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science. 365. 78. Sugita, Y., and Y. Okamoto. 1999. Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314: 141–151. 79. Rhee, Y.M., and V.S. Pande. 2003. Multiplexed-replica exchange molecular dynamics method for protein folding simulation. Biophys. J. 84: 775–786. 80. Zimmerman, M.I., J.R. Porter, X. Sun, R.R. Silva, and G.R. Bowman. 2018. Choice of adaptive sampling strategy impacts state discovery, transition probabilities, and the apparent mechanism of conformational changes. J. Chem. Theory Comput. 14: 5459–5475. 81. Pan, A.C., and B. Roux. 2008. Building Markov state models along pathways to determine free energies and rates of transitions. J. Chem. Phys. 129: 064107. 82. Tipping, M.E., and C.M. Bishop. 1999. Probabilistic Principal Component Analysis. J. Royal Statistical Soc. B. 61: 611–622. 83. Borg, I., and P.J.F. Groenen. 2005. Modern Multidimensional Scaling. 2nd ed. New York, NY: Springer New York. 84. Wikimedia Commons. 2020. File:COVID-19 Outbreak World Map (35).svg. . 85. Bishop, C.M. 2006. Pattern recognition and machine learning. . 86. Ho, T.K. 1995. Random decision forests. Proceedings of 3rd international conference on document analysis and recognition. 1: 278–282. 87. Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. 1984. Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. 88. Louppe, G. 2014. Understanding Random Forests: From Theory to Practice. arXiv. . 89. Hinton, G.E., and R.R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science. 313: 504–507. 90. Smolensky, P. 1986. Information Processing in Dynamical Systems: Foundations of Harmony Theory. . 91. Camacho, D.M., K.M. Collins, R.K. Powers, J.C. Costello, and J.J. Collins. 2018. Next-Generation Machine Learning for Biological Networks. Cell. 173: 1581–1592.

53 92. Olah, C., A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev. 2018. The building blocks of interpretability. Distill. 3. 93. Montavon, G., W. Samek, and K.-R. Müller. 2018. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 73: 1–15. 94. Pedregosa, F., G. Varoquaux, and A. Gramfort. 2011. Scikit-learn: Machine learning in Python. JMLR. 12: 2825–2830. 95. del Sol, A., C.-J. Tsai, B. Ma, and R. Nussinov. 2009. The origin of allosteric functional modulation: multiple pre-existing pathways. Structure. 17: 1042–1050. 96. Tsai, C.-J., A. del Sol, and R. Nussinov. 2008. Allostery: absence of a change in shape does not imply that allostery is not at play. J. Mol. Biol. 378: 1–11. 97. Yuan, C., H. Chen, and D. Kihara. 2012. Effective inter-residue contact definitions for accurate protein fold recognition. BMC Bioinformatics. 13: 292. 98. Lange, O.F., and H. Grubmüller. 2006. Generalized correlation for biomolecular dynamics. Proteins. 62: 1053–1061. 99. Cortina, G.A., and P.M. Kasson. 2016. Excess positional mutual information predicts both local and allosteric mutations affecting beta lactamase drug resistance. Bioinformatics. 32: 3420–3427. 100. Westerlund, A.M. 2020. Deciphering conformational ensembles and communication pathways in biomolecules. Stockholm: A&C Black Publishers Ltd., 2020. 60. 101. Sethi, A., J. Eargle, A.A. Black, and Z. Luthey-Schulten. 2009. Dynamical networks in tRNA:protein complexes. Proc Natl Acad Sci USA. 106: 6620–6625. 102. Newman, M.E.J. 2005. A measure of betweenness centrality based on random walks. Soc. Networks. 27: 39–54. 103. Freeman, L.C. 1977. A Set of Measures of Centrality Based on Betweenness. Sociometry. 40: 35. 104. Rasmussen, S.G.F., H.-J. Choi, J.J. Fung, E. Pardon, P. Casarosa, P.S. Chae, B.T. Devree, D.M. Rosenbaum, F.S. Thian, T.S. Kobilka, A. Schnapp, I. Konetzki, R.K. Sunahara, S.H. Gellman, A. Pautsch, J. Steyaert, W.I. Weis, and B.K. Kobilka. 2011. Structure of a nanobody-stabilized active state of the β(2) adrenoceptor. Nature. 469: 175–180. 105. Miller-Gallacher, J.L., R. Nehmé, T. Warne, P.C. Edwards, G.F.X. Schertler, A.G.W. Leslie, and C.G. Tate. 2014. The 2.1 Å resolution structure of cyanopindolol-bound β1-adrenoceptor identifies an intramembrane Na+ ion that stabilises the ligand-free receptor. PLoS ONE. 9: e92727. 106. Katritch, V., G. Fenalti, E.E. Abola, B.L. Roth, V. Cherezov, and R.C. Stevens. 2014.

54 Allosteric sodium in class A GPCR signaling. Trends Biochem. Sci. 39: 233–244. 107. Vickery, O.N., C.A. Carvalheda, S.A. Zaidi, A.V. Pisliakov, V. Katritch, and U. Zachariae. 2018. Intracellular Transfer of Na+ in an Active-State G-Protein-Coupled Receptor. Structure. 26: 171-180.e2. 108. Woo, T.M., and M.V. Robinson. 2015. Pharmacotherapeutics for advanced practice nurse prescribers. 4th ed. F. A. Davis Company. 109. van der Westhuizen, E.T., B. Breton, A. Christopoulos, and M. Bouvier. 2014. Quantification of ligand bias for clinically relevant β2-adrenergic receptor ligands: implications for drug taxonomy. Mol. Pharmacol. 85: 492–509. 110. Degiacomi, M.T. 2019. Coupling molecular dynamics and deep learning to mine protein conformational space. Structure. 27: 1034-1040.e3.
