A mass cytometry application of a community structure generating algorithm derived from a combination of Leiden and Girvan-Newman methods

by Elyse Levens
Thesis
April 5, 2021

Table of Contents

Abstract
1 Community structure generating algorithms
1.1 Modularity
1.2 The Louvain, Leiden, and Girvan-Newman algorithms
2 The biology and physics of mass cytometry
3 Analysis process
4 The underlying math and physics of TOFMS
5 The structure of community structure generating algorithms
6 The graph theory underlying the Leiden algorithm
7 The graph theory underlying the Girvan-Newman algorithm
8 Results and discussion
9 Bibliography

Abstract

Mass cytometry is a biological tool used to identify different cell types based on their mass. Many community structure generating algorithms have been developed by biologists and mathematicians with the goal of sorting populations of cells into clusters of distinct cell types. This thesis analyzes the physical and mathematical processes behind mass cytometry and two clustering algorithms: the Leiden algorithm and the Girvan-Newman algorithm. The R and Python programming languages are used to analyze mass cytometry data generated from a population of neurons from a wild-type mouse line, and the two clustering algorithms are combined to examine how well a population of nodes can be sorted into its correct identities.

1 Community structure generating algorithms

Community structure, in the context of complex networks, is a fundamental concept in many fields, including mathematics, physics, and biology. Identifying community structure matters for recognizing the varying properties of networks, predicting potentially unobserved connections, understanding the relationship between network function and topology, and creating map data. In a network (for example, a neural network), communities are groups of nodes whose connections to one another are much denser than their connections to the rest of the network. Community structure is used in mass cytometry, a biological tool that identifies large populations of cells over multiple time points, such as before birth (embryonic day 13.5 (E13.5), E14.5, E15.5, E16.5, etc.) and after birth (postnatal day 1 (P1), P2, P3, P4, etc.).

There are many different algorithms by which community structures can be identified, and each one attempts to address the shortcomings of the others. Two contrasting algorithms are the Leiden algorithm and the Girvan-Newman algorithm. The Leiden algorithm is relatively new: it was developed in 2019 by optimizing the process involved in the Louvain method.3 Both the Louvain and Leiden algorithms maximize modularity, which optimizes the clustering process. The Leiden algorithm is popular in mass cytometry because of its method of randomly picking nodes once to aggregate into different partitions. This allows stronger connections to be identified, whereas the Louvain algorithm repeatedly picks the same nodes and re-sorts them into different clusters, which encourages clusters with weak connections between nodes or even disconnected nodes. The Leiden algorithm is not perfect, however: although the aggregation process is optimized, it can still generate clusters with weak internal connections, as explored later in this paper.
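To make the aggregation step concrete, the sketch below runs the Leiden algorithm on a small toy graph. This is a minimal sketch only: it assumes the third-party python-igraph and leidenalg packages, and the toy graph and the choice of a modularity-based partition type are illustrative assumptions rather than the exact pipeline used for the neuron data in this project.

    # Minimal sketch: Leiden community detection on a toy graph.
    # Assumes python-igraph and leidenalg are installed; the graph below
    # is invented for illustration and is not the thesis data set.
    import igraph as ig
    import leidenalg

    # Two dense triangles joined by a single bridge edge.
    edges = [(0, 1), (0, 2), (1, 2),   # group A
             (3, 4), (3, 5), (4, 5),   # group B
             (2, 3)]                   # weak bridge between the groups
    g = ig.Graph(edges=edges)

    # Maximize modularity as the quality function.
    partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)

    print(partition.membership)  # e.g. [0, 0, 0, 1, 1, 1]
    print(partition.quality())   # modularity of the returned partition

Under the same assumptions, a resolution-based partition type such as leidenalg.RBConfigurationVertexPartition accepts a resolution_parameter argument, which plays the role of the γ discussed in Section 1.1.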
The Girvan-Newman algorithm sorts nodes hierarchically based on their lineage. It removes edges of a graph (a group of nodes connected by edges) based on the strength of the connections along each edge: if the connections are weak, the edge is removed. This process is repeated until no edges remain and each node (or cell) stands alone. The issue with the Girvan-Newman algorithm is that the resulting lineage depends on the labelling of the nodes in a graph. Essentially, the input of the Girvan-Newman algorithm is not a graph but the labels of the nodes in a graph.19 This means that a single input can produce multiple outputs, because the algorithm is capable of rearranging nodes (and, in turn, rearranging labels).

Minimizing modularity generates fewer clusters, which merges small clusters of distinct identity into larger clusters based on loose similarities between the multiple identities. To optimize modularity so that it generates clusters of distinct identity, we must maximize it; in other words, we want to generate more clusters (maximize modularity) rather than fewer clusters (minimize modularity). Optimization of modularity is NP-hard, meaning it is at least as difficult as any nondeterministic polynomial time (NP) problem. NP is a complexity class in theoretical computer science, and a problem is in the NP class if it can be solved in polynomial time by a non-deterministic Turing machine.2 Since creating the perfect clustering algorithm is essentially impossible, I will attempt to approach the complete optimization of modularity by combining the Leiden and Girvan-Newman algorithms.

1.1 Modularity

Modularity involves the relationship between modules, or, in our case, community structures. Maximizing modularity increases the possibility of generating well-defined, or "good," clusters. Generating "good" clusters relies on improving the relationship between correlated modules and the correct identity of the nodes in those modules. An example of a "good" cluster is one in which every node shares the same identity as every other node in the cluster. The Leiden algorithm, which is derived from the Louvain algorithm, maximizes the modularity of detected community structures. The Louvain algorithm, while originally designed to maximize modularity, can also maximize other quality functions, such as the Constant Potts Model.3 Quality functions describe the quality of a network development, description, or identification process; the quality function I discuss in this project is modularity. By optimizing modularity and generating "good" clusters, biologists can better rely on mass cytometry to identify clusters of different types of cells.
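Before writing the modularity formula out, it helps to see a quality function in action. The sketch below scores two competing partitions of the same toy graph by modularity; it is a minimal sketch that assumes the networkx package, and the graph and both candidate partitions are invented for illustration rather than drawn from the thesis data.

    # Minimal sketch: comparing two candidate partitions by modularity.
    # Assumes networkx is installed; the graph and partitions are invented.
    import networkx as nx
    from networkx.algorithms.community import modularity

    # Two dense triangles joined by a single bridge edge.
    G = nx.Graph([(0, 1), (0, 2), (1, 2),
                  (3, 4), (3, 5), (4, 5),
                  (2, 3)])

    # A "good" partition keeps each triangle together; a "bad" one mixes them.
    good = [{0, 1, 2}, {3, 4, 5}]
    bad = [{0, 1, 3}, {2, 4, 5}]

    print(modularity(G, good))  # higher Q: well-defined clusters
    print(modularity(G, bad))   # lower (negative) Q: mixed clusters

In recent versions of networkx the same modularity call also accepts a resolution argument, which corresponds to the resolution parameter γ in the formula below.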
Modularity represents the connections between modules (or clusters/community structures); with respect to mass cytometry, it represents the connections and similarities between clusters. We further specify the similarities between individual cells and exclude the distant connections and similarities in order to optimize the clusters (or modules) that are generated. Modularity can be represented as

Q = \frac{1}{2m} \sum_{c} \left( e_c - \gamma \frac{K_c^2}{2m} \right)

where our community is denoted c, 1/(2m) is the normalization term, e_c represents the number of edges in community c, K_c^2/(2m) represents the probability that a random edge is in community c, and γ is the resolution parameter. The resolution parameter affects the number of community structures that are produced: a higher γ results in more community structures, while a lower γ results in fewer. Here m represents the total number of edges in the graph (which is not a graphical plot of data, but a group of vertices connected by edges), and K_c represents the degrees of all nodes in community c summed together.1,3 This representation of modularity is used by community-generating algorithms whose purpose is to maximize modularity. Maximizing Q means optimizing the connections between and within generated communities so that mass cytometry outputs specific clusters of a single cell identity. Essentially, maximizing Q means finding the best relationship between γ and the numbers of edges and degrees in a graph. An example of an optimized clustering algorithm output would be one in which only progenitor cells are clustered with each other through strong connections, and the connections between progenitor cells and other types of cells are very weak or nonexistent.

Modularity is meant to complement and control the complexity of a network. Integrated and mixed communities in a network are difficult to discern from each other, so modularity allows communities (or modules) to be compartmentalized as distinct from one another in order to decrease the complexity of a network.20 The complex network of neurons is incredibly difficult to work with if we are unable to discern which types of cells develop at which time points. Modularizing the complex network of neurons lets us more easily discern which types of cells develop at specified time points, because the communities of cells are separated distinctly with little to no overlap or mixed communities.

1.2 The Louvain, Leiden, and Girvan-Newman algorithms

The Louvain method uses local movement of nodes and the aggregation of similar nodes. This process is repeated until there are no more refinements to be made, as shown in Figure 1.3

[Figure 1: The Louvain algorithm. Source: [3] Traag et al.]

The challenge with the Louvain algorithm is that its method of moving nodes can disconnect nodes from the communities in which they actually belong. There is also the chance that a node from one community is moved