A mass cytometry application of a community structure generating algorithm derived from a combination of Leiden and Girvan-Newman methods

by

Elyse Levens

Thesis

April 5, 2021

Table of Contents

Abstract

1 Community structure generating algorithms

1.1 Modularity

1.2 The Louvain, Leiden, and Girvan-Newman algorithms

2 The biology and physics of mass cytometry

3 Analysis process

4 The underlying math and physics of TOFMS

5 The structure of community structure generating algorithms

6 The graph theory underlying the Leiden algorithm

7 The graph theory underlying the Girvan-Newman algorithm

8 Results and discussion

9 Bibliography

Levens 1

Abstract

Mass cytometry is a biological tool used to identify different cell types based on their mass. Many different community structure generating algorithms have been developed by biologists and mathematicians with the goal of sorting populations of cells into clusters of distinct cell types. This thesis project will analyze the physical and mathematical processes behind mass cytometry and two clustering algorithms: the Leiden algorithm and the Girvan-Newman algorithm. In this project, the R and Python programming languages are used to analyze mass cytometry data generated from a population of neurons from a wild-type mouse line. The two clustering algorithms will be combined to analyze the optimization of sorting a population of nodes into correct identities.

1 Community structure generating algorithms

Community structure, in the context of complex networks, is a fundamental concept in many fields, including math, physics, and biology. Identifying community structures is important for recognizing the varying properties of networks, predicting potentially unobserved connections, understanding the relationship between network function and topology, and creating map data. In a network (for example, a neural network), communities are identified as groups of nodes in which the connections between nodes are much denser than the connections to the rest of the network. Community structure is used in mass cytometry: a biological tool that identifies large populations of cells over multiple time points such as before birth (embryonic day 13.5 (E13.5), E14.5, E15.5, E16.5, etc.) and after birth (postnatal day 1 (P1), P2, P3, P4, etc.).

There are many different algorithms by which community structures can be identified. Each developed algorithm attempts to address shortcomings of the other algorithms. Two contrasting algorithms are the Leiden algorithm and the Girvan-Newman algorithm. The Leiden algorithm is relatively new: it was developed in 2019 by optimizing the process involved in the Louvain method [3]. Both the Louvain and Leiden algorithms maximize modularity, which optimizes the clustering process.

The Leiden algorithm is popularly used in mass cytometry due to its method of randomly picking nodes once to aggregate in different partitions. This allows for stronger connections to be identified, whereas the Louvain algorithm repeatedly picks the same nodes to re-sort them into different clusters, which encourages clusters with weak connections between nodes or even disconnected nodes. The Leiden algorithm isn’t perfect, however, because while the aggregation process is optimized, it can still generate clusters that have weak internal connections, which will be explored later in this paper. The Girvan-Newman algorithm sorts nodes hierarchically based on their lineage. It removes edges of a graph (a group of nodes connected by edges) based on how many shortest paths between nodes run along each edge (its “betweenness”). The edge with the highest “betweenness” is removed, since such edges tend to bridge separate communities. This process is repeated until no edges are left and every node (or cell) stands alone. The issue with the Girvan-Newman algorithm is that the resultant lineage is dependent upon the labelling of the nodes in a graph. Essentially, the input of the Girvan-Newman algorithm isn’t a graph, but the labels of the nodes in a graph [19]. This means that there is the possibility of multiple outputs for a single input due to the algorithm’s capability of rearranging nodes (and, in turn, rearranging labels).

Minimizing modularity generates fewer clusters, which merges small clusters of distinct identity together in larger clusters based on loose similarities between the multiple identities. In order to generate clusters of distinct identity, we must maximize modularity. In other words, we want to generate more clusters (maximize modularity) as opposed to fewer clusters (minimize modularity). Optimization of modularity is NP-hard, which means that it is at least as difficult as any nondeterministic polynomial time (NP) problem. NP is a complexity class in theoretical computer science, and a problem is considered to be in the NP class if it can be solved in polynomial time by a non-deterministic Turing machine [2]. Since creating the perfect clustering algorithm is essentially impossible, I will attempt to approach the complete optimization of modularity by combining the Leiden and Girvan-Newman algorithms.

1.1 Modularity

Modularity involves the relationship between modules, or, in our case, community structures. Maximizing modularity increases the possibility for the generation of well-defined clusters, or, “good” clusters. The generation of “good” clusters relies on improving the relationship between correlating modules and the correct identity of the nodes in those modules. An example of a “good” cluster is a cluster that includes only nodes of the same identity as all of the other nodes in the cluster. The Leiden algorithm, which is derived from the Louvain algorithm, maximizes the modularity of detected community structures. The Louvain algorithm, while originally designed to maximize modularity, also can maximize other quality functions, such as the Constant Potts Model [3]. Quality functions describe the quality of a network development/description/identification process. The quality function I will discuss in this project is modularity. By optimizing modularity and generating “good” clusters, biologists can better rely on mass cytometry to identify clusters of different types of cells.

Modularity represents the connections between modules (or clusters/community structures). With respect to mass cytometry, modularity represents the connections/similarities between clusters. We further specify the similarities between individual cells and exclude the distant connections/similarities to optimize the clusters (or modules) that are generated.

Modularity can be represented as:

$$Q = \frac{1}{2m} \sum_{c} \left( e_c - \gamma \frac{K_c^2}{2m} \right)$$

where our community is denoted $c$, $\frac{1}{2m}$ is the normalization term, $e_c$ represents the number of edges in community $c$, $\frac{K_c^2}{2m}$ represents the expected number of edges in community $c$ if edges were placed at random, and $\gamma$ is the resolution parameter. The resolution parameter affects the number of community structures that are produced: a higher $\gamma$ results in more community structures while a lower $\gamma$ results in fewer community structures. The $m$ represents the total number of edges in the graph (which is not a graphical plot of data, but instead a group of vertices connected by edges) and $K_c$ represents the degrees of all nodes in community $c$ summed together [1, 3].

This representation of modularity is used by community-generating algorithms whose purpose is to maximize modularity. Maximizing $Q$ means optimizing the connections between and within generated communities for use in mass cytometry to output specific clusters of a single identity of cell. Essentially, maximizing $Q$ means finding the best relationship between $\gamma$ and the numbers of edges and degrees in a graph. An example of an optimized clustering algorithm output would be if only progenitor cells are clustered to each other with strong connections, and the connections between progenitor cells and other types of cells are very weak and/or nonexistent.
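The formula for $Q$ can be evaluated directly for any assignment of nodes to communities. The following is a minimal Python sketch, not code from this project. It assumes an undirected, unweighted graph stored as a list of edges, and it assumes that $e_c$ is tallied from both endpoints of each internal edge (so every edge inside a community contributes 2), which is the reading under which $Q$ agrees with the standard Newman modularity. The function name `modularity`, its arguments, and the toy graph are illustrative only.

```python
from collections import defaultdict

def modularity(edges, community_of, gamma=1.0):
    """Q = (1/2m) * sum_c (e_c - gamma * K_c**2 / (2m)) for an undirected graph."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    e = defaultdict(int)  # e_c: internal edges, counted from both endpoints (assumption)
    K = defaultdict(int)  # K_c: summed degree of the nodes in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        K[cu] += 1
        K[cv] += 1
        if cu == cv:
            e[cu] += 2
    return sum(e[c] - gamma * K[c] ** 2 / (2 * m) for c in K) / (2 * m)

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

print(modularity(edges, two_triangles))  # each triangle is its own community
print(modularity(edges, one_cluster))    # every node in a single community
```

On this toy graph, placing each triangle in its own community gives $Q = 5/14$, while placing all six nodes in one community gives $Q = 0$ at $\gamma = 1$, so the split into two triangles is the better partition under this measure.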

Modularity is meant to complement and control the complexity of a network. Integrated and mixed communities in a network are difficult to discern from each other, and so modularity allows for communities (or modules) to be compartmentalized as distinct from one another in order to decrease the complexity of a network [20]. The complex network of neurons is incredibly difficult to work with if we are unable to discern which types of cells develop at which time points. The modularization of the complex network of neurons allows us to more easily discern which types of cells develop at specified time points because the communities of cells are separated distinctly with little to no overlap or mixed communities.

1.2 The Louvain, Leiden, and Girvan-Newman algorithms

The Louvain method uses local movement of nodes and the aggregation of similar nodes. This process is repeated until there are no more refinements to be made, as shown in Figure 1 [3].

Figure 1: The Louvain algorithm. Source: [3] Traag, et al.

The challenge with the Louvain algorithm is that the method of the movement of nodes potentially disconnects nodes from communities in which the nodes actually belong. There is also the chance that a node from one community is moved to another community on the basis that it is more similar to the second community, whereas in reality, that node was a bridge between two or more similar groups of nodes that form one community [3]. This is a direct issue for modularity, and shows that the use of the Louvain algorithm does not maximize modularity in the way that it should.

The Leiden algorithm uses the local movement of nodes, refinement of the generated partitions, and the aggregation of the network based on the refined partitions (using the non-refined partition to create the initial partition of the aggregate network). These steps are repeated until there are distinct community structures that cannot be refined further, as shown in Figure 2 [3].

Figure 2: The Leiden algorithm. Source: [3] Traag, et al.

The Girvan-Newman algorithm was developed in 2002. It uses the concept of “betweenness” to generate community structures. The Girvan-Newman algorithm calculates the “betweenness” of every edge in a group of nodes and removes the edge with the highest “betweenness”. Then, the “betweenness” of the remaining edges is recalculated, and the process is repeated until there are no more edges. The result is a graph that shows the relationships between nodes [4, 5].
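The edge-removal loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not the script used in this project: the graph is undirected, unweighted, and free of self-loops; “betweenness” is taken to be shortest-path edge betweenness, computed with a breadth-first search from every node; and the names `edge_betweenness`, `components`, and `girvan_newman_levels` are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge in an undirected, unweighted graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search: count shortest paths and remember predecessors.
        dist = {source: 0}
        paths = {node: 0.0 for node in adj}
        paths[source] = 1.0
        preds = {node: [] for node in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, splitting credit among predecessors.
        credit = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # Every pair of endpoints was counted once from each end.
    return {edge: value / 2.0 for edge, value in score.items()}

def components(adj):
    """Connected components of the graph as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman_levels(edges):
    """Remove the highest-betweenness edge until none remain; record each split."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    levels = [components(adj)]
    while any(adj.values()):
        scores = edge_betweenness(adj)          # recalculated after every removal
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(levels[-1]):
            levels.append(parts)
    return levels

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
levels = girvan_newman_levels(edges)
for parts in levels:
    print(sorted(sorted(part) for part in parts))
```

Each entry of `levels` is one layer of the resulting hierarchy. On the toy graph the bridge edge carries every shortest path between the two triangles, so it is removed first and the second layer is the pair of triangles; the final layer has every node on its own.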

Another flaw with algorithms that maximize modularity involves the “resolution limit”. The resolution limit of modularity takes small, independent community structures and clusters them in large community structures, which loses a significant amount of node identity. The Louvain algorithm does nothing to correct for the resolution limit, and thus the Leiden algorithm was developed with the intent to correct that flaw as well as to discourage disconnected clusters from being identified.

2 The biology and physics of mass cytometry

Mass cytometry, also known as CyTOF (cytometry by time of flight), is a combination of two technologies: flow cytometry and mass spectrometry. Flow cytometry is a tool that uses microbial cell light scatter and fluorescence properties to detect and characterize microbial cells [6]. The term “microbial cells” is a generalized description of microscopically-scaled living things. Some examples of microbial cells are bacteria, archaea, fungi, and protists [7]. Mass spectrometry is a tool that typically consists of three components: an ionization source, a mass analyzer, and an ion detection system. These components are utilized to measure the mass-to-charge ratio of molecules [8]. The mass-to-charge ratio (MCR) expression is [9]:

$$\text{mass-to-charge ratio} = \frac{\text{mass of cation or anion}}{\text{charge of cation or anion}}.$$

The mass cytometry process has multiple steps, which are identified in Figure 3:

Figure 3: The steps of mass cytometry analysis. Source: Elyse Levens

The first step is data collection, which involves collecting the samples of cells that are going to be run during the CyTOF assay. The second step is the running of the collected samples. After the samples are run through the CyTOF assay, the data are normalized and manually gated. Existing data are used to normalize the generated data set and the gating process is not intended to exclude relevant data points. The manual gating process ensures that the next step, UMAP (Uniform Manifold Approximation and Projection) analysis, does not include data points that are obvious outliers. The gated data are processed via R scripts that generate data plots called UMAPs. The UMAPs are generated via community structure generating algorithms (usually the Leiden algorithm). Using the data from the UMAP analysis, violin plots are created. Violin plots indicate how much a marker of a type of cell is expressed in specific cluster populations, as shown in Figure 4:

Figure 4: An example of a “good” violin plot. Source: Elyse Levens

The y-axis represents the number of cells that express the specific marker and the x-axis represents the average expression for all of those cells. The above violin plot represents what would be considered a “good” plot, because most of the cells are in the middle of the expression range. A “bad” violin plot would be evident if the upper and lower hump were closer to the lower side of the expression range, as shown in Figure 5 below:

Figure 5: An example of a “bad” violin plot. Source: Elyse Levens

The last step is silhouette score analysis, which is the process of testing whether the clusters generated in the UMAP analysis step are “good”. To do this, the sample data is run through a silhouette score R script. A UMAP is then generated that looks similar to a heat map. When a node has a silhouette score of 1, that node has been placed in the correct cluster and similar nodes are connected to and/or surrounding that node. If a node has a silhouette score of -1, it is most likely disconnected from the cluster in which it was placed. The UMAP data plots generated from the silhouette analysis are not separated by cluster and only show a gradient from connectedness to disconnectedness. Figure 6 shows an example of a UMAP:

Figure 6: UMAP analysis of mouse samples. Source: Austin Keeler, Ph.D. (Deppmann Lab, University of Virginia)

The run in Figure 6 identified 17 clusters, indicated by different colors. Figure 7 shows the silhouette score analysis of the UMAP data:

Figure 7: Silhouette score analysis of mouse samples. Source: Austin Keeler, Ph.D. (Deppmann Lab, University of Virginia)

The scores range from -1 to 1, with blue indicating a silhouette score near -1 and yellow indicating a silhouette score near 1. The significantly larger population in the upper left has pockets of silhouette scores near 1 that are surrounded by some green and some blue. This means that the silhouette analysis is unsure if those cells are similar to their neighboring cells, and thus it outputs a lower silhouette score ranging from 0 to -1.
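For reference, the silhouette score of a single cell can be computed directly from distances. The sketch below is a plain-Python illustration, not the R script used to produce Figure 7. It assumes Euclidean distance between cells and uses the standard definition $s = (b - a)/\max(a, b)$, where $a$ is a cell’s mean distance to the other cells in its own cluster and $b$ is its mean distance to the nearest other cluster. The coordinates and labels are made up for the example.

```python
from math import dist

def silhouette_scores(points, labels):
    """Per-point silhouette score, s = (b - a) / max(a, b), with Euclidean distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette scores need at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for a cluster with a single member
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

# Two tight groups of cells; the last cell sits in group A but is labeled B.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (0.0, 0.5)]
labels = ["A", "A", "B", "B", "B"]
scores = silhouette_scores(points, labels)
for point, label, score in zip(points, labels, scores):
    print(point, label, round(score, 3))
```

The mislabeled cell receives a score close to -1 because it is far from the cluster it was assigned to and close to the other one, which is the situation the blue regions of Figure 7 indicate.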

The result of all of these analyses is a map of all of the different types of cells present at different time points. These cells include stem cells, T (thymus) cells, mechanoreceptors, proprioceptors, tumor cells, NK (natural killer) cells, progenitor cells, glia, and others. The issue with needing to gather all of these types of cells together for each time point (e.g. E12.5, P2, etc.) in order to figure out what populations are present at which time point is that, if the community-generating algorithm incorrectly clusters cells together, it is extremely difficult to determine which types of cells are in those clusters of mixed identities. An algorithm that always correctly places proprioceptors with other proprioceptors, progenitors with other progenitors, and so on, is the objective. While the Leiden algorithm is an improvement over the Louvain algorithm, Leiden still produces mixed clusters of cells of unknown identities. By combining Leiden and Girvan-Newman algorithms, I hope to minimize or eliminate the number of mixed/unknown clusters so that mass cytometry analysis produces “good” clusters of clear identity.

The specific type of mass spectrometry used in mass cytometry is time-of-flight mass spectrometry (TOFMS). Mass cytometry can generally be described as the combination of flow cytometry and TOFMS. Where flow cytometry uses photons of light from fluorochromes to identify samples of cells and develop flow cytometry data plots [10], mass cytometry uses heavy metal ion tags [11]. The heavy metal ion tags are identified via another type of mass spectrometry called inductively coupled plasma mass spectrometry (ICP-MS), which uses inductively coupled plasma to ionize the particles. Inductively coupled plasma is composed of a pure gas, generally argon, and, via an induction coil, energy is coupled to the gas to form the plasma (or, the ionization source) [14, 25]. Other gases used less frequently in inductively coupled plasma are helium and nitrogen [25]. Inductively coupled plasma decomposes a sample of molecules into its elements, which are then transformed into ions [14, 18]. The coupling process is completed via moving a high frequency electric current through an induction coil, such as a Tesla coil, which generates an oscillating magnetic field. A spark from the Tesla coil starts the ionization of the argon. The ions resulting from this initial spark interact with the oscillating magnetic field (called collision excitation) and generate more energy for the continual ionization of argon. The continuous collision excitation correlates with a steep temperature increase. The plasma is formed quickly with this temperature increase and the subsequent high electron density [25].

The physical process used in time-of-flight mass spectrometry is as follows: particles in the sample are ionized and the ionized particles are sent through a mass filter. The ions are then accelerated to a higher kinetic energy before being introduced to a hexapole collision cell. Note that the hexapole collision cell is the most commonly used reaction/collision cell in ICP-MS, and its main components are 6 metal rods in a circular pattern (Fig. 8).

Figure 8: Hexapole collision cell with six metal rods and ion stability region (represented in blue). Source: [24] “Collision/Reaction Cells…”

In the hexapole collision cell, an application of radio frequency (RF) voltage to the metal rods creates electromagnetic fields that gather ions of a particular mass in the center of the rods [13].

From Ohm’s Law, which states that $V = IR$ where $V$ is voltage, $I$ is current, and $R$ is resistance, and Ampere’s Law, which states that $\oint B \, dl = \mu_0 I$, where $B$ is the magnetic field, $dl$ is a portion of the closed path around $I$ along which the line integral is taken, and $\mu_0$ is the permeability of free space, we can see that the magnetic field and voltage are directly correlated via the current. Provided that the resistance does not change, an increase in voltage causes an increase in current, which causes an increase in magnetic field. Thus, the RF voltage application to the hexapole rods creates magnetic fields.

After the RF voltage application, a collision gas is released, which causes collision induced dissociation [12]. Via transfer ion optics, the ion beam is configured to be as parallel to the collision cell as possible and then sent into another vacuum space. There, the time of flight of the ion through the vacuum space determines its mass [13]. Figure 9 is a representation of a time-of-flight mass spectrometer [13]:

Figure 9: A diagram of a time-of-flight mass spectrometer. Source: [13] “Time-of-Flight Mass Spectrometry.”

In summary, mass cytometry is the combination of ICP-MS, TOFMS, and flow cytometry.


3 Analysis process

I first analyzed experimental data from a mass cytometry project that is currently being completed by Dr. Austin Keeler in the Deppmann Lab at the University of Virginia, running the data through the Leiden algorithm and, subsequently, through the Girvan-Newman algorithm. I then analyzed the math behind the two methods and combined them to analyze the optimization of modularity.

The optimization of the clustering process was measured using silhouette analysis, which generates silhouette coefficients that indicate whether a node is closely related to its neighboring nodes. As previously discussed, a silhouette coefficient near -1 indicates that the sample is disconnected from its assigned cluster, and a silhouette coefficient near 1 indicates that the sample is connected to its assigned cluster. With an increasing number of clusters, the silhouette coefficient decreases [15, 16].

This analysis is directly useful to the biology community because, if the combination of the Leiden and Girvan-Newman algorithms provides “good” community structure information, mass cytometry will be inherently more useful. It isn’t well known which community structure generating algorithms produce better clusters. My work offers a comparison between the algorithms as well as a combined algorithm with, potentially, wide application throughout the biology community. Currently, mass cytometry analysis involves one community-structure generating script (usually the Leiden script). However, with the added information provided by the Girvan-Newman method, analysis of distinct clusters will become much clearer. I analyzed the mathematics behind the physical properties of mass cytometry to better communicate the mathematical processes involved in each community structure generating script as well as to provide foundational information.

4 The underlying math and physics of TOFMS

As previously stated, mass cytometry can be described as the combination of TOFMS, ICP-MS, and flow cytometry. The physics behind ICP-MS was explored in section 2, and I will expand upon the physics of TOFMS in this section. Time-of-flight mass spectrometry (TOFMS) applies many physics concepts in order to work well as a biological tool. The first step of TOFMS is to ionize the sample in an ionization chamber, as previously described. The next step is for the ionized particles to be fired into what is called the “drift region”. This occurs via exposure to a strong electrical field in an acceleration chamber. In the drift region, the ions will have kinetic energy $K$ proportional to their charge $z$ such that $K = Uz$, where $U$ is the potential difference in the acceleration chamber. Now, for an ion with mass $m$, we have that the velocity $v$ of that ion is $v = \sqrt{\frac{2Uz}{m}}$. Let the length of the drift region be $L$. For the kinetic energy $K = \frac{1}{2}mv^2$, replace $v$ with $v = \frac{L}{t}$ to obtain

$$K = \frac{1}{2} m \left( \frac{L}{t} \right)^2$$

$$K = \frac{mL^2}{2t^2}$$

$$t^2 = \frac{mL^2}{2K}$$

$$t = L\sqrt{\frac{m}{2K}}.$$

Then, since $K = Uz$, the time it takes for an ion to reach the detector is $t = L\sqrt{\frac{m}{2Uz}}$. This is important to note, because this equation indicates that the time it takes for an ion to reach the detector is proportional to $\sqrt{m/z}$, which describes the scanning process that develops the desired result: maps (or graphs) of vertices that cluster in different patterns over time [17].
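The flight-time relation $t = L\sqrt{m/(2Uz)}$ is easy to check numerically. The sketch below is illustrative only: the drift length, acceleration potential, and ion masses are hypothetical values chosen for the example, not settings of the instrument used in this project, and the function name `flight_time` is an assumption. The charge $z$ is converted to coulombs and the mass to kilograms so that the result is in seconds.

```python
from math import sqrt

ELEMENTARY_CHARGE_C = 1.602176634e-19    # coulombs per unit charge
ATOMIC_MASS_UNIT_KG = 1.66053906660e-27  # kilograms per atomic mass unit

def flight_time(mass_amu, charge_number, drift_length_m, potential_v):
    """t = L * sqrt(m / (2 U z)), with m in kilograms and the charge z in coulombs."""
    mass_kg = mass_amu * ATOMIC_MASS_UNIT_KG
    charge_c = charge_number * ELEMENTARY_CHARGE_C
    return drift_length_m * sqrt(mass_kg / (2.0 * potential_v * charge_c))

# Hypothetical instrument: 1 m drift region, 2000 V acceleration potential.
for mass in (141.0, 175.0):
    t = flight_time(mass, charge_number=1, drift_length_m=1.0, potential_v=2000.0)
    print(f"{mass:.0f} amu, charge +1: {t * 1e6:.2f} microseconds")
```

Because $t$ scales with $\sqrt{m/z}$, quadrupling the mass at fixed charge doubles the flight time, and two ions with the same mass-to-charge ratio arrive together regardless of their individual masses.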

A challenge of non-conventional TOFMS is the overlapping of vertices/nodes from different time points. In conventional TOFMS, there is an adequate amount of time between scans so that the fastest ion from the subsequent scan does not overlap the slowest ion from the previous scan. After the ions are accelerated through the acceleration chamber and reach the detector, a continuous electrical signal is generated. The continuous signal is then sampled into a discrete electrical signal, which is referred to as a scan. The goal of taking thousands of these scans is to estimate the $\sqrt{m/z}$ spectrum. An estimate of the $\sqrt{m/z}$ spectrum using a multitude of scans is called an acquisition. Acquisitions are used to generate the graphs of clustered nodes.

To expand the previous concept of allowing enough time between scans for zero overlap, let the time it takes to generate an acquisition $(T_{\text{acquisition}})$ be bounded below by the number of scans collected for each acquisition $(N)$ such that

$$T_{\text{acquisition}} \geq N \cdot T_{\text{scan}} \geq N \cdot \frac{L}{\sqrt{2U}} \left( \sqrt{(m/z)_{\max}} - \sqrt{(m/z)_{\min}} \right)$$

where $T_{\text{scan}}$ is the time it takes for each scan to be completed [17].

Now, define the time difference for two ions hitting the detector as

$$t_2 - t_1 = \frac{L}{\sqrt{2U}} \left( \sqrt{m_2/z_2} - \sqrt{m_1/z_1} \right).$$

Then, $L$ is directly correlated to $T_{\text{acquisition}}$. As $L$ increases, $T_{\text{acquisition}}$ increases, which means that the speed of the scans decreases. Also, as $L$ increases, the distance between ions increases, which improves the accuracy of the detection of the mass of each ion [17].
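The trade-off between drift length and acquisition time can be made concrete with a short calculation. The sketch below evaluates the lower bound on $T_{\text{scan}}$ and $T_{\text{acquisition}}$ from the inequality above. It is illustrative only: the function names are assumptions, the mass-to-charge ratios must be supplied in kilograms per coulomb for the result to be in seconds, and the instrument values and mass range are hypothetical.

```python
from math import sqrt

# One atomic mass unit per elementary charge, expressed in kilograms per coulomb.
AMU_PER_CHARGE = 1.66053906660e-27 / 1.602176634e-19

def min_scan_time(drift_length, potential, mz_min, mz_max):
    """Arrival-time spread between the slowest and the fastest ion of one scan."""
    return drift_length / sqrt(2.0 * potential) * (sqrt(mz_max) - sqrt(mz_min))

def min_acquisition_time(n_scans, drift_length, potential, mz_min, mz_max):
    """Lower bound on T_acquisition for N scans that must not overlap."""
    return n_scans * min_scan_time(drift_length, potential, mz_min, mz_max)

# Hypothetical settings: 2000 V potential, ions between 75 and 209 amu per unit charge.
mz_low, mz_high = 75.0 * AMU_PER_CHARGE, 209.0 * AMU_PER_CHARGE
for length in (0.5, 1.0, 2.0):
    bound = min_acquisition_time(1000, length, 2000.0, mz_low, mz_high)
    print(f"L = {length} m: 1000 scans need at least {bound * 1e3:.3f} ms")
```

The bound grows linearly with both $L$ and $N$, which is the sense in which a longer drift region slows the scans while spreading the ions further apart in time.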

5 The structure of community structure generating algorithms

As previously discussed, the Leiden algorithm has three steps: the movement of nodes into partitions, partition refinement, and the aggregation of the refined networks. The process of the local movement of nodes involves creating multiple queues of every node in the data set and analyzing each node’s potential for increasing the quality function. Each queue represents the nodes in one community. Nodes are added to their queue in a random order. If the movement of a node into a new community structure increases the quality function, then the node is removed from the queue and placed in the aforementioned community structure. After that node is placed in a different community, all of the nodes nearby that are not already in the queue of the original community are placed in that queue. If the quality function will not be increased by moving the node to another community, then that node is removed from the queue and remains in its current community. When the queues are empty, the movement process is complete [3].
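The movement phase described above can be sketched in a few lines of Python. This is a simplified illustration, not the Leiden implementation used in this project. It makes several assumptions: it keeps one shared queue rather than one queue per community, it uses modularity (with each internal edge counted from both endpoints) as the quality function, it recomputes the full quality function for every candidate move (simple but slow), it does not consider moving a node to a new empty community, and community labels are assumed to be sortable. The names `quality` and `local_move` are hypothetical.

```python
import random
from collections import deque

def quality(edges, community_of, gamma=1.0):
    """Modularity, with each internal edge counted from both endpoints."""
    m = len(edges)
    inside, degree = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 2
    return sum(inside.get(c, 0) - gamma * k * k / (2 * m)
               for c, k in degree.items()) / (2 * m)

def local_move(edges, community_of, gamma=1.0, seed=0):
    """Queue-based local moving: keep a move only if it raises the quality."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    community_of = dict(community_of)     # work on a copy of the input partition
    nodes = list(neighbors)
    random.Random(seed).shuffle(nodes)    # nodes enter the queue in random order
    queue, queued = deque(nodes), set(nodes)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        current = community_of[node]
        best, best_quality = current, quality(edges, community_of, gamma)
        nearby = {community_of[n] for n in neighbors[node]} - {current}
        for candidate in sorted(nearby):
            community_of[node] = candidate
            q = quality(edges, community_of, gamma)
            if q > best_quality + 1e-12:
                best, best_quality = candidate, q
        community_of[node] = best
        if best != current:
            # Revisit neighbors outside the new community that are not queued.
            for n in neighbors[node]:
                if community_of[n] != best and n not in queued:
                    queue.append(n)
                    queued.add(n)
    return community_of

# Toy graph: two triangles joined by a bridge; every node starts on its own.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
start = {node: node for node in range(6)}
moved = local_move(edges, start)
print(moved)
print(quality(edges, start), quality(edges, moved))
```

Because a move is kept only when it strictly raises the quality function, the loop must terminate, and the returned partition always scores at least as high as the starting one. Which communities form can depend on the random order in which nodes are visited.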

After the nodes are moved, the initial partitions are created. A refined partition $P_{\text{refined}}$ is created by setting $P_{\text{refined}}$ to a partition in which all nodes are in their own community. This type of partition is called a singleton partition. After this singleton partition is generated, nodes are merged with their neighbors. We now have many different subcommunities of merged nodes in $P_{\text{refined}}$. There are some nodes that are not merged with their neighbors because they are the only node in their individual community, and these nodes are merged with another community. These merging actions are done only within each community of the original partition, $P$. The result of the refinement process is that the initial community structures in $P$ have been split into other community structures in $P_{\text{refined}}$ [3].

The merging process is intricate and “good” merging actions are crucial to developing “good” community structures. This is why the Leiden algorithm does not merge nodes with communities that decrease the quality function modularity. The merging actions that can increase modularity on any scale (small or large) are the only merging actions that are considered. Beyond the requirement that the quality function (modularity) is increased on some level, the merging actions are random. If the Leiden algorithm merged nodes with clusters that generate the largest increase in modularity instead of picking randomly, the refined partition would not be optimized. The randomness allows for each community in the initial partition to be explored without potential loss of identity [3].

6 The graph theory underlying the Leiden algorithm

First, graph theory concepts and definitions must be introduced in order to understand the underlying code for the Leiden algorithm. As previously explained, a graph is a network of vertices and edges. A vertex is also referred to as a node. An edge is a connection between two nodes. Let $G = (V, E)$ be a graph where $n = |V|$ represents the number of nodes of $G$ and $m = |E|$ represents the number of edges of $G$. For the Leiden algorithm, we will assume that $G$ is always unweighted. An unweighted graph has edges that do not have values associated with them [21]. The reason why $G$ must be unweighted is because, if edges are assigned values, the Leiden algorithm will use those edge values to sort nodes as opposed to using the nodes’ potential for increasing modularity to sort nodes. Let $P = \{C_1, \dots, C_r\}$ denote a partition where $r = |P|$ represents the number of communities in $P$. Now, let $C_i \subseteq V$ be a community of nodes (or, a set of nodes) where our nodes of $V$ are the union of all nodes in $C_i$, or $V = \bigcup_i C_i$. We must also require, so that each node belongs to exactly one community, that $C_i \cap C_j = \emptyset$ for all $i \neq j$ [3].

The purpose of the Leiden algorithm is to maximize the quality function of modularity. Define $H(G, P)$ as a quality function. $H(G, P)$ is represented in different ways based on the quality function that is being defined. For example, the quality function for the Constant Potts Model (CPM) is represented differently from the quality function of modularity. The quality function for modularity is represented thusly:

$$H(G, P) = \sum_{C \in P} \left[ E(C, D) - \frac{\gamma}{2m} \binom{\|C\|}{2} \right]$$

where $E(C, D)$ represents the number of edges between community $C$ and community $D$. $E(C, D)$ is given by $E(C, D) = |\{(u, v) \in E(G) \mid u \in C, v \in D\}|$. In the binomial coefficient, the element $\|C\|$ is the recursive size of set $C$. The recursive size of a set represents the number of elements in a set, defined by:

$$\|S\| = \sum_{s \in S} \|s\|,$$

for $s \in S$, where $\|s\| = 1$ if $s$ is not a set. As defined previously, $\gamma$ is the resolution parameter. In other words, $H(G, P)$ uses edge quantity and the resolution parameter to assign a quality to $P$ [3].
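The pieces of this definition translate almost line for line into Python. The sketch below is an illustration rather than code from the Leiden implementation. It assumes that the edge term is evaluated with $D = C$, so that only edges inside each community are counted (an assumption about how the sum over $C \in P$ is meant to be read), it keeps the $\frac{\gamma}{2m}$ factor from the expression above, and it represents communities as Python sets. The names `recursive_size`, `edge_count`, and `quality` are hypothetical.

```python
from math import comb

def recursive_size(s):
    """||S||: the number of non-set elements nested anywhere inside S."""
    if isinstance(s, (set, frozenset)):
        return sum(recursive_size(element) for element in s)
    return 1

def edge_count(edges, C, D):
    """E(C, D): edges of the base graph with one endpoint in C and the other in D."""
    return sum(1 for u, v in edges if (u in C and v in D) or (u in D and v in C))

def quality(edges, partition, gamma=1.0):
    """H(G, P) = sum over C of [E(C, C) - gamma / (2m) * binom(||C||, 2)]."""
    m = len(edges)
    return sum(
        edge_count(edges, C, C) - gamma / (2 * m) * comb(recursive_size(C), 2)
        for C in partition
    )

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
merged = [frozenset({0, 1, 2, 3, 4, 5})]

print(recursive_size(frozenset({frozenset({1, 2}), frozenset({3})})))  # nested sets
print(quality(edges, split, gamma=1.0), quality(edges, merged, gamma=1.0))
print(quality(edges, split, gamma=14.0), quality(edges, merged, gamma=14.0))
```

The last two lines show the role of the resolution parameter on this toy graph: at a low $\gamma$ the single merged community scores higher, while at a high $\gamma$ the two separate triangles score higher, matching the statement that a higher $\gamma$ produces more communities.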

The Leiden algorithm relies on aggregate graphs. Define $G$ as the base graph and $G'$ as the aggregate graph. Let the nodes of $G'$ be the communities in partition $P$ such that $V(G') = P$. In other words, $G'$ is made up of communities rather than individual nodes (Fig. 10).

Figure 10: Example of a base graph and its aggregate graph. Source: Elyse Levens

By definition of the aggregate graph $G'$, the edges of $G'$ are multi-edges. Multi-edges are defined as multiple edges between two nodes (Fig. 11).

Figure 11: A multigraph with multi-edges. Source: Elyse Levens

A graph with multi-edges is called a multigraph. $G'$ consists of aggregated communities from the base graph $G$, and, thus, the number of edges between two nodes in $G'$ is equivalent to the number of edges between nodes in the two corresponding communities in $G$. In other words, each node of $G'$ stands for a community of $G$, so two nodes in $G'$ are joined by exactly as many edges as run between the two corresponding communities in $G$. The edges in $G'$ are denoted as $E(G')$ such that $E(G') = \{(C, D) \mid (u, v) \in E(G), u \in C, v \in D\}$ where $C, D \in P$. Since the edges in $G'$ are multi-edges, $E(G')$ is called a multiset [3].
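Building the aggregate graph amounts to relabeling each edge of the base graph by the communities of its endpoints and counting repeats. The sketch below is an illustration under stated assumptions: communities are identified by their position in the `partition` list, multi-edges are stored as counts in a `Counter`, and edges inside a community are kept as self-loops (which the definition of $E(G')$ permits, since it allows $C = D$). The name `aggregate` is hypothetical.

```python
from collections import Counter

def aggregate(edges, partition):
    """Multi-edge counts of G', where each community of the partition is one node."""
    index_of = {}
    for index, community in enumerate(partition):
        for node in community:
            index_of[node] = index
    multi_edges = Counter()
    for u, v in edges:
        cu, cv = index_of[u], index_of[v]
        # Store each undirected multi-edge under a single ordered key.
        multi_edges[(min(cu, cv), max(cu, cv))] += 1
    return multi_edges

# Toy base graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = [{0, 1, 2}, {3, 4, 5}]
print(aggregate(edges, partition))
```

Every edge of the base graph appears exactly once in the aggregate graph, so the multi-edge counts always sum to the number of edges in $G$.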

Define $P' = \{\{v\} \mid v \in V(G')\}$ as the singleton partition of $G'$. A singleton partition is a partition in which every community consists of a single node. Thus, each community in $P'$ contains exactly one node of $G'$. Now, in order to ensure that the quality function of modularity gives reliable results for both base and aggregate graphs, define $H(G, P) = H(G', P')$ [3].

For the movement phase of the Leiden algorithm, define $P(v \mapsto C)$ as the partition when the node $v$ is moved to the community $C$. $P$ could also be described as a function that maps $v$ to $C$. It is important to compute the quality function again after the initial movement, and the change in quality function is relevant to discovering if the new placement increased or decreased the quality function. Define

$$\Delta H_P(v \mapsto C) = H(P(v \mapsto C)) - H(P)$$

as the change in quality function for some $P$. When moving a set of nodes $S$ as opposed to individual nodes $v$, we define $\Delta H_P(S \mapsto C)$. This movement of a set of nodes is not a common movement in the Leiden algorithm; however, a mass, consecutive movement of nodes can be considered as a movement of a set of nodes. In particular, when moving a set of nodes to a new (or, empty) community, we write $\Delta H_P(S \mapsto \emptyset)$. This is used when the Leiden algorithm identifies a new cell identity and creates a new cluster of cells with similar identities. Sometimes nodes are not similar to any existing clusters, so the node is moved to a new set, or an empty set notated by $\emptyset$ [3].

∅ When a community consists of two sets and , we have that and

1 2 1 2 . In other words,퐶 let and be disconnected푆 푆 such that there푆 are∪ no 푆 =edges 퐶

1 2 1 2 푆between∩ 푆 = nodes ∅ in the separate sets.푆 Now,푆 define and .

푃 1 푃 2 This allows for a partition to always be improved△ by 퐻 splitting(푆 ↦ ∅ a) community> 0 △ 퐻 into(푆 sets↦ ∅ with) >0 connected nodes because푃 the change in quality function is always positive. When퐶 a set of nodes is mapped to an empty set, an entirely new community is created. Therefore, when and

1 2 are mapped to empty sets, respectively, two new communities are being generated.푆 Since we푆 have defined for the movement of these sets (which were split from an original, common community) that the change in modularity is always positive, the Leiden algorithm will always consider splitting a community to improve modularity.3 This is beneficial to increasing the silhouette score because if the Leiden algorithm didn’t consider splitting communities, some nodes of separate identity might not be identified as separate. The Leiden algorithm will observe a node in the original community with the same amount of scrutiny as any other node, and there will be less of a chance for the Leiden algorithm to identify a new cluster. When splitting a community, new relations are formed between nodes that the Leiden algorithm hadn’t previously considered. Thus, splitting communities is beneficial to increasing modularity and the silhouette score.
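The change in quality function for moving a set of nodes can be illustrated with a short Python sketch. It assumes the standard Newman modularity, $Q = \sum_c [e_c/m - (K_c/2m)^2]$, as the quality function $H$; the function names `modularity` and `delta_h`, the toy graph, and the community labels are hypothetical. The example places two disconnected triangles in one community and checks that moving either triangle to a new, empty community gives a positive change.

```python
from collections import defaultdict


def modularity(edges, partition):
    """Newman modularity Q = sum_c [ e_c / m - (K_c / 2m)^2 ] (assumed H)."""
    m = len(edges)
    internal = defaultdict(int)  # e_c: edges with both ends in community c
    degree = defaultdict(int)    # K_c: summed degree of the nodes in c
    for u, v in edges:
        degree[partition[u]] += 1
        degree[partition[v]] += 1
        if partition[u] == partition[v]:
            internal[partition[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def delta_h(edges, partition, nodes, target):
    """Change in quality when the set `nodes` is moved to community `target`."""
    moved = dict(partition)
    for v in nodes:
        moved[v] = target
    return modularity(edges, moved) - modularity(edges, partition)


# Hypothetical community C made of two disconnected triangles S1 and S2.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
partition = {v: "C" for v in range(6)}
s1, s2 = {0, 1, 2}, {3, 4, 5}

gain_s1 = delta_h(edges, partition, s1, "new")
gain_s2 = delta_h(edges, partition, s2, "new")
print(gain_s1, gain_s2)
```

Both gains are positive in this toy case, which matches the requirement that splitting a community into its disconnected parts always improves the partition.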


7 The graph theory underlying the Girvan-Newman algorithm

The Girvan-Newman algorithm was proposed with the purpose of optimizing general hierarchical clustering methods. Traditional hierarchical clustering was completed via calculating the number of connections a node has and sorting nodes of similar connections together. Traditional clustering is lacking in this way, because nodes that only have a single connection (or, edge) could still be a part of a community. Traditional hierarchical clustering only observes that a node with a single connection has a small number of connections, and will cluster it in a singleton community as opposed to clustering the node in its proper community.

Thus, the Girvan-Newman method seeks to overcome this flaw.4

Instead of using the number of connections to cluster nodes, the Girvan-Newman algorithm utilizes the distance between nodes to remove the edges with the largest distance. In other words, the Girvan-Newman algorithm removes the edges with the most “betweenness”.

In traditional hierarchical clustering, graphs are created by adding nodes with large amounts of connections to communities. In Girvan-Newman clustering, graphs are created by removing edges with large distances (or, weak connections).

In 1977, Linton Freeman explored and defined vertex betweenness (or, node betweenness).22 Freeman defined the betweenness centrality of a vertex as the number of shortest paths between pairs of other vertices that pass through that vertex. Girvan and Newman applied Freeman’s vertex betweenness centrality theory to edges. Girvan and Newman defined the betweenness centrality of an edge as the number of shortest paths between pairs of vertices that run along that edge. When there are multiple shortest paths between a pair of vertices, all such paths are weighted equally so that every path is considered equally. For a graph made up of communities that are joined by only a few edges, the shortest paths between vertices in the same community stay inside that community, while the shortest paths between vertices in different communities must run along one of the few edges that join the communities. Thus, the edges with the highest betweenness will be the edges between communities rather than the edges inside them4 (Fig. 12).

Figure 12: A graph with two communities. The edge with the highest betweenness centrality is identified. Source: Elyse Levens.
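The claim that the edge joining two communities carries the highest betweenness can be checked on a small example. The sketch below is an illustration only: it uses a Brandes-style accumulation over breadth-first searches, which is an assumption and not necessarily how the scripts in this thesis compute betweenness, and the helper names and toy graph (two triangles joined by one edge) are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, order = {s: 0}, []
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path credit onto edges.
        credit = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # Each pair of vertices was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


def connected_components(adj):
    """Return the connected components of the graph as a list of sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


# Hypothetical graph: two triangles joined by the single edge (2, 3).
adj = build_adjacency([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
scores = edge_betweenness(adj)
top_edge = max(scores, key=scores.get)

# Remove the edge with the highest betweenness, as Girvan-Newman would.
u, v = tuple(top_edge)
adj[u].discard(v)
adj[v].discard(u)
parts = connected_components(adj)
print(sorted(top_edge), parts)
```

In this toy graph a single removal already separates the two triangles; on real data the betweenness values would be recalculated and the removal repeated.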

The step process outlined by Girvan and Newman is: (1) calculate the betweenness of every edge in the graph; (2) identify and remove the edge with the highest betweenness; (3) after obtaining the new graph with the removed edge, recalculate the betweenness of every remaining edge; (4) repeat from step 2 until there are no more edges to remove.

In 2001, Newman proposed a method of calculating betweenness that was later used in the Girvan-Newman algorithm. Newman utilizes breadth-first search in the calculation of betweenness. First, define the shortest path between two vertices to be a geodesic. Note that there can be multiple geodesics between a pair of vertices. Newman’s breadth-first search algorithm is as follows: say we have two vertices, $i$ and $j$. First, we let vertex $j$ have a distance of zero, as in, $j$ is a distance of zero away from itself. Let $0 \to d$. Now, for any vertex $k$ that has distance $d$ and is connected to any vertex $l$ with an unassigned distance, let the distance of $l$ be $d + 1$. Then, define $k$ as the predecessor of $l$. In other words, $k$ is considered the “original” vertex. If $l$ has already been assigned the distance $d + 1$ through another vertex, $k$ is still recorded as an additional predecessor of $l$. Once every vertex at distance $d$ has been handled, let $d + 1 \to d$. Continue to assign distance $d + 1$ to vertices $l$ until there are no more vertices without assigned distances. From here, the geodesic from vertex $i$ to vertex $j$ can be obtained by starting at $i$ and following edges to the predecessor(s) of $i$, and then to their predecessors, until $j$ is found. If there is only one predecessor at every step, then that path is the geodesic from $i$ to $j$. If a vertex has multiple predecessors, then each path to the different predecessors on the way to $j$ must be followed in order to obtain the geodesic(s). This is Newman’s breadth-first search algorithm.23
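The breadth-first search described above can be sketched in Python as follows. The function names `bfs_predecessors` and `geodesics`, the variable `nbr` (standing in for the vertex $l$), and the four-vertex example graph are hypothetical; the sketch assumes an unweighted, connected graph given as a dictionary of neighbour lists.

```python
from collections import deque


def bfs_predecessors(adj, j):
    """Assign every vertex its distance from j and its predecessor(s)."""
    dist = {j: 0}
    preds = {j: []}
    queue = deque([j])
    while queue:
        k = queue.popleft()
        for nbr in adj[k]:
            if nbr not in dist:
                # Unassigned vertex: give it distance d + 1, predecessor k.
                dist[nbr] = dist[k] + 1
                preds[nbr] = [k]
                queue.append(nbr)
            elif dist[nbr] == dist[k] + 1:
                # Already at distance d + 1: k is an additional predecessor.
                preds[nbr].append(k)
    return dist, preds


def geodesics(preds, i, j):
    """List every shortest path from i to j by following predecessors."""
    if i == j:
        return [[j]]
    return [[i] + rest for p in preds[i] for rest in geodesics(preds, p, j)]


# Hypothetical square graph: two geodesics connect vertex 3 to vertex 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
dist, preds = bfs_predecessors(adj, 0)
paths = geodesics(preds, 3, 0)
print(dist, paths)
```

Because vertex 3 has two predecessors in this example, following each of them yields a separate geodesic, which is the multiple-predecessor case described in the text.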

The step process outlined by Newman for calculating betweenness is as follows: (1) using Newman’s breadth-first search algorithm, the geodesics to some vertex $j$ are obtained. (2) Let every vertex $k$ be assigned a variable $b_k$, with an initial value of 1. (3) Systematically, going from the farthest vertex from $j$ to the closest vertex from $j$, $b_k$ will be added to the corresponding variable of the predecessor of $k$. For $k$ having multiple predecessors, $b_k$ is divided equally between the predecessor vertices. (4) After every vertex has gone through the process in step 3, the $b_k$ values will represent the number of geodesics to vertex $j$ that run through each respective vertex $k$. Now, the $b_k$ are added to a running total kept for each respective vertex. Then, for all $n$ possible values of vertex $j$, the aforementioned addition of $b_k$ to the running totals is calculated. The final values are the betweenness values of the vertices.23
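The four steps above can be sketched in Python as follows. The function name `newman_betweenness` and the three-vertex path graph are hypothetical, and the sketch assumes an unweighted, connected graph. Under this convention every vertex starts with $b_k = 1$, so the totals also count the paths that start or end at a vertex; other conventions subtract those.

```python
from collections import deque


def newman_betweenness(adj):
    """Vertex betweenness by repeated breadth-first search, one per target j."""
    score = {v: 0.0 for v in adj}
    for j in adj:
        # Step 1: breadth-first search from j, recording predecessors.
        dist, preds, order = {j: 0}, {j: []}, []
        queue = deque([j])
        while queue:
            k = queue.popleft()
            order.append(k)
            for nbr in adj[k]:
                if nbr not in dist:
                    dist[nbr] = dist[k] + 1
                    preds[nbr] = [k]
                    queue.append(nbr)
                elif dist[nbr] == dist[k] + 1:
                    preds[nbr].append(k)
        # Step 2: every vertex k starts with b_k = 1.
        b = {k: 1.0 for k in order}
        # Step 3: farthest vertices first, pass b_k to the predecessor(s),
        # dividing it equally when there are several.
        for k in reversed(order):
            for p in preds[k]:
                b[p] += b[k] / len(preds[k])
        # Step 4: add b_k to the running total kept for each vertex.
        for k in order:
            score[k] += b[k]
    return score


# Hypothetical path graph 0 - 1 - 2: the middle vertex scores highest.
adj = {0: [1], 1: [0, 2], 2: [1]}
score = newman_betweenness(adj)
print(score)
```

The middle vertex of the path receives the largest total because every geodesic between the two end vertices runs through it.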


8 Results and discussion

In this section, a data set of approximately 2,000 neurons was used. The data set was run through the Leiden algorithm for clustering, and a silhouette score analysis was completed to test the “correctness” of the clustering. The Leiden-clustered data set was then run through the Girvan-Newman algorithm, and the number of definitive clusters was identified. The number of clusters determined by the Leiden algorithm was compared to the number of clusters determined by the sequential algorithms to conclude whether the two different methods identify the same clusters.

I split the 2,000 neurons into two sub-communities: one community that expressed TrkB and one community that expressed TrkC. Both TrkB and TrkC are neuronal markers. The entire data set included samples from embryonic day 11 (E11) to postnatal day 4 (P4). This means that there are both immature and mature neurons in the data set. For every neuronal cell type, there will be mature and immature neurons. After running the original data set of neurons expressing TrkB through the UMAP analysis and the Leiden clustering, a UMAP is obtained (Fig. 13).

Figure 13: UMAP Analysis of TrkB neuron community. Eleven clusters were identified. Source: Elyse Levens

In the TrkB community, eleven clusters were identified. This means that eleven different neuron types were identified by the Leiden algorithm. The silhouette score analysis of the clusters in Figure 13 showed an approximate range of silhouette scores from 0.4 to -0.4 (Fig. 14). Note that a silhouette score closer to 1 indicates correct clustering while a silhouette score closer to -1 indicates incorrect clustering.

Figure 14: Silhouette score analysis of TrkB neuron community. Source: Elyse Levens
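To make the interpretation of the silhouette score concrete, the following pure-Python sketch computes the score of each point as $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in the same cluster and $b$ is the lowest mean distance to the points of any other cluster. The function name `silhouette_samples`, the four toy points, and both labelings are hypothetical, and the sketch assumes at least two clusters and Euclidean distance; it is not the analysis used to produce Figure 14.

```python
import math


def silhouette_samples(points, labels):
    """Return one silhouette score per point (Euclidean distance)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect the distances from point i to every other point, by cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            # A point alone in its cluster is given a score of 0.
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical data: two tight pairs of points that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette_samples(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_samples(points, [0, 1, 0, 1])   # pairs split apart
print(good, bad)
```

The labeling that keeps each tight pair together produces positive scores, while the labeling that splits the pairs produces negative scores, matching the reading of the score given above.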

There are several merged communities in the silhouette score analysis in Figure 14 that show incorrect/disconnected clustering, as expected. By observation, it is easy to see that the silhouette score is higher in the middle of the clusters defined in Figure 13.

After running the data set from Figure 13 through the Girvan-Newman algorithm, I obtain four nodes. All other edges have been removed, and the script recognizes four nodes as the predecessor nodes (Fig. 15).

Figure 15: Girvan-Newman analysis of TrkB community data from Figure 13. Source: Elyse Levens

This means that the Girvan-Newman algorithm observed all nodes and edges in the data set run through the Leiden algorithm and sequentially removed edges with the highest betweenness until no more edges could be removed. The values displayed on the nodes in the graph in Figure 15 are the y-values from the corresponding UMAP (Fig. 13). Although there are four predecessor nodes, there are only two distinct y-values. This is because there are immature and mature neurons of each cell type. So, for the y-value of 4.28163975663705, there are two predecessor nodes because, on the UMAP, there are two nodes that overlay each other, with one node representing a mature cell and the other node representing an immature cell.

Following the same argument for the y-value of 7.59049321051421, there are two predecessor nodes with this value because they are of the same neuron cell type, where one node is immature and the other node is mature.

This sequential analysis of the Leiden algorithm followed by the Girvan-Newman algorithm concludes that there are two distinct clusters in our TrkB neuron community. All nodes in the TrkB community reduce to the four nodes shown in Figure 15, which are, in fact, only two distinct neuron cell types. This means that there are two distinct identities in this community of neurons. Thus, there are fewer clusters generated by the combined algorithms than there are generated by only the Leiden algorithm.

In the TrkC community, eleven clusters were identified (Fig. 16).

Figure 16: UMAP Analysis of TrkC neuron community. Eleven clusters were identified. Source: Elyse Levens

This means that, in the TrkC community, there are eleven different types of neurons. The silhouette score analysis revealed a similar conclusion, by observation (Fig. 17).

Figure 17: Silhouette score analysis of TrkC neuron community. Source: Elyse Levens

Similar to the analysis of the TrkB community, the silhouette score is higher closer to the middle of the identified clusters in the TrkC community UMAP analysis.

After running the data set from Figure 16 through the Girvan-Newman algorithm, four nodes are obtained (Fig. 18).

Figure 18: Girvan-Newman analysis of TrkC community data from Figure 16. Source: Elyse Levens

This means that the Girvan-Newman algorithm removed all edges with highest betweenness and identified these four predecessor nodes. Following the same reasoning for the TrkB community of neurons, there are only two distinct y-values. This means that for the two nodes with the y-value of 7.14902056711624, one is mature and the other is immature. The same is said for the two nodes with the y-value of -1.6379974636. Thus, via the combined algorithms, the TrkC community has two distinct neuron identities (or, two clusters). Once more, the combined algorithms identify fewer clusters than the Leiden algorithm by itself.

In conclusion, the Leiden algorithm detects differences between nodes that the Girvan-Newman algorithm does not. The combined algorithms identify a smaller number of clusters compared to the Leiden algorithm. This can indicate one of two situations: (1) that the combined algorithms cut important edges or (2) that the combined algorithms better sort mismatched nodes into their correct community, thus ridding the UMAP of unneeded cluster identities. Let us consider the TrkB community. The Leiden algorithm identified eleven distinct cluster identities. The silhouette score analysis of those eleven clusters identified a lot of distant connections, or, in other words, a lot of edges with high betweenness. For the first possible situation, this means that there must be more than eleven cluster identities. In order to increase the silhouette score for the TrkB data set, there need to be more cluster identities for the mismatched nodes to fall into and identify with. However, for the second possible situation, many edges with high betweenness means that there must be fewer than eleven cluster identities. In order to increase the silhouette score, there need to be fewer cluster identities and more accurate clustering into two identities as opposed to eleven identities.

I conclude that this combined method of the Leiden and the Girvan-Newman algorithms has revealed further issues with only using the Leiden algorithm. It was already known that the Leiden algorithm occasionally clustered cell types incorrectly. By utilizing another clustering script, I identified that the Leiden algorithm generates an incorrect number of clusters. The focus of current biologists and mathematicians is improving modularity in clustering scripts by sorting nodes into the correct clusters. However, if the existing clustering scripts are not generating the correct number of clusters in the first place, we need to approach optimizing modularity from a completely different direction. In order to better optimize modularity, clustering scripts must generate the correct number of clusters before sorting nodes into those clusters. Further advances in this line of research should include a focus on utilizing multiple clustering algorithms to first optimize the generation of clusters as opposed to the sorting of nodes.

9 Bibliography

[1] Modularity. www.cs.cmu.edu/~ckingsf/bioinfo-lectures/modularity.pdf.

[2] “Nondeterministic Polynomial Time.” DeepAI, 17 May 2019, deepai.org/machine-learning-glossary-and-terms/nondeterministic-polynomial-time.

[3] Traag, V. A., et al. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports, vol. 9, no. 1, 26 Mar. 2019, 10.1038/s41598-019-41695-z.

[4] Girvan, M., and M. E. J. Newman. “Community Structure in Social and Biological Networks.” Proceedings of the National Academy of Sciences, vol. 99, no. 12, 11 June 2002, pp. 7821–7826, 10.1073/pnas.122653799.

[5] Anderson, Rodney. “The Insight to the Girvan-Newman Algorithm: ‘Detecting Communities in Network Systems’.” Poster Presentation. scholarexchange.furman.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1527&context=furmanengaged

[6] Brehm-Stecher, B. F. “Flow Cytometry.” ScienceDirect, Academic Press, 1 Jan. 2014, www.sciencedirect.com/science/article/pii/B9780123847300001270.

[7] “Microbial Cells.” Imedpub, www.imedpub.com/scholarly/microbial-cells-journals-articles-ppts-list.php#:~:text=Microbial%20cell%20is%20a%20pathogenic.

[8] “What Is Mass Spectrometry?” Broad Institute, 13 Sept. 2010, www.broadinstitute.org/proteomics/what-mass-spectrometry.

[9] Gunawardena, Gamini. “Mass-to-Charge Ratio.” Chemistry LibreTexts, 27 Nov. 2015, chem.libretexts.org/Bookshelves/Ancillary_Materials/Reference/Organic_Chemistry_Glossary/Mass-to-Charge_Ratio.

[10] Bonnevier, Jody, et al. “Physics of a Flow Cytometer.” Flow Cytometry Basics for the Non-Expert, SpringerLink, 2018, pp. 13–24.

[11] “Mass Cytometry.” Institute for Immunity, Transplantation and Infection, iti.stanford.edu/himc/mass-cytometry.html.

[12] “What Is the Function of the Collision Gas (CAD) on QTRAP® System Instruments?” Sciex, sciex.com/support/knowledge-base-articles/what-is-the-function-of-cad-on-qtrap-instruments#:~:text=For%20MS%2FMS%2Dtype%20scans.

[13] “Time-of-Flight Mass Spectrometry.” Agilent Technologies, 2011, www.agilent.com/cs/library/technicaloverviews/Public/5990-9207EN.pdf.

[14] “Inductively Coupled Plasma Mass Spectrometry (ICP-MS) Information – US.” Thermofisher, www.thermofisher.com/us/en/home/industrial/spectroscopy-elemental-isotope-analysis/spectroscopy-elemental-isotope-analysis-learning-center/trace-elemental-analysis-tea-information/inductively-coupled-plasma-mass-spectrometry-icp-ms-information.html.

[15] “Selecting the Number of Clusters with Silhouette Analysis on KMeans Clustering.” Scikit Learn, scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html.

[16] Pedregosa, Fabian, et al. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research, vol. 12, no. 85, 2011, pp. 2825–2830, jmlr.csail.mit.edu/papers/v12/pedregosa11a.html.

[17] Ibrahimi, Morteza, et al. “Accelerated Time-of-Flight Mass Spectrometry.” IEEE Transactions on Signal Processing, vol. 62, no. 15, Aug. 2014, pp. 3784–3798, arxiv.org/pdf/1212.4269.pdf, 10.1109/tsp.2014.2329644.

[18] Bulska, Ewa, and Barbara Wagner. “Quantitative Aspects of Inductively Coupled Plasma Mass Spectrometry.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2079, 28 Oct. 2016, p. 20150369, 10.1098/rsta.2015.0369.

[19] Despalatović, Ljiljana, et al. Community Structure in Networks: Girvan-Newman Algorithm Improvement. May 2014.

[20] Ethiraj, Sendil K., and Daniel A. Levinthal. “Modularity and Innovation in Complex Systems.” SSRN Electronic Journal, vol. 50, no. 2, 2003, 10.2139/ssrn.459920.

[21] “Graphs - Terminology and Representation.” www.radford.edu/~nokie/classes/360/graphs-terms.html.

[22] Freeman, Linton C. “A Set of Measures of Centrality Based on Betweenness.” Sociometry, vol. 40, no. 1, Mar. 1977, p. 35, 10.2307/3033543.

[23] Newman, M. E. J. “Scientific Collaboration Networks. II. Shortest Paths, Weighted Networks, and Centrality.” Physical Review E, vol. 64, no. 1, 28 June 2001, 10.1103/physreve.64.016132.

[24] “Collision/Reaction Cells in ICP-MS: Cell Design Considerations for Optimum Performance in Helium Mode with KED.” Agilent Technologies, Inc., 8 June 2010.

[25] “Inductively Coupled Plasma.” HiQ, Linde Worldwide, 2012, hiq.linde-gas.com/en/analytical_methods/inductively_coupled_plasma.html.