Approximation Algorithms for Connected Maximum Coverage Problem for the Discovery of Mutated Driver Pathways in Cancer


Dorit S. Hochbaum, Xu Rao
Department of IEOR, Etcheverry Hall, University of California, Berkeley, CA 94709, USA
Information Processing Letters 158 (2020) 105940, https://doi.org/10.1016/j.ipl.2020.105940

Article history: Received 28 October 2019; received in revised form 14 February 2020; accepted 24 February 2020; available online 26 February 2020. Communicated by R. Lazic.
Keywords: Approximation algorithms; Connected maximum coverage; Bounded radius subgraph
Research supported in part by NSF award No. CMMI-1760102. E-mail addresses: [email protected] (D.S. Hochbaum), [email protected] (X. Rao, corresponding author). © 2020 Elsevier B.V. All rights reserved.

Abstract. This paper addresses the connected maximum coverage problem, motivated by the detection of mutated driver pathways in cancer. The connected maximum coverage problem is NP-hard and therefore approximation algorithms are of interest. We provide here an approximation algorithm for the problem with an approximation bound that strictly improves on previous results. A second approximation algorithm with faster run time, though worse approximation factor, is presented as well. The two algorithms are then applied to submodular maximization over a connected subgraph, with a monotone submodular set function, delivering the same approximation bounds as for the coverage maximization case.

1. Introduction

We address here the connected maximum coverage problem. Given a collection of subsets, each associated with a node in a graph, and an integer k, the goal is to find up to k subsets the union of which is of largest cardinality, such that the respective nodes in the graph form a connected subgraph. This problem generalizes the maximum coverage problem, which is to find k subsets that jointly cover the most elements.

The motivation for the connected maximum coverage problem is the detection of mutations associated with cancer. The search for driver mutations in cancer has accelerated with recent advances in sequencing technologies. These technologies enable measurements in large cohorts of cancer patients and identify their individual lists of gene mutations. Each mutation in the list is thus associated with a subset of patients whose profiles include this mutation. The goal is to "explain" the cancer with up to k mutations, where explanation means that these mutations jointly cover the largest possible collection of patients. These selected mutations are considered to be the driver mutations that are most associated with the specific type of cancer. The connectivity requirement is motivated by the belief that cancer is a disease of pathways [1]. Pathways are connected subnetworks in a large gene interaction network. Therefore, the goal is to identify mutations that not only deliver large coverage of the set of patients, but also form a connected subnetwork in the interaction network.

In the graph representation of the driver mutation detection problem, each node of the graph corresponds to a protein and its associated gene (mutation), and is thus also associated with a subset of patients whose profiles include the mutation. The edges of the graph represent the pairwise protein-to-protein interactions. The detection of k driver mutations is then the maximum coverage problem on this graph.
Formally, the input to an instance of the connected maximum coverage problem is a graph G = (V, E), where each node v ∈ V is associated with a set S_v ⊂ P (possibly empty), for P a universal set, and an integer k. The goal is to find a subset V' ⊆ V of cardinality less than or equal to k that induces a connected subgraph in G and such that |⋃_{v∈V'} S_v| is maximized.

The connected maximum coverage problem is easily seen to be NP-hard, as it contains the maximum coverage problem [2] as a special case (on a complete graph the connectivity requirement is satisfied by any subset of nodes). The maximum coverage problem was studied by Hochbaum and Pathria [3], who showed that the problem is NP-hard and devised a simple greedy algorithm with an approximation factor of 1 − 1/e = 0.632121.... This approximation factor is the best possible attained in polynomial time for the maximum coverage problem under the assumption that P ≠ NP [4].
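The greedy algorithm of [3] referenced above is simple to state. Below is a minimal Python sketch of that classic greedy for the (unconnected) maximum coverage problem; it ignores the connectivity requirement studied in this paper, and the function name and toy instance are illustrative only.

```python
def greedy_max_coverage(sets, k):
    """Classic greedy for (unconnected) maximum coverage: repeatedly pick
    the set that covers the most yet-uncovered elements."""
    covered, chosen = set(), []
    for _ in range(min(k, len(sets))):
        gains = {v: len(set(elems) - covered) for v, elems in sets.items()}
        best = max(gains, key=gains.get)
        if gains[best] == 0:              # nothing left to gain
            break
        chosen.append(best)
        covered |= set(sets[best])
    return chosen, covered

# Toy instance: 4 candidate sets, budget k = 2.
S = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
print(greedy_max_coverage(S, 2))          # (['a', 'c'], {1, 2, 3, 4, 5, 6})
```

Selecting the largest marginal gain at every step is what yields the 1 − 1/e factor; the difficulty addressed in this paper is making such a selection while keeping the chosen nodes connected in G.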
Problems closely related to the connected maximum coverage problem include the problem of finding a k-node connected subgraph that maximizes the sum of the weights of the nodes. This problem was studied by Hochbaum and Pathria in [5], who showed that it is in general NP-hard, but solvable in polynomial time for a tree graph. Seufert et al. (2010) [6] addressed this NP-hard connected maximum sum of weights problem (on general graphs) and provided a 1/(5(1 + ε))-approximation algorithm. Vandin et al. [7] studied the connected maximum coverage problem and provided an approximation algorithm. Bomersbach et al. [8] formulated the connected maximum coverage problem as an integer programming problem and used a branch and cut algorithm to get exact solutions. Both of these papers that address the connected maximum coverage problem are motivated by the search for driver mutations in cancer.

Our main contribution here is the improvement of the approximation factor for the connected maximum coverage problem, with an algorithm we call the Bounded Radius Algorithm. The approximation factor achieved by Vandin et al. [7] is 1/(c·r_OPT), where c = (2e − 1)/(e − 1) = 2.581976... and r_OPT is the radius of the optimal solution subnetwork. The approximation factor of our Bounded Radius Algorithm is max{(1 − 1/e)(1/r_OPT − 1/k), 1/k}, which is demonstrated to always improve on the factor of 1/(c·r_OPT). We also provide an alternative approximation algorithm, the Greedy Path Algorithm, with approximation factor max{(1 − 1/e)(1/R − 1/k), 1/k}, where R is the radius of the input graph G. The Greedy Path Algorithm is always worse than the Bounded Radius Algorithm in terms of approximation bound, but always better in terms of running time.

Another contribution here is the generalization of the connected maximum coverage problem to a connected submodular maximization problem. It is shown here that our two approximation algorithms apply also to connected submodular maximization, for monotone submodular set functions, delivering the same approximation bounds as for the connected maximum coverage problem.

We use m = |E| for the number of edges and n = |V| for the number of nodes in the graph. We let the distance between two nodes v1 ∈ V and v2 ∈ V be the length of the shortest path between the two nodes, where all edges are of unit length. We denote this distance by d(v1, v2). It is easy to find the distance between node v1 and all other nodes in V in O(m) time using breadth first search (BFS). The level of each node in the BFS tree rooted at v1 is its distance from v1. Additional details about BFS are provided in the next subsection.

Definition 1. The eccentricity of a node v ∈ V is the longest distance between v and any other node in V.

The eccentricity of node v in G can be found using BFS from v: it is equal to the depth of the BFS tree (the deepest level in the tree).

Definition 2. The radius of a graph is the minimum eccentricity among all nodes in the graph. The node v* attaining this minimum is called the center of the graph, and its eccentricity is then the radius of the graph.

Let the radius of graph G be denoted by R(G). Note that the center of the graph is not necessarily unique. It is easy to see that R(G) ≤ n/2, and that this inequality is tight when the graph is an n-node path and the number of nodes n is even. We therefore conclude that,

Lemma 1. The radius of the optimal connected subgraph on k nodes, r_OPT, satisfies r_OPT ≤ k/2.

In order to find the radius and center of a graph we use breadth-first search (BFS). The procedure creates the BFS tree rooted at the node called the "root". The longest distance from the root to another node is attained for a leaf node. Hence, to find the eccentricity of node v it suffices to evaluate the distance from the root v to each one of the leaves (the level of each leaf), and to take the maximum distance. To find the radius of the graph it is sufficient to initiate BFS from each node of the graph, compute the respective eccentricity, and take the minimum. The complexity of BFS is known to be O(m), and therefore finding the radius and the center of a graph takes at most O(nm) steps for a graph on n nodes and m edges.

The parent of a node v in a rooted tree T is denoted by p(v).

The algorithms presented in this paper make calls to a subroutine that merges two sets and measures the size of the union. There are various algorithms of different complexities for finding set unions and the sizes of the unions, which depend on different assumptions on how the input is given. Here, we assume that for any v ∈ V, the subset S_v is represented by an array of ordered indexes of length |S_v|, …
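To make the BFS-based radius computation in the excerpt above concrete, here is a minimal Python sketch under the unit-length-edge assumption; the adjacency-list representation and function names are my own choices, not the paper's.

```python
from collections import deque

def eccentricity(adj, root):
    """One BFS from `root` (O(m) on a connected graph); the eccentricity
    is the depth of the BFS tree. `adj` maps each node to its neighbors."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def radius_and_center(adj):
    """Run BFS from every node and take the minimum eccentricity:
    O(nm) total for a graph on n nodes and m edges."""
    ecc = {v: eccentricity(adj, v) for v in adj}
    center = min(ecc, key=ecc.get)
    return ecc[center], center

# A 4-node path 0-1-2-3: radius 2 = n/2, attained at node 1 (or node 2).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(radius_and_center(path))   # (2, 1)
```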
Recommended publications
  • Fast Semidifferential-Based Submodular Function Optimization
    Fast Semidifferential-based Submodular Function Optimization: Extended Version. Rishabh Iyer ([email protected], University of Washington, Seattle, WA 98195, USA), Stefanie Jegelka ([email protected], University of California, Berkeley, CA 94720, USA), Jeff Bilmes ([email protected], University of Washington, Seattle, WA 98195, USA).
    Abstract: We present a practical and powerful new framework for both unconstrained and constrained submodular function optimization based on discrete semidifferentials (sub- and super-differentials). The resulting algorithms, which repeatedly compute and then efficiently optimize submodular semigradients, offer new and generalize many old methods for submodular optimization. Our approach, moreover, takes steps towards providing a unifying paradigm applicable to both submodular minimization and maximization, problems that historically have been treated quite distinctly. The practicality of our algorithms is important since interest in submodularity, owing to …
    … family of feasible solution sets. The set C could express, for example, that solutions must be an independent set in a matroid, a limited budget knapsack, or a cut (or spanning tree, path, or matching) in a graph. Without making any further assumptions about f, the above problems are trivially worst-case exponential time and moreover inapproximable. If we assume that f is submodular, however, then in many cases the above problems can be approximated and in some cases solved exactly in polynomial time. A function f : 2^V → R is said to be submodular (Fujishige, 2005) if for all subsets S, T ⊆ V, it holds that f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T). Defining f(j|S) := f(S ∪ j) − f(S) as the gain of j ∈ V with respect to S ⊆ V, then f is submodular if and only if f(j|S) ≥ f(j|T) for all S ⊆ T and j ∉ T.
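    As a quick illustration of the diminishing-returns characterization quoted above, the sketch below (toy sets and names of my own choosing, not from this paper) verifies f(j|S) ≥ f(j|T) for a small coverage function, a standard monotone submodular example.

```python
from itertools import combinations

# Coverage function f(S) = size of the union of the chosen sets --
# a standard example of a monotone submodular function (toy data).
SETS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}

def f(S):
    return len(set().union(*(SETS[j] for j in S))) if S else 0

def gain(j, S):                      # f(j | S) in the notation above
    return f(S | {j}) - f(S)

# Diminishing returns: f(j|S) >= f(j|T) whenever S is a subset of T and j is not in T.
ok = all(
    gain(j, S) >= gain(j, T)
    for r in range(len(SETS) + 1)
    for T in (set(c) for c in combinations(SETS, r))
    for s in range(len(T) + 1)
    for S in (set(c) for c in combinations(T, s))
    for j in SETS
    if j not in T
)
print(ok)   # True
```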
  • On Interval and Circular-Arc Covering Problems
    On Interval and Circular-Arc Covering Problems. Reuven Cohen, Mira Gonen.
    Abstract: In this paper we study several related problems of finding optimal interval and circular-arc covering. We present solutions to the maximum k-interval (k-circular-arc) coverage problems, in which we want to cover maximum weight by selecting k intervals (circular-arcs) out of a given set of intervals (circular-arcs), respectively; the weighted interval covering problem, in which we want to cover maximum weight by placing k intervals with a given length; and the k-centers problem. The general sets versions of the discussed problems, namely the general measure k-centers problem and the maximum covering problem for sets, are known to be NP-hard. However, for the one-dimensional restrictions studied here, and even for circular-arc graphs, we present efficient, polynomial time algorithms that solve these problems. Our results for the maximum k-interval and k-circular-arc covering problems hold for any right continuous positive measure on R.
    1 Introduction. This paper presents several efficient polynomial time algorithms for optimal interval and circular-arc covering. Given a set system consisting of a ground set of points with integer weights, subsets of points, and an integer k, the maximum coverage problem selects up to k subsets such that the sum of weights of the covered points is maximized. We consider the case that the ground set is a set of points on a circle and the subsets are circular-arcs, and also the case in which the ground set is a set of points on the real line and the subsets are intervals.
  • Approximate Submodularity and Its Applications: Subset Selection, Sparse Approximation and Dictionary Selection∗
    Journal of Machine Learning Research 19 (2018) 1–34. Submitted 10/16; Revised 8/18; Published 8/18. Approximate Submodularity and its Applications: Subset Selection, Sparse Approximation and Dictionary Selection. Abhimanyu Das ([email protected], Google), David Kempe ([email protected], Department of Computer Science, University of Southern California). Editor: Jeff Bilmes.
    Abstract: We introduce the submodularity ratio as a measure of how "close" to submodular a set function f is. We show that when f has submodularity ratio γ, the greedy algorithm for maximizing f provides a (1 − e^{−γ})-approximation. Furthermore, when γ is bounded away from 0, the greedy algorithm for minimum submodular cover also provides essentially an O(log n) approximation for a universe of n elements. As a main application of this framework, we study the problem of selecting a subset of k random variables from a large set, in order to obtain the best linear prediction of another variable of interest. We analyze the performance of widely used greedy heuristics; in particular, by showing that the submodularity ratio is lower-bounded by the smallest 2k-sparse eigenvalue of the covariance matrix, we obtain the strongest known approximation guarantees for the Forward Regression and Orthogonal Matching Pursuit algorithms. As a second application, we analyze greedy algorithms for the dictionary selection problem, and significantly improve the previously known guarantees. Our theoretical analysis is complemented by experiments on real-world and synthetic data sets; in particular, we focus on an analysis of how tight various spectral parameters and the submodularity ratio are in terms of predicting the performance of the greedy algorithms.
  • Fast and Private Submodular and k-Submodular Functions Maximization with Matroid Constraints
    Fast and Private Submodular and k-Submodular Functions Maximization with Matroid Constraints. Akbar Rafiey, Yuichi Yoshida.
    Abstract: The problem of maximizing nonnegative monotone submodular functions under a certain constraint has been intensively studied in the last decade, and a wide range of efficient approximation algorithms have been developed for this problem. Many machine learning problems, including data summarization and influence maximization, can be naturally modeled as the problem of maximizing monotone submodular functions. However, when such applications involve sensitive data about individuals, their privacy concerns should be addressed. In this paper, we study the problem of maximizing monotone submodular functions subject to matroid constraints in the framework of differential privacy. We provide a (1 − 1/e)-approximation algorithm which improves upon the previous results in terms of approximation guarantee. This is done with an almost cubic number of function evaluations in our algorithm. Moreover, we study k-submodularity, a natural generalization of submodularity. We give the first 1/2-approximation algorithm that preserves differential privacy for maximizing monotone k-submodular functions subject to matroid constraints. The approximation ratio is asymptotically tight and is obtained with an almost linear number of function evaluations.
    1. Introduction. A set function F : 2^E → R is submodular if for any S ⊆ T ⊆ E and e ∈ E \ T it holds that F(S ∪ {e}) − F(S) ≥ F(T ∪ {e}) − F(T). The theory of submodular maximization provides …
  • Maximizing Non-Monotone Submodular Functions
    Maximizing non-monotone submodular functions. Uriel Feige, Vahab S. Mirrokni, Jan Vondrák. April 19, 2007.
    Abstract: Submodular maximization is a central problem in combinatorial optimization, generalizing many important problems including Max Cut in directed/undirected graphs and in hypergraphs, certain constraint satisfaction problems and maximum facility location problems. Unlike the problem of minimizing submodular functions, the problem of maximizing submodular functions is NP-hard. In this paper, we design the first constant-factor approximation algorithms for maximizing nonnegative submodular functions. In particular, we give a deterministic local search 1/3-approximation and a randomized 2/5-approximation algorithm for maximizing nonnegative submodular functions. We also show that a uniformly random set gives a 1/4-approximation. For symmetric submodular functions, we show that a random set gives a 1/2-approximation, which can also be achieved by deterministic local search. These algorithms work in the value oracle model where the submodular function is accessible through a black box returning f(S) for a given set S. We show that in this model, our 1/2-approximation for symmetric submodular functions is the best one can achieve with a subexponential number of queries. For the case that the submodular function is given explicitly (specifically, as a sum of polynomially many nonnegative submodular functions, each on a constant number of elements) we prove that for any fixed ε > 0, it is NP-hard to achieve a (5/6 + ε)-approximation for symmetric nonnegative submodular functions, or a (3/4 + ε)-approximation for general nonnegative submodular functions.
  • Submodular Function Maximization Andreas Krause (ETH Zurich) Daniel Golovin (Google)
    Submodular Function Maximization. Andreas Krause (ETH Zurich), Daniel Golovin (Google).
    Submodularity is a property of set functions with deep theoretical consequences and far-reaching applications. At first glance it appears very similar to concavity; in other ways it resembles convexity. It appears in a wide variety of applications: in Computer Science it has recently been identified and utilized in domains such as viral marketing (Kempe et al., 2003), information gathering (Krause and Guestrin, 2007), image segmentation (Boykov and Jolly, 2001; Kohli et al., 2009; Jegelka and Bilmes, 2011a), document summarization (Lin and Bilmes, 2011), and speeding up satisfiability solvers (Streeter and Golovin, 2008). In this survey we will introduce submodularity and some of its generalizations, illustrate how it arises in various applications, and discuss algorithms for optimizing submodular functions. Our emphasis here is on maximization; there are many important results and applications related to minimizing submodular functions that we do not cover. As a concrete running example, we will consider the problem of deploying sensors in a drinking water distribution network (see Figure 1) in order to detect contamination. In this domain, we may have a model of how contaminants, accidentally or maliciously introduced into the network, spread over time. Such a model then allows us to quantify the benefit f(A) of deploying sensors at a particular set A of locations (junctions or pipes in the network) in terms of the detection performance (such as average time to detection). Based on this notion of utility, we then wish to find an optimal subset A ⊆ V of locations maximizing the utility, max_A f(A), subject to some constraints (such as bounded cost).
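    The sensor-placement objective described above is exactly the setting where the (lazy) greedy algorithm for monotone submodular maximization is typically applied. The following is a generic lazy-greedy sketch of my own, not code from the survey; the objective and data in the usage example are toy placeholders.

```python
import heapq

def lazy_greedy(f, ground, k):
    """Lazy (accelerated) greedy for maximizing a monotone submodular f
    under a cardinality constraint |S| <= k: cached marginal gains are
    only re-evaluated when an element reaches the top of the heap."""
    S, fS = [], 0.0
    heap = [(-f({e}), e, 0) for e in ground]      # (negated cached gain, element, round)
    heapq.heapify(heap)
    for t in range(1, min(k, len(ground)) + 1):
        while True:
            neg_gain, e, stamp = heapq.heappop(heap)
            if stamp == t:                        # cached gain is current for this round
                S.append(e)
                fS += -neg_gain
                break
            fresh = f(set(S) | {e}) - fS          # re-evaluate the marginal gain
            heapq.heappush(heap, (-fresh, e, t))
    return S, fS

# Toy coverage-style utility standing in for the detection benefit f(A).
COVER = {"s1": {1, 2}, "s2": {2, 3, 4}, "s3": {4, 5}}
f = lambda A: float(len(set().union(*(COVER[a] for a in A)))) if A else 0.0
print(lazy_greedy(f, list(COVER), 2))   # (['s2', 's1'], 4.0)
```

    By submodularity, a stale cached gain can only overestimate the current gain, so whenever the popped element's cached gain is up to date it is guaranteed to be the best choice; this is what makes the lazy evaluation safe.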
  • Maximization of Non-Monotone Submodular Functions
    Maximization of Non-Monotone Submodular Functions. Jennifer Gillenwater ([email protected]). University of Pennsylvania, Technical Reports (CIS), MS-CIS-14-01, January 2014. https://repository.upenn.edu/cis_reports/988
    Abstract: A litany of questions from a wide variety of scientific disciplines can be cast as non-monotone submodular maximization problems. Since this class of problems includes max-cut, it is NP-hard. Thus, general purpose algorithms for the class tend to be approximation algorithms. For unconstrained problem instances, one recent innovation in this vein includes an algorithm of Buchbinder et al. (2012) that guarantees a ½-approximation to the maximum. Building on this, for problems subject to cardinality constraints, Buchbinder et al. (2014) offer guarantees in the range [0.356, ½ + o(1)]. Earlier work has the best approximation factors for more complex constraints and settings. For constraints that can be characterized as a solvable polytope, Chekuri et al. (2011) provide guarantees. For the online secretary setting, Gupta et al. (2010) provide guarantees.
  • 1 Introduction to Submodular Set Functions and Polymatroids
    CS 598CSC: Combinatorial Optimization. Lecture date: 4/8/2010. Instructor: Chandra Chekuri. Scribe: Bolin Ding.
    1 Introduction to Submodular Set Functions and Polymatroids. Submodularity plays an important role in combinatorial optimization. Given a finite ground set S, a set function f : 2^S → R is submodular if f(A) + f(B) ≥ f(A ∩ B) + f(A ∪ B) for all A, B ⊆ S; or equivalently, f(A + e) − f(A) ≥ f(B + e) − f(B) for all A ⊆ B and e ∈ S \ B. Another equivalent definition is that f(A + e1) + f(A + e2) ≥ f(A) + f(A + e1 + e2) for all A ⊆ S and distinct e1, e2 ∈ S \ A. Exercise: Prove the equivalence of the above three definitions. A set function f : 2^S → R is non-negative if f(A) ≥ 0 for all A ⊆ S. f is symmetric if f(A) = f(S \ A) for all A ⊆ S. f is monotone (non-decreasing) if f(A) ≤ f(B) for all A ⊆ B. f is integer-valued if f(A) ∈ Z for all A ⊆ S.
    1.1 Examples of submodular functions. Cut functions. Given an undirected graph G = (V, E) and a 'capacity' function c : E → R+ on edges, the cut function f : 2^V → R+ is defined as f(U) = c(δ(U)), i.e., the sum of capacities of edges between U and V \ U. f is submodular (also non-negative and symmetric, but not monotone). In an undirected hypergraph G = (V, E) with capacity function c : E → R+, the cut function is defined as f(U) = c(δ_E(U)), where δ_E(U) = {e ∈ E : e ∩ U ≠ ∅ and e ∩ (V \ U) ≠ ∅}.
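    A quick numerical check of the cut-function example above (toy graph and capacities of my own choosing, not from the notes): the cut function is submodular and symmetric but not monotone.

```python
from itertools import combinations

# Cut function of a small capacitated undirected graph:
# f(U) = total capacity of edges with exactly one endpoint in U.
EDGES = {(1, 2): 3.0, (2, 3): 1.0, (3, 4): 2.0, (1, 4): 4.0, (2, 4): 5.0}
V = {1, 2, 3, 4}

def f(U):
    return sum(c for (u, w), c in EDGES.items() if (u in U) != (w in U))

subsets = [set(c) for r in range(len(V) + 1) for c in combinations(V, r)]

# Submodularity: f(A) + f(B) >= f(A | B) + f(A & B) for all A, B.
print(all(f(A) + f(B) >= f(A | B) + f(A & B) for A in subsets for B in subsets))  # True
# Symmetry: f(U) = f(V \ U); the empty set and V both have zero cut.
print(f({1, 3}) == f({2, 4}), f(set()) == f(V) == 0.0)                            # True True
# Not monotone: adding nodes to U can decrease the cut value.
print(f({1}) > f(V))                                                              # True
```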
  • Partial Sublinear Time Approximation and Inapproximation for Maximum Coverage
    Partial Sublinear Time Approximation and Inapproximation for Maximum Coverage. Bin Fu, Department of Computer Science, University of Texas – Rio Grande Valley, Edinburg, TX 78539, USA ([email protected]).
    Abstract: We develop a randomized approximation algorithm for the classical maximum coverage problem, which, given a list of sets A_1, A_2, ..., A_m and an integer parameter k, selects k sets A_{i_1}, A_{i_2}, ..., A_{i_k} for maximum union A_{i_1} ∪ A_{i_2} ∪ ... ∪ A_{i_k}. In our algorithm, each input set A_i is a black box that can provide its size |A_i|, generate a random element of A_i, and answer the membership query (x ∈ A_i?) in O(1) time. Our algorithm gives a (1 − 1/e)-approximation for the maximum coverage problem in O(poly(k)m · log m) time, which is independent of the sizes of the input sets. No existing O(p(m)n^{1−ε})-time (1 − 1/e)-approximation algorithm for the maximum coverage has been found for any function p(m) that only depends on the number of sets, where n = max(|A_1|, ..., |A_m|) (the largest size of the input sets). The notion of a partial sublinear time algorithm is introduced. For a computational problem with input size controlled by two parameters n and m, a partial sublinear time algorithm for it runs in O(p(m)n^{1−ε}) time or O(q(n)m^{1−ε}) time. The maximum coverage problem has a partial sublinear time O(poly(m)) constant factor approximation since k ≤ m. On the other hand, we show that the maximum coverage problem has no partial sublinear O(q(n)m^{1−ε})-time constant factor approximation algorithm.
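    The black-box access model described in this abstract can be captured by a thin wrapper; the sketch below (an illustrative class of my own, backed by an in-memory set) exposes exactly the three O(1) operations listed: size, random element, and membership.

```python
import random

class BlackBoxSet:
    """Oracle access to an input set A_i: size, a uniformly random
    element, and membership queries, each in O(1) expected time."""
    def __init__(self, elements):
        self._items = list(elements)       # supports O(1) uniform sampling
        self._lookup = set(self._items)    # supports O(1) membership tests

    def size(self):
        return len(self._items)

    def random_element(self):
        return random.choice(self._items)

    def contains(self, x):
        return x in self._lookup

A1 = BlackBoxSet({"p1", "p2", "p3"})
print(A1.size(), A1.contains("p2"), A1.contains("q7"))   # 3 True False
```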
  • Concave Aspects of Submodular Functions
    Concave Aspects of Submodular Functions. Rishabh Iyer (University of Texas at Dallas, CSE Department, 2601 N. Floyd Rd. MS EC31, Richardson, TX 75083, email: [email protected]) and Jeff Bilmes (University of Washington, ECE Department, 185 E Stevens Way NE, Seattle, WA 98195, email: [email protected]).
    Abstract: Submodular functions are a special class of set functions, which generalize several information-theoretic quantities such as entropy and mutual information [1]. Submodular functions have subgradients and subdifferentials [2] and admit polynomial time algorithms for minimization, both of which are fundamental characteristics of convex functions. Submodular functions also show signs similar to concavity. Submodular function maximization, though NP hard, admits constant factor approximation guarantees, and concave functions composed with modular functions are submodular. In this paper, we try to provide a more complete picture of the relationship of submodularity with concavity. We characterize the superdifferentials and polyhedra associated with upper bounds and provide optimality conditions for submodular maximization using the …
    … This relationship is evident by the fact that submodular function minimization is polynomial time (akin to convex minimization). A number of recent results, however, make this relationship much more formal. For example, similar to convex functions, submodular functions have tight modular lower bounds and admit a sub-differential characterization [2]. Moreover, it is possible [11] to provide optimality conditions for submodular minimization, in a manner analogous to the Karush-Kuhn-Tucker (KKT) conditions from convex programming. In addition, submodular functions also admit a natural convex extension, known as the Lovász extension, which is easy to evaluate [12] and optimize.
  • (05) Polymatroids, Part 1
    An integer polymatroid P with ground set E is a bounded set of non-negative integer-valued vectors coordinatized by E such that P includes the origin; any non-negative integer vector is a member of P if it is ≤ some member of P; and for any non-negative integer vector y, every maximal vector x in P which is ≤ y (a "P-basis of y") has the same sum |x| ≡ x(E) ≡ ∑(x_j : j ∈ E), called the rank r(y) of vector y in P. Clearly, by definition, the 0–1 valued vectors in an integral polymatroid are the "incidence vectors" of the independent sets J (i.e., J ∈ F) of a matroid M = (E, F). For any vectors x and y in Z_+, x ∩ y denotes the maximal integer vector which is ≤ x and ≤ y; x ∪ y denotes the minimal integer vector which is ≥ x and ≥ y. Clearly 0 ≤ r(0) ≤ r(x) ≤ r(y) for x ≤ y ("non-neg & non-decreasing").
    Theorem 2: For any x and y in Z_+, r(x ∪ y) + r(x ∩ y) ≤ r(x) + r(y) ("submodular").
    Proof: Let v be a P-basis of x ∩ y. Extend it to a P-basis w of x ∪ y. By the definition of integer polymatroid, r(x ∩ y) = |v| and r(x ∪ y) = |w|. Then r(x) + r(y) ≥ |w ∩ x| + |w ∩ y| = |w ∩ (x ∩ y)| + |w ∩ (x ∪ y)| = |v| + |w| = r(x ∪ y) + r(x ∩ y). □
    For subsets S of E, define f_P(S) to be r(x), where x ∈ Z_+ is very large for coordinates in S and 0 for coordinates not in S.
  • On Bisubmodular Maximization
    On Bisubmodular Maximization. Ajit P. Singh, Andrew Guillory, Jeff Bilmes (University of Washington, Seattle; [email protected], [email protected], [email protected]).
    Abstract: Bisubmodularity extends the concept of submodularity to set functions with two arguments. We show how bisubmodular maximization leads to richer value-of-information problems, using examples in sensor placement and feature selection. We present the first constant-factor approximation algorithm for a wide class of bisubmodular maximizations.
    1 Introduction. Central to decision making is the problem of selecting informative observations, subject to constraints on the cost of data acquisition. For example: Sensor Placement: One is given a set of n locations, V, and a set of k ≤ n sensors, each of which can cover a fixed area around it. …
    … polynomial-time approximation algorithms exist when the matroid constraints consist only of a cardinality constraint [22], or under multiple matroid constraints [6]. Our interest lies in enriching the scope of selection problems which can be efficiently solved. We discuss a richer class of properties on the objective, known as bisubmodularity [24, 1], to describe biset optimizations:
        max_{A,B ⊆ V} f(A, B)    (2)
        subject to (A, B) ∈ I_i,  ∀ i ∈ {1, ..., r}.
    Bisubmodular function maximization allows for richer problem structures than submodular max: i.e., simultaneously selecting and partitioning elements into two groups, where the value of adding an element to a set can depend on interactions between the two sets. To illustrate the potential of bisubmodular maximization, we consider two distinct problems: …