
Journal of Industrial and Management Optimization
doi: 10.3934/jimo.2021122

AN EXACT ALGORITHM FOR STABLE INSTANCES OF THE k-MEANS PROBLEM WITH PENALTIES IN FIXED-DIMENSIONAL EUCLIDEAN SPACE

Fan Yuan
Department of Operations Research and Information Engineering, Beijing University of Technology, Beijing 100124, China

Dachuan Xu∗
Beijing Institute for Scientific and Engineering Computing, Beijing University of Technology, Beijing 100124, China

Donglei Du
Faculty of Management, University of New Brunswick, Fredericton, NB E3B 5A3, Canada

Min Li
School of Mathematics and Statistics, Shandong Normal University, Jinan 250014, China

(Communicated by Wenxun Xing)

Abstract. We study stable instances of the k-means problem with penalties in fixed-dimensional Euclidean space. An instance of the problem is called α-stable if it has a unique optimal solution and this solution remains unchanged when the distances and penalty costs are scaled by a factor of at most α. Stable instances of clustering problems have been used to explain why certain heuristic algorithms with poor theoretical guarantees perform quite well in practice. For any fixed ε > 0, we show that a common multi-swap local-search algorithm solves a (1 + ε)-stable instance of the k-means problem with penalties in fixed-dimensional Euclidean space exactly in polynomial time.

1. Introduction. For many optimization problems, certain well-known heuristic algorithms perform much better than their theoretical performance guarantees suggest. To explain this paradox, many existing works study stable instances, a concept introduced by Bilu and Linial [8] and Awasthi et al. [5]. An instance of a problem is called α-stable if its optimal solution is unique and remains unchanged even if the problem's input parameters are scaled by a factor of at most α. An instance of a clustering problem is called α-stable if it has a unique optimal solution that remains unchanged after the distances of the instance are scaled by at most α, where the distances between different pairs of points may be scaled differently.

2020 Mathematics Subject Classification. Primary: 90C27; Secondary: 68W25.
Key words and phrases. Local search, stable instance, k-means, approximation algorithm, fixed-dimensional Euclidean space.
∗ Corresponding author: Dachuan Xu.

The motivation to study stable instances is that, for some common clustering problems, certain well-known heuristics are in fact polynomial-time exact algorithms on α-stable instances, which offers one explanation of the aforementioned paradox: instances encountered in practice may very well be α-stable.

Before we formally introduce the problem studied in this work, we review the relevant literature. Clustering problems have been studied since the 1950s in many fields of science: biomedical engineering, statistical science, medical engineering, computer science, physical information engineering, and more. Different objective functions give rise to many different kinds of clustering problems; the most common one is the k-means problem, which has been studied by many scholars due to its wide applicability. In this problem we are given a set D of n points in d-dimensional Euclidean space R^d and an integer k. Our target is to pick a set of k points f_1, ..., f_k ∈ R^d and assign them as center points, so as to minimize the sum of squared distances from each point to its nearest center point. However, the time complexity of selecting the center points from the whole space R^d is too large, so a lot of work studies the discrete version of the k-means problem: when the centers must be chosen from a given finite set F, we call it a discrete k-means problem. Matoušek [22] showed that the k-means problem can be transformed into a discrete k-means problem with a small loss. It is well known that the discrete k-means problem is NP-hard for k = 2 in R^d when d is not a constant, and likewise for arbitrary k in R² [2, 11, 21]. There are quite a lot of research results on this problem in the literature. For arbitrary dimension, Kanungo et al. [17] gave a (9 + ε)-approximation local search algorithm, and the best ratio of 6.357 is given by Ahmadian et al. [1]. For fixed dimension, a PTAS was given independently by Friggstad et al. [15] and Cohen-Addad et al. [10].

Our problem is closely related to the k-means problem with penalties in fixed-dimensional Euclidean space, which is a natural generalization of the classic k-means problem. Formally, we are given a client point set D of n client points in d-dimensional Euclidean space R^d, a penalty cost p_j > 0 for every client point j ∈ D, and a positive integer k ≤ n that represents the size of the set of centers. Our target is to find a set of k points f_1, ..., f_k ∈ R^d to be the centers and a client subset P ⊆ D to be the penalized client set, so as to minimize the sum of squared distances from each client point in D\P to its nearest center plus the sum of penalty costs of the client points in P. Tseng [23] proposed the penalized and weighted k-means problem with uniform penalties. Zhang et al. [26] proposed the k-means problem with nonuniform penalties and gave a (25 + ε)-approximation local search algorithm for this problem. The best result on this problem so far is a (19.849 + ε)-approximation primal-dual algorithm, proposed by Feng et al. [13]. Li et al. [19] designed an approximation algorithm for it by initializing the first clustering. Based on this algorithm, Li [18] presented a bi-criteria algorithm for k-means with penalties, and Ji et al. [16] generalized the seeding algorithm to variants of the k-means problem with penalties.
The main model studied in this work is the discrete k-means problem with penalties (k-MPWP). In this problem we are given a center point set F in R^d and a data point set D in R^d. We now formally define what a stable instance is for this problem. Denote by (F, D, η, p) an instance of the k-means problem with penalties, where η is a metric distance on the points of F ∪ D. A center point set S ⊆ F with |S| = k together with a penalized client set P ⊆ D is a solution of this problem. For any two points i, j ∈ R^d, we use η(i, j) to denote the distance between them, and for every set S ⊆ F and every j ∈ D we write η(S, j) := min_{i∈S} η(i, j). The cost of a solution S is

    cost(S) := ∑_{j∈D\P} η(S, j)² + ∑_{j∈P} p_j.

For this problem, our target is to choose S ⊆ F so as to minimize cost(S). For any feasible center set S of the k-MPWP, the corresponding penalty set P is always chosen as P = {j ∈ D | p_j < η(S, j)²}, implying that the penalty set P is determined by the center set S. Hence if we set the penalty costs so that p_j > η(S, j)² for all points j ∈ D, the k-MPWP degenerates to the classic k-means problem; in this situation the penalty set P is empty.

Definition 1.1 (α-stability). For a given constant α ≥ 1, an instance I = (F, D, η, p) of the metric k-means problem with penalties is called α-stable if it has a unique optimal solution O, and O remains the unique optimal solution in every related instance I′ = (F, D, η′, p′) with η(i, j) ≤ η′(i, j) ≤ α · η(i, j) for all i, j ∈ F ∪ D and p_j ≤ p′_j ≤ α² · p_j for all j ∈ D. (The distance function η′ only needs to satisfy symmetry; it does not need to satisfy the triangle inequality.)

For a large number of clustering problems, stable instances have been studied with the goal of finding polynomial-time exact algorithms for small α. Awasthi et al. [6] showed that for a 3-stable instance of the k-means problem, the optimal solution can be found in polynomial time. A few years later, Balcan and Liang [7] showed the same for (1 + √2)-stable instances of the k-means problem, and Angelidakis et al. [3] for 2-metric stable instances. Friggstad et al. [14] studied stable instances of the k-means problem in fixed-dimensional Euclidean space and proved that for any fixed ε > 0, the optimal solution of a (1 + ε)-stable instance can be obtained in polynomial time.

In combinatorial optimization, a widely used technique is local search. It has been applied to many problems, such as k-means problems [15, 26], arc routing problems [20, 24], facility location problems [4, 9, 12, 25], and multicoloring hexagonal graphs [27]. In this work, we focus on the discrete k-means problem with penalties in fixed-dimensional Euclidean space. We show that for any fixed ε > 0, the optimal solution of a (1 + ε)-stable instance of this problem in R^d (with d a fixed constant) can be obtained in polynomial time using local search techniques. This extends the result of [14] for the k-means problem to the k-means problem with penalties.

The rest of this paper is organized as follows. First, in Section 2, we present the algorithm. Then, in Section 3, we present the analysis. Finally, in Section 4, we provide concluding remarks.
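Before presenting the algorithm, it may help to see the objective in code. The following is a minimal Python sketch (our illustration, not from the paper) of cost(S) under the rule P = {j ∈ D | p_j < η(S, j)²}; the names kmpwp_cost, eta, and p are ours, eta is assumed to be a symmetric distance function, and clients are assumed hashable so they can index the penalty map.

```python
def kmpwp_cost(S, D, eta, p):
    """Cost of a center set S in the k-means problem with penalties (k-MPWP).

    S: iterable of centers; D: iterable of clients; eta(i, j): distance
    between points i and j; p[j]: penalty cost of client j.  The penalty
    set is induced by S: client j is penalized exactly when its penalty
    cost is cheaper than its squared distance to the nearest center of S.
    """
    total = 0.0
    for j in D:
        d2 = min(eta(i, j) for i in S) ** 2  # eta(S, j)^2
        total += min(d2, p[j])               # pay p_j iff p_j < eta(S, j)^2
    return total
```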

2. Algorithm. Given a positive real number ε′ and a positive integer d, suppose we have a (1 + ε′)-stable instance (F, D, η, p) of the k-means problem with penalties in R^d. By the definition of stability, this instance has a unique optimal solution O ⊆ F, and O remains unchanged when the distances between any two points in F ∪ D are scaled (non-uniformly) by at most a factor of (1 + ε′). We use Algorithm 1 to solve this instance. Algorithm 1 below is a slightly modified ρ-swap local search algorithm: at every step, it performs the swap that brings the largest improvement. For the specific value of ρ, see Theorem 3.1.

Algorithm 1
1: Choose an arbitrary subset S of k centers from F
2: while ∃ S′ ⊆ F with |S′| = k, |S − S′| ≤ ρ and cost(S′) < cost(S)
3:   do S ← arg min_{S′ ⊆ F, |S′| = k, |S − S′| ≤ ρ} cost(S′)
4: return S
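In code, the loop can be sketched as follows. This is a brute-force Python illustration of ours (not the authors' implementation), reusing the hypothetical kmpwp_cost helper sketched at the end of Section 1; it enumerates all swaps of size at most ρ, which is only practical for tiny instances.

```python
from itertools import combinations

def rho_swap_local_search(F, D, eta, p, k, rho):
    """Algorithm 1 (sketch): repeatedly move to the cheapest solution S'
    with |S'| = k and |S - S'| <= rho, until no such swap improves cost."""
    S = set(list(F)[:k])  # step 1: an arbitrary initial set of k centers
    while True:
        best_S, best_cost = S, kmpwp_cost(S, D, eta, p)
        for t in range(1, rho + 1):                      # all swap sizes 1..rho
            for out in combinations(S, t):               # centers to remove
                for inn in combinations(set(F) - S, t):  # centers to add
                    cand = (S - set(out)) | set(inn)
                    c = kmpwp_cost(cand, D, eta, p)
                    if c < best_cost:                    # arg min over neighborhood
                        best_S, best_cost = cand, c
        if best_S is S:   # no improving swap: local optimum reached
            return S
        S = best_S
```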

We use O ⊆ F to denote the unique optimal solution of our problem. Every set S ⊆ F with |S| = k is regarded as a feasible solution. Next we introduce some notation.
• For all j ∈ D, we use σ(S, j) to denote the center point in S closest to j, and σ(O, j) to denote the center point in O closest to j.
• Denote X_S = {j ∈ D : σ(O, j) ∈ O − S and σ(S, j) ∈ S − O}.
• Denote ϕ(S) = ∑_{j∈X_S} (η(j, σ(S, j))² + η(j, σ(O, j))²).
We also need the following concept.

Definition 2.1. A subset S ⊆ F of cardinality |S| = k is called a good enough solution if cost(O) + 2ε · ϕ(S) ≥ cost(S).
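For concreteness, a hedged Python sketch of these quantities (the names sigma, phi, and good_enough are ours; kmpwp_cost is the hypothetical helper from Section 1, and eta is assumed symmetric):

```python
def sigma(S, j, eta):
    """Center in S closest to client j."""
    return min(S, key=lambda i: eta(i, j))

def phi(S, O, D, eta):
    """phi(S): summed squared distances to both solutions, over the set X_S
    of clients whose nearest centers lie in S - O and O - S respectively."""
    S, O = set(S), set(O)
    total = 0.0
    for j in D:
        s, o = sigma(S, j, eta), sigma(O, j, eta)
        if s in S - O and o in O - S:  # j belongs to X_S
            total += eta(j, s) ** 2 + eta(j, o) ** 2
    return total

def good_enough(S, O, D, eta, p, eps):
    """Definition 2.1: cost(O) + 2*eps*phi(S) >= cost(S)."""
    return (kmpwp_cost(O, D, eta, p) + 2 * eps * phi(S, O, D, eta)
            >= kmpwp_cost(S, D, eta, p))
```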

3. Analysis. First, we show in Section 3.1 that Algorithm 1 always returns a good enough solution. Then, we show in Section 3.2 that a good enough solution is also the unique optimal solution of a stable instance. Finally, we show in Section 3.3 that on stable instances Algorithm 1 stops in polynomial time.

3.1. Partition. The analysis of a local search algorithm needs a well-structured partition scheme corresponding to the swap operation, together with a proper assignment of each data point, in order to connect the global optimal solution with the locally optimal solution. In our partition scheme, O is the global optimal solution. For any subset S ⊆ F of cardinality |S| = k, we define S′ = S − O and O′ = O − S. Extending the technique of Friggstad et al. [14], we can obtain a partition of O′ ∪ S′ with which to analyze the solution of any stable instance of the k-means problem with penalties. We now recall the partition technique of [14]. Define the following notation:

    D_i = η(i, S),  if i ∈ O′;
          η(i, O),  if i ∈ S′;
          0,        if i ∈ S ∩ O.

First, we make S′ and O′ sparse using the following steps. Let S′₀ = ∅; then check each i ∈ S′ in increasing order of D_i: if η(i, S′₀) > ε · D_i, add i to S′₀; otherwise, do not add i to S′₀. Apply the same procedure to O′ to obtain O′₀. Next, for every i ∈ S′₀ we use φ(i) to denote the center point in O′₀ nearest to i, and for every i ∈ O′₀ we define φ(i) to be the center point in S′₀ nearest to i.
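This sparsification is a one-pass greedy filter; a hedged Python sketch (the names sparsify and D_val are ours, where D_val(i) is assumed to return the quantity D_i defined above):

```python
def sparsify(C, D_val, eta, eps):
    """Greedy sparsification of a center set C (either S' or O'): scan
    centers in increasing order of D_i and keep i only if it is farther
    than eps * D_i from every center kept so far."""
    kept = []
    for i in sorted(C, key=D_val):  # increasing D_i
        if not kept or min(eta(i, x) for x in kept) > eps * D_val(i):
            kept.append(i)
    return kept  # the sparse set S'_0 (resp. O'_0)
```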

Finally, for every i ∈ S′₀ ∪ O′₀ with φ⁻¹(i) ≠ ∅, we use cent(i) to denote the center point in φ⁻¹(i) nearest to i. Define

    T = {(cent(i), i) : i ∈ φ(O′₀) and ε · η(i, cent(i)) ≤ D_i}.

From Lemma 3 of [15] and from [14], one can prove that if A ⊆ O′ ∪ S′ satisfies A ∩ {cent(i), i} ≠ ∅ for all (cent(i), i) ∈ T, then for all points i′ ∈ O′ ∪ S′ we have η(i′, A) ≤ 5 · D_{i′}. Define

    N = {(i*, i) ∈ O′ × S′ : η(i, i*) ≤ D_i/ε and D_{i*} ≥ D_i}.

According to the definition of T, T must exist and is not empty. N can be empty, but when N is empty it has no effect on the third item of Theorem 3.1. More details about T and N can be found in [14, 15]. For any two sets U, W ⊆ F, we use U△W = (U ∪ W) − (U ∩ W) to denote the symmetric difference in F.

Theorem 3.1 (Structure Theorem [14, 15]). For any ε > 0, there exist a randomized algorithm and a constant ρ = ρ(ε, d) = 3² · (2d)^{8d} · ε^{−36·d/ε} such that the algorithm finds a partition π of O′ ∪ S′ with the following properties.
• For all W ∈ π, |W ∩ O′| = |W ∩ S′| ≤ ρ.
• For all W ∈ π, (S′△W) ∩ {i, i*} ≠ ∅ for every pair (i, i*) ∈ T.
• For all (i, i*) ∈ N, Pr[i and i* lie in different parts of π] ≤ ε.

This ρ is also the swap size ρ in Algorithm 1, and the partition π corresponds to the ρ-swaps of Algorithm 1. More details about ρ can be found in [15]. From Theorem 3.1 we obtain a partition of O′ ∪ S′. The cost change Δ^W_j of a part W ∈ π with respect to S can be calculated as follows:

    Δ^W_j = η((S − W ∩ S) ∪ (W ∩ O), j)² − η(S, j)².

When we swap a part W ∈ π, we obtain a new solution S_new = (S − W ∩ S) ∪ (W ∩ O), and

    Δ^W_j = η(S_new, j)² − η(S, j)².

The symbol Δ^W_j represents the change in the squared distance from j to its nearest center point between S and S_new. We use this partition scheme to prove the following Theorem 3.2.

Theorem 3.2. For any S ⊆ F with |S| = k, suppose that cost(O) + ε · ϕ(S) < cost(S). Then there exists some S∼ ⊆ F with |S∼| = k and |S − S∼| ≤ ρ such that

    cost(S) + (ε · ϕ(S) + cost(O) − cost(S)) / k ≥ cost(S∼).

Proof. From Theorem 3.1, we can obtain a random partition π of O′ ∪ S′. If we can estimate the cost of the swap S → S△W for each part W ∈ π, then we can estimate E_π[∑_{W∈π} (cost(S△W) − cost(S))]. To do this, we reassign every j ∈ D according to its situation. First define C*_j = η(O, j)² and C_j = η(S, j)². Let P denote the penalty set of the solution S and P* the penalty set of the global optimal solution O. We partition D into four subsets and give the reassignment scheme for j in each situation:

Case 1: E1 := P ∩ P*. This is the set of points chosen as penalty points in both O and S. For each j ∈ P ∩ P*, among all the parts of π, we

assign j as a penalty point. So the upper bound on j's cost change in this case is

    ∑_{W∈π} Δ^W_j = p_j − p_j = 0.

Case 2: E2 := (D\P) ∩ (D\P*). This is the set of points that are not chosen as penalty points in either O or S. This case was treated in [14, Theorem 5]; we add some explanation and reformulate it here. For each j ∈ (D\P) ∩ (D\P*), we provide upper bounds on j's cost change in four sub-cases.

Case 2.1: For all j with σ(S, j), σ(O, j) ∈ S ∩ O, the upper bound is C*_j − C_j, because σ(S, j) never changes after any swap.

Case 2.2: For all j with σ(S, j) ∈ S′ and σ(O, j) ∈ S ∩ O, the upper bound is C*_j − C_j, because when exchanging the part W with σ(S, j) ∈ W we can reassign j to σ(O, j), and we never move j in any other part W′ ≠ W.

Case 2.3: For all j with σ(S, j) ∈ S ∩ O and σ(O, j) ∈ O′, the upper bound is C*_j − C_j, because when exchanging the part W with σ(O, j) ∈ W we can reassign j to σ(O, j), and we never move j in any other part W′ ≠ W.

Case 2.4: For all j with σ(S, j) ∈ S′ and σ(O, j) ∈ O′: these are exactly the points j ∈ X_S. Following an analysis similar to Friggstad et al. [14], the upper bound for this case is C*_j − C_j + ε · (C*_j + C_j).

Overall, for Case 2 the upper bound on j's cost change is

    ∑_{W∈π} Δ^W_j ≤ C*_j − C_j + ε · (C*_j + C_j).

Case 3: E3 := (D\P) ∩ P*. This is the set of points chosen as penalty points in O but not in S. For every j ∈ (D\P) ∩ P*, when σ(S, j) is swapped out, in some part W₁, we assign j as a penalty point. So the upper bound on j's cost change in this situation is

    Δ^{W₁}_j ≤ p_j − η(S, j)².

In the other swaps, σ(S, j) remains open after the swap, so we can keep j assigned to σ(S, j) and j's cost change is zero. In total, we have ∑_{W∈π} Δ^W_j ≤ p_j − η(S, j)².

Case 4: E4 := (D\P*) ∩ P. This is the set of points chosen as penalty points in S but not in O. For every j ∈ (D\P*) ∩ P, when σ(O, j) is swapped in, in some part W₂, we reassign j to σ(O, j). So the upper bound on j's cost change in this situation is

    Δ^{W₂}_j ≤ η(O, j)² − p_j.

In the other swaps, we keep j as a penalty point, so its cost change is zero. In total, we have ∑_{W∈π} Δ^W_j ≤ η(O, j)² − p_j.

The four subsets of D are related as follows, by the properties of intersection and union of sets.

Claim 3.3. E2 ∪ E4 = D\P*; E2 ∪ E3 = D\P; E1 ∪ E3 = P*; E1 ∪ E4 = P.

As mentioned before, we want to estimate E_π[∑_{W∈π} (cost(S△W) − cost(S))]. We consider the cost change of the swap operation S → S△W for every W ∈ π. Note that the partition in Theorem 3.1 is a random partition, so we bound the expectation over the random choice of π:

    E_π[∑_{j∈D} ∑_{W∈π} Δ^W_j]
      ≤ ∑_{j∈E2} (η(O, j)² − η(S, j)²) + ε · ϕ(S) + ∑_{j∈E1} (p_j − p_j) + ∑_{j∈E3} (p_j − η(S, j)²) + ∑_{j∈E4} (η(O, j)² − p_j)
      = ∑_{j∈E2} η(O, j)² − ∑_{j∈E2} η(S, j)² + ∑_{j∈E1} p_j − ∑_{j∈E1} p_j + ∑_{j∈E3} p_j − ∑_{j∈E3} η(S, j)² + ∑_{j∈E4} η(O, j)² − ∑_{j∈E4} p_j + ε · ϕ(S)
      ≤ ∑_{j∈D\P*} η(O, j)² − ∑_{j∈D\P} η(S, j)² + ∑_{j∈P*} p_j − ∑_{j∈P} p_j + ε · ϕ(S),

where the last inequality follows from Claim 3.3. Finally we get

    E_π[∑_{W∈π} (cost(S△W) − cost(S))] ≤ ε · ϕ(S) + cost(O) − cost(S).

Hence we can find some π and some W ∈ π such that

    cost(S△W) − cost(S) ≤ (ε · ϕ(S) + cost(O) − cost(S)) / |π| ≤ (ε · ϕ(S) + cost(O) − cost(S)) / k,

where the last inequality is based on two facts: (1) the numerator is negative under the assumption of Theorem 3.2; and (2) |π| ≤ k, because π is a partition of O′ ∪ S′. Taking S∼ = S△W completes the proof.

Theorem 3.2 implies that we get a good enough solution when Algorithm 1 stops: if the returned S were not good enough, then cost(O) + ε · ϕ(S) ≤ cost(O) + 2ε · ϕ(S) < cost(S), and Theorem 3.2 would yield an improving ρ-swap, contradicting the termination of Algorithm 1.

3.2. Good enough solutions are optimal. Next we prove that if S is a good enough solution, then S is also the unique optimal solution of the stable instance.

Proof. Suppose we are given an instance I = (F, D, η, p) that is (1 + ε′)-stable for our problem. We use S to denote a good enough solution for this instance and O to denote the optimal solution of I. Next we define distances η′(i, j) and penalty costs p′_j for all i ∈ F, j ∈ D:

    η′(i, j) = (1 + ε′) · η(i, j),  if i ≠ σ(S, j);
               η(i, j),             otherwise.

    p′_j = p_j,               if j ∈ P;
           (1 + ε′)² · p_j,   otherwise.

Thus we have a new scaled instance I′ = (F, D, η′, p′) of our problem. Because the instance I = (F, D, η, p) is (1 + ε′)-stable, O is also the unique optimal solution of the new instance I′ = (F, D, η′, p′). For all S ⊆ F with |S| = k, we define

    cost′(S) := ∑_{j∈D\P} η′(S, j)² + ∑_{j∈P} p′_j,

the cost of S under the distances η′(i, j) and penalty costs p′_j. In the following, we prove S = O. First we divide the points in D into the following four parts.
(1) E1 = P ∩ P*.
(2) E2 = (D\P) ∩ (D\P*), which is further partitioned into four parts:
    X1 = {j ∈ D : σ(S, j) ∈ S − O and σ(O, j) ∈ S ∩ O},
    X2 = {j ∈ D : σ(S, j) ∈ S ∩ O and σ(O, j) ∈ O − S},
    X3 = {j ∈ D : σ(S, j), σ(O, j) ∈ S ∩ O},
    X4 = {j ∈ D : σ(S, j) ∈ S − O and σ(O, j) ∈ O − S}.
(3) E3 = (D\P) ∩ P*.
(4) E4 = (D\P*) ∩ P.
For convenient representation we write C*_j = η(j, σ(O, j))² and C_j = η(j, σ(S, j))². By the definition of the instance I′ = (F, D, η′, p′) we have the following equations:

    cost′(S) = cost(S) = ∑_{j∈E2} C_j + ∑_{j∈E3} C_j + ∑_{j∈E1} p_j + ∑_{j∈E4} p_j,
    cost(O) = ∑_{j∈E2} C*_j + ∑_{j∈E4} C*_j + ∑_{j∈E1} p_j + ∑_{j∈E3} p_j.

We also have

    cost′(O) = ∑_{j∈X1} (1 + ε′)² · C*_j + ∑_{j∈X2} min{(1 + ε′)² · C*_j, C_j} + ∑_{j∈X3} C*_j
             + ∑_{j∈X4} (1 + ε′)² · C*_j + ∑_{j∈E4} (1 + ε′)² · C*_j + ∑_{j∈E1} p_j + ∑_{j∈E3} (1 + ε′)² · p_j.

Note that

    ∑_{j∈X4} C_j ≤ ∑_{j∈X1} C_j + ∑_{j∈X4} C_j + ∑_{j∈E3} C_j + ∑_{j∈E4} p_j
                 = cost(S) − ∑_{j∈X2} C_j − ∑_{j∈X3} C_j − ∑_{j∈E1} p_j.

Then, since the penalty costs coincide for j ∈ E1, and C*_j ≤ C_j for j ∈ X2 and C*_j = C_j for j ∈ X3, we have

    ∑_{j∈X4} C_j ≤ cost(S) − ∑_{j∈X2} C*_j − ∑_{j∈X3} C*_j − ∑_{j∈E1} p_j.

By the definition of a good enough solution, we have cost(O) + 2ε · ϕ(S) ≥ cost(S).

Then we have

    ∑_{j∈X4} C_j ≤ cost(O) + 2ε (∑_{j∈X4} C*_j + ∑_{j∈X4} C_j) − (∑_{j∈X2} C*_j + ∑_{j∈X3} C*_j + ∑_{j∈E1} p_j)
                 = ∑_{j∈X1} C*_j + ∑_{j∈X4} C*_j + 2ε (∑_{j∈X4} C*_j + ∑_{j∈X4} C_j) + ∑_{j∈E4} C*_j + ∑_{j∈E3} p_j.

Rearranging the above inequality, we obtain

    ∑_{j∈X4} C_j ≤ 1/(1 − 2ε) · (∑_{j∈X1} C*_j + (1 + 2ε) ∑_{j∈X4} C*_j + ∑_{j∈E3} p_j + ∑_{j∈E4} C*_j).

Since ε is small, we have

    ∑_{j∈X4} C_j ≤ (1 + 6ε) (∑_{j∈X1} C*_j + ∑_{j∈X4} C*_j + ∑_{j∈E3} p_j + ∑_{j∈E4} C*_j).

Now we prove that cost(S) ≤ cost′(O).

    cost(S) ≤ cost(O) + 2ε (∑_{j∈X4} C*_j + ∑_{j∈X4} C_j)
            ≤ cost(O) + 2ε ∑_{j∈X4} C*_j + 2ε(1 + 6ε) (∑_{j∈X1} C*_j + ∑_{j∈X4} C*_j + ∑_{j∈E3} p_j + ∑_{j∈E4} C*_j)
            = ∑_{j∈E2} C*_j + ∑_{j∈E4} C*_j + ∑_{j∈E1} p_j + ∑_{j∈E3} p_j + 2ε ∑_{j∈X4} C*_j
              + 2ε(1 + 6ε) (∑_{j∈X1} C*_j + ∑_{j∈X4} C*_j + ∑_{j∈E3} p_j + ∑_{j∈E4} C*_j)
            = [1 + 2ε(1 + 6ε)] ∑_{j∈X1} C*_j + ∑_{j∈X2} C*_j + ∑_{j∈X3} C*_j + [1 + 2ε + 2ε(1 + 6ε)] ∑_{j∈X4} C*_j
              + [1 + 2ε(1 + 6ε)] ∑_{j∈E4} C*_j + ∑_{j∈E1} p_j + [1 + 2ε(1 + 6ε)] ∑_{j∈E3} p_j
            ≤ (1 + 6ε) ∑_{j∈X1} C*_j + ∑_{j∈X2} min{(1 + 6ε) · C*_j, C_j} + ∑_{j∈X3} C*_j + (1 + 6ε) ∑_{j∈X4} C*_j
              + (1 + 6ε) ∑_{j∈E4} C*_j + ∑_{j∈E1} p_j + (1 + 6ε) ∑_{j∈E3} p_j.

We pick ε such that (1 + 6ε) = (1 + ε′)². Comparing term by term with the expression for cost′(O) above, we finally have cost′(O) ≥ cost(S) = cost′(S). Since the instance I = (F, D, η, p) is (1 + ε′)-stable and O is its unique optimal solution, O is also the unique optimal solution of the instance I′ = (F, D, η′, p′) with distances η′(i, j) and penalty costs p′_j; hence cost′(O) ≤ cost′(S). Combining the two inequalities gives cost′(O) = cost′(S), and the uniqueness of the optimal solution of I′ shows that S = O.

3.3. Polynomial time. Similar to Friggstad et al. [14], Algorithm 1 can be shown to terminate with a good enough solution in a polynomial number of steps.

Proof. Suppose S ⊂ F with |S| = k is not a good enough solution, and let S∼ be the set given by Theorem 3.2. Since S is not a good enough solution, we have

    cost(O) + 2ε · ϕ(S) < cost(S).

For S∼ we have

    cost(S∼) − cost(O) ≤ (ε · ϕ(S) + cost(O) − cost(S))/k + cost(S) − cost(O)

                       < −(cost(S) − cost(O))/(2k) + cost(S) − cost(O) = (1 − 1/(2k)) · (cost(S) − cost(O)).

Next we assume that the coordinates of all points in F ∪ D are integers. With this assumption, we can easily bound the number of iterations of Algorithm 1. Let ∇ = max_{i∈F, j∈D} η(i, j)². Obviously we have n · ∇ ≥ cost(S) − cost(O); this inequality holds because only n points are summed in the cost function. Also, for all S ⊂ F with |S| = k, cost(S) is an integer, and ln ∇ is polynomial in the input size. From the above we know that if S is not a good enough solution, then we can find a solution S∼ with cost(S∼) < cost(S); therefore Algorithm 1 can stop only at a good enough solution. Now we prove that Algorithm 1 reaches a good enough solution within 2k · ln(n∇) iterations. For contradiction, suppose that Algorithm 1 has still not encountered a good enough solution after K = ⌈2k · ln(n∇)⌉ iterations. Let S_0, S_1, ..., S_K be the first sets produced by the algorithm, where S_0 denotes the initial set. For 0 ≤ i < K, Algorithm 1 selects the swap that brings the largest benefit at every step, so

    cost(S_{i+1}) − cost(O) ≤ (1 − 1/(2k)) · (cost(S_i) − cost(O)).

Therefore,

    cost(S_K) − cost(O) ≤ (1 − 1/(2k))^K · (cost(S_0) − cost(O)) ≤ (1 − 1/(2k))^K · n∇ < 1.
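The last inequality is standard arithmetic; for completeness, a short worked version (our addition), using 1 − x < e^{−x} for x > 0 and K ≥ 2k · ln(n∇):

```latex
\left(1-\frac{1}{2k}\right)^{K} \cdot n\nabla
  \;<\; e^{-K/(2k)} \cdot n\nabla
  \;\le\; e^{-\ln(n\nabla)} \cdot n\nabla
  \;=\; 1.
```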

We have already stipulated that the cost of any solution is an integer, so cost(S_K) − cost(O) < 1 forces cost(S_K) = cost(O), contradicting the assumption that S_K is not yet a good enough solution. This proves that Algorithm 1 stops within a polynomial number of steps. Observe that each step of Algorithm 1 takes |F|^{O(ρ)} time. Under the discrete k-means setting, using the technique of [22], we may take |F| of size O(n · ε^{−d} · log(1/ε)). Since ρ(ε, d) = 3² · (2d)^{8d} · ε^{−36·d/ε} is a constant in the fixed-dimensional Euclidean setting, |F|^{O(ρ)} is polynomial in n. We have also proved that under the stability condition a good enough solution equals the unique optimal solution. In summary, Algorithm 1 stops in polynomial time and returns the unique optimal solution. Therefore, we have shown that a natural multi-swap local-search algorithm (Algorithm 1) finds the unique optimal solution of a (1 + ε)-stable instance of the k-means problem with penalties, and that Algorithm 1 stops in polynomial time.

4. Conclusion. In this paper, we study stable instances of the k-means problem with penalties. Using the technique of multi-swap local search, we prove that a stable instance of the k-means problem with penalties in R^d can be solved exactly in polynomial time. Our main contributions are two new ideas. First, we provide a new reassignment scheme for each point in the algorithm analysis, together with an upper bound on the cost change of each point. Second, we construct a scaled (1 + ε′)-stable instance, which allows us to show that a good enough solution of the stable instance equals the unique optimal solution. While our algorithm runs in polynomial time for all stable instances of the k-means problem with penalties, improving the time complexity is an interesting direction for future research.

Acknowledgments. The first two authors are supported by National Natural Sci- ence Foundation of China (No. 11871081) and Beijing Natural Science Foundation Project No. Z200002. The third author is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) grant 06446, and Natural Science Foundation of China (Nos. 11771386, 11728104). The fourth author is supported by Higher Educational Science and Technology Program of Shandong Province (No. J17KA171) and Natural Science Foundation of Shandong Province (No. ZR2020MA029) of China.

REFERENCES

[1] S. Ahmadian, A. Norouzi-Fard, O. Svensson and J. Ward, Better guarantees for k-means and Euclidean k-median by primal-dual algorithms, 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), (2017), 61–72.
[2] D. Aloise, A. Deshpande, P. Hansen and P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Machine Learning, 75 (2009), 245–248.
[3] H. Angelidakis, K. Makarychev and Y. Makarychev, Algorithms for stable and perturbation-resilient problems, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), (2017), 438–451.
[4] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala and V. Pandit, Local search heuristics for k-median and facility location problems, SIAM J. Comput., 33 (2004), 544–562.
[5] P. Awasthi, A. Blum and O. Sheffet, Stability yields a PTAS for k-median and k-means clustering, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), (2010), 309–318.
[6] P. Awasthi, A. Blum and O. Sheffet, Center-based clustering under perturbation stability, Inform. Process. Lett., 112 (2012), 49–54.
[7] M. F. Balcan and Y. Liang, Clustering under perturbation resilience, SIAM J. Comput., 45 (2016), 102–155.
[8] Y. Bilu and N. Linial, Are stable instances easy?, Combin. Probab. Comput., 21 (2012), 643–660.
[9] M. Charikar and S. Guha, Improved combinatorial algorithms for the facility location and k-median problems, 40th Annual Symposium on Foundations of Computer Science (FOCS), (1999), 378–388.
[10] V. Cohen-Addad, P. N. Klein and C. Mathieu, Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics, SIAM J. Comput., 48 (2019), 644–667.
[11] P. Drineas, A. Frieze, R. Kannan, S. Vempala and V. Vinay, Clustering large graphs via the singular value decomposition, Machine Learning, 56 (2004), 9–33.
[12] D. Du, X. Wang and D. Xu, An approximation algorithm for the k-level capacitated facility location problem, J. Comb. Optim., 20 (2010), 361–368.
[13] Q. Feng, Z. Zhang, F. Shi and J. Wang, An improved approximation algorithm for the k-means problem with penalties, Proceedings of FAW, (2019), 170–181.

[14] Z. Friggstad, K. Khodamoradi and M. R. Salavatipour, Exact algorithms and lower bounds for stable instances of Euclidean k-means, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), (2019), 2958–2972.
[15] Z. Friggstad, M. Rezapour and M. R. Salavatipour, Local search yields a PTAS for k-means in doubling metrics, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), (2016), 365–374.
[16] S. Ji, D. Xu, L. Guo, M. Li and D. Zhang, The seeding algorithm for spherical k-means clustering with penalties, J. Comb. Optim., (2020, Accepted).
[17] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman and A. Y. Wu, A local search approximation algorithm for k-means clustering, Comput. Geom., 28 (2004), 89–112.
[18] M. Li, The bi-criteria seeding algorithms for two variants of k-means problem, J. Comb. Optim., (2020, Accepted).
[19] M. Li, D. Xu, J. Yue, D. Zhang and P. Zhang, The seeding algorithm for k-means problem with penalties, J. Comb. Optim., 39 (2020), 15–32.
[20] A.-Y. Liang and D. Lin, Crossover iterated local search for SDCARP, J. Oper. Res. Soc. China, 2 (2014), 351–367.
[21] M. Mahajan, P. Nimbhorkar and K. Varadarajan, The planar k-means problem is NP-hard, Proceedings of WALCOM, 5431 (2009), 274–285.
[22] J. Matoušek, On approximate geometric k-clustering, Discrete Comput. Geom., 24 (2000), 61–84.
[23] G. C. Tseng, Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data, Bioinformatics, (2007), 2247–2255.
[24] H. Yang, F. Li, D. Yu, Y. Zou and J. Yu, Reliable data storage in heterogeneous wireless sensor networks by jointly optimizing routing and storage node deployment, Tsinghua Science and Technology, 26 (2021), 230–238.
[25] D. Ye, L. Mei and Y. Zhang, Strategy-proof mechanism for obnoxious facility location on a line, Proceedings of COCOON, 9198 (2015), 45–56.
[26] D. Zhang, C. Hao, C. Wu, D. Xu and Z. Zhang, Local search approximation algorithms for the k-means problem with penalties, J. Comb. Optim., 37 (2019), 439–453.
[27] Y. Zhang, F. Y. L. Chin and H. Zhu, A 1-local asymptotic 13/9-competitive algorithm for multicoloring hexagonal graphs, Algorithmica, 54 (2009), 557–567.

Received June 2020; 1st revision October 2020; 2nd revision May 2021; early access July 2021.

E-mail address: [email protected]
E-mail address: [email protected]
E-mail address: [email protected]
E-mail address: [email protected]