EWCG 2006, Delphi, March 27–29, 2006

In-Place Randomized Slope Selection

Henrik Blunck∗ Jan Vahrenhold†

Abstract In addition to theoretical considerations, one reason for investigating space-efficient is that they Slope selection, i.e. selecting the slope with rank k have the potential of using the different stages of hier- n among all 2 lines induced by a collection of points, archical memory, e.g., caches, to a much higher degree P results in a widely used robust estimator for line- of efficiency. Another motivation, especially for de- fitting. In this paper, we demonstrate that it is possi- signing algorithms for statistical data analysis, comes ble to perform slope selection in expected (n log2 n) from the recently increased interest in sensor networks O · time using only constant extra space in addition to the where small-scale computing devices are used to col- space needed for representing the input. lect large amounts of data. Since the memory of such sensor devices usually is very limited, and since trans- 1 Introduction mitting data is much more costly than local computa- tion, it is desirable to process as much data as possible Computing a line estimator, i.e., fitting a line to a locally before transmitting (intermediate) results. collection of n data points p1,...,pn in the plane is a frequentP task in statistical{ analysis.} A frequently Our Results In this paper we show how to solve used robust line estimator is the so-called Theil-Sen the slope selection problem in expected optimal estimator (see [13] and the references therein) which (n log2 n) time while at the same time using only n O considers all 2 lines induced by the points in and constant extra space. Our follows the ap- selects the line with median slope. This problemP is proach of Matouˇsek [12], and, in the course of im- also known as the (median) slope selection problem plementing his algorithm in-place, we also devise an and has been shown to exhibit an Ω(n log2 n) lower in-place variant of the so-called randomized interpola- bound [6]. Several deterministic algorithms· for solv- tion search technique. This variant, together with an ing this problem in optimal (n log2 n) running time algorithmic subroutine for efficiently constructing and have been presented [5, 6, 10],O however,· as noted by storing a set of randomly sampled intersections, is of Matouˇsek et al. [13], they are based on relatively com- independent interest, since it can be used as a substi- plicated concepts such as parametric search, sorting tute for Megiddo’s parametric search technique [14]. network, expander graphs, or cuttings. More practical approaches have resulted in randomized algorithms 2 Randomized Interpolation Search with expected (n log2 n) running time [7, 12, 15]. O · In the following, the slope selection problem is studied The Model The goal of investigating space-efficient in the dual setting where each point (x, y) is identified with the line (ξ,υ) υ = x ξ y and vice versa. algorithms is to design algorithms that use very little { | · − } extra space in addition to the space used for represent- Selecting the k-th smallest slope is dual to the follow- ing the input. The input is assumed to be stored in ing problem: Given a set of n lines in the plane, find an array A of size n, thereby allowing random access. the k-th leftmost intersection point induced by the We assume that a constant size memory can hold a arrangement of the lines. Since the duality transform constant number of words. Each word can hold one can be performed in an implicit way, we will assume that our input is given as a set of lines in the plane. pointer, or an (log2 n) bit integer, and a constant P 2 number of wordsO can hold one element of the input Since it is infeasible to compute all Θ(n ) intersec- tions induced by , the algorithm of Matouˇsek [12] array. An in-place algorithms uses (1) extra words 2 O maintains a verticalP strip b,e := [b,e] IR IR that of memory. Recently, a number of in-place algorithms h i × ⊂ have been designed for solving geometric problems— is guaranteed to contain the k-th smallest intersection see, e.g., [1, 2, 3, 4, 16]. point. For a parameter r (to be defined later), the al- gorithm first constructs a sample of size r drawn R ∗Westf¨alische Wilhelms-Universit¨at M¨unster, Institut f¨ur from the intersections inside b,e . It then selects two Informatik, Einsteinstr. 62, 48149 M¨unster, Germany. E-mail: intersections from whose xh-coordinatesi are used to [email protected] construct a (narrower)R candidate strip b0,e0 . The al- †Westf¨alische Wilhelms-Universit¨at M¨unster, Institut f¨ur gorithm then checks whether b0,e0 indeedh i contains Informatik, Einsteinstr. 62, 48149 M¨unster, Germany. E-mail: h i [email protected] the k-th smallest intersection point. If this is not the

177 22nd European Workshop on , 2006

case, the process is repeated for b,e but using a new exactly the number of inversions between the permu- h i sample , otherwise the algorithm iterates with the tation of that arranges the lines in sorted 0, we can compute in (r) puted so far in an in-place setting, the algorithm is O time an [θlo,θhi], such that, with probability divided into three phases: During the first phase, we 1 1/Ω(√r), the k-th smallest element of Θ lies within process the lines stored in A[0,...,n/2 1] and use − this interval, and the number of elements in Θ that A[n/2,...,n 1] to encode the ranks and− the inter- lie within the interval is at most N/Ω(√r). sections found− so far. We then reverse the roles of 2 both subarrays, and finalize the algorithm with a third The above lemma will be applied for N (n ). phase that processes the intersections induced by lines Furthermore, Matouˇsek et al. [13] proved that∈ we O may stored in different halves of the array. choose r := nβ for any 0 <β< 1 without affecting the asymptoticd e efficiency of the resulting algorithm, and thus we will set r := √n . As we will see below, Three In-Place Data Structures For maintaining d e more than (1) numbers or indices, we resort to a our in-place algorithm will be working with binary O encoded numbers, and thus accessing a single num- standard technique in the design of space-efficient al- gorithms, namely to encode a single bit by a permuta- ber θi will take (log2 n) time. Thus, the in-place version of the algorithmO implied by Lemma 1 runs in tion of two objects q and r: For lines q, r with q

178 EWCG 2006, Delphi, March 27–29, 2006

Storing Lines Involved in Intersections We use Algorithm 1 CountAndRecord(A1, A2, b,e ) in- h i a “sorted-list” data structure L to record (ref- crements the global count c of intersections (falling D erences to) lines involved in all of the sampled inside b,e ) by the number of intersections induced h i intersections found so far. These (references to) by lines in A1 and A2 while recording intersections to lines are maintained in sorted

Counting Inversions The algorithm for counting all inversions in b,e is an extension of the iterative (n log2 n). Also, leaving out the code in Lines 6– h i mergesort algorithm: starting from the set of lines in O9, we may use Algorithm 1 to compute (b,e) . |I | sorted

179 22nd European Workshop on Computational Geometry, 2006

Lemma 3 The global extra cost incurred by updat- inside b0,e0 in the final iteration of the slope selec- h i ing the references stored in the data structure L tion algorithm. This finishes the proof of Theorem 2. 2 D while merging subarrays is in (r log2 r log2 n). O · · References 3.2 Finishing Up [1] P. Bose, A. Maheshwari, P. Morin, J. Morrison, After we have processed the first half of the input ar- M. Smid, and J. Vahrenhold. Space-efficient geomet- ray (using the second half to maintain the data struc- ric divide-and-conquer algorithms. Computational tures R, L, and I ), we reverse the roles of the Geometry: Theory & Applications, 2006. To appear. two subarrays.D D To doD so, we first need to copy the [2] H. Br¨onnimann and T. M.-Y. Chan. Space-efficient contents of the data structures to the first half of the algorithms for computing the convex hull of a simple array. The important detail to keep in mind is that, polygonal line in linear time. In Proc. Latin Amer- as a result of the inversion-counting algorithm, the ican Theoretical Informatics, LNCS 2976, pp. 162– 171. 2004. first half of the array is sorted according to

180