On the Equivalence of Tries and Dendrograms - Efficient Hierarchical Clustering of Traffic Data

Chia-Tung Kuo, University of California, Davis, [email protected]
Ian Davidson, University of California, Davis, [email protected]

arXiv:1810.05357v1 [cs.DB] 12 Oct 2018

ABSTRACT

The widespread use of GPS-enabled devices generates voluminous and continuous amounts of traffic data, but analyzing such data for interpretable and actionable insights poses challenges. A hierarchical clustering of the trips has many uses such as discovering shortest paths, common routes and often traversed areas. However, hierarchical clustering typically has time complexity of $O(n^2 \log n)$, where n is the number of instances, and is difficult to scale to the large data sets associated with GPS data. Furthermore, incremental hierarchical clustering is still a developing area. Prefix trees (also called tries) can be efficiently constructed and updated in linear time (in n). We show how a specially constructed trie can compactly store the trips and further show that this trie is equivalent to a dendrogram that would have been built by classic agglomerative hierarchical algorithms using a specific distance metric. This allows creating hierarchical clusterings of GPS trip data and updating this hierarchy in linear time. We demonstrate the usefulness of our proposed approach on a real world data set of half a million taxis' GPS traces, well beyond the capabilities of agglomerative clustering methods. Our work is not limited to trip data and can be used with other data with a string representation.

ACM Reference format:
Chia-Tung Kuo and Ian Davidson. 2016. On The Equivalence of Tries and Dendrograms - Efficient Hierarchical Clustering of Traffic Data. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 9 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, Washington, DC, USA
© 2016 ACM. 978-x-xxxx-xxxx-x/YY/MM...$15.00
DOI: 10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Location tracking devices have become widely popular over the last decade and this has enabled the collection of large amounts of spatial temporal trip data. Given a collection of such trip data, clustering is often a natural start to explore the general properties of the data. Among clustering methods, hierarchical clustering is well suited as it provides a set of groupings at different levels where each grouping at one level is a refinement of the groupings at the previous levels. This dendrogram structure can also naturally represent the evolutionary/temporal nature of trip data where each level in the dendrogram corresponds to a particular time. However, one significant drawback of standard agglomerative hierarchical clustering is its $O(n^2 \log n)$ run time, which is prohibitive when the number of instances, n, is large. For example, in our experiments our data set has nearly half a million trips. Classic agglomerative algorithms would take days or even weeks to build a dendrogram on such a large data set.

In this paper we consider an alternative way to efficiently construct a dendrogram of the trip data without the long computation time. We achieve this by first converting each trip to a trajectory string and then building a prefix tree from the trip-strings. We then formally show the equivalence between a prefix tree and a dendrogram by showing that the prefix tree we create would have been built by a classic agglomerative method using a specific distance metric on the strings. This result is not trivial as the prefix tree is created top-down and the dendrogram bottom-up.

Creating Trajectory Strings from GPS Data. We discretize both the spatial and temporal dimensions with respective predefined resolutions, effectively converting each spatial location to a unique symbol. For example, in our experiments we discretize the San Francisco Bay area into a 100 × 100 grid so our alphabet contains 10,000 symbols. We can then naturally represent a trip as a sequence (i.e. string) of the discretized regions (symbols). The symbol at position i in the string represents the location of the trip at time step i, as shown in Figure 1.

Creating Trip Tries. A prefix tree is a tree structure built from strings where each path from the root to any node corresponds to a unique string prefix; it is commonly used for indexing and retrieval of text and symbolic data. A prefix tree (trie) can be constructed in time linear in both the number of trips, n, and the maximum number of discretized time steps, l. An example of such a tree is shown in Figure 2.

Uses of Trip Tries. A trip trie is not only a hierarchical clustering (as we shall see) but has other uses. For example, easy to understand visualizations of a collection of trips such as heat maps (see Figures 4 and ) can be created from a trip trie; and trip tries constructed from different collections of trips can be compared (i.e. Table 1). Tries have many useful properties such as the ability to efficiently compute Levenshtein distances, and we describe uses such as creating more robust clusters using these properties. Though tries are commonly used in the database literature for tasks such as retrieval and indexing, to our knowledge they have not been used for the purposes we outline in this paper.

Uses Beyond GPS Trip Data. In this paper we have focused on GPS trip data as the application domain is important and has readily available public data. However, our work is applicable in other domains where the data represents behavior over time, such as settings where some categorical event (a symbol) occurs over time (the position of the symbol). In our earlier work [8] we modeled behavioral data as these event strings, but other applications exist in areas such as computer network traffic where each location is an IP address.

Our contributions can be summarized as follows.

• We provide a novel way to organize trip data into symbolic data and then a prefix tree/trie (see section 2). Tries can be built and updated in time linear in the number of instances and alphabet size.
• We derive the equivalence between a prefix tree and a dendrogram and verify it empirically (see Theorem 3.1 and derivation in section 3).
• We discuss extensions of our work including uses beyond hierarchical clustering such as outlier detection (see section 4).
• We demonstrate the usefulness of our dendrogram in illustrating interesting insights of trip data on a real data set of GPS traces of taxis (see section 5).

Our paper is organized as follows. We describe the steps to create string representations and a trip trie in section 2, which is then followed by a proof of its equivalence to standard agglomerative hierarchical clustering in section 3. We further discuss other ways this trip trie can be used in section 4. In section 5 we evaluate our approach on a real world dataset of taxis' GPS traces and demonstrate the usefulness of our approach in obtaining insights on the traffic dynamics. We discuss related work in section 6 and conclude our paper in section 7.

[Figure 2 image: a trie whose root is labeled ?????, with children z1???? ... z9???? and grandchildren z1z2??? ... z1z5???; a histogram is attached to one node.]
Figure 2: An example of a trip trie. Each node corresponds to a prefix which matches a subset of the trips, with the symbol "?" meaning unknown as yet. The root of the tree contains all unknown and denotes the beginning of a trip, the next level time step 1 and so on. The user can store additional relevant data about the trips at each node, such as the trip durations, as shown in the histogram.

[...] irregular time intervals and later in experiments we will use interpolation/extrapolation. Our approach consists of three major steps, shown in corresponding order in Figure 3(a), 3(b) and 3(c).

(1) Discretization of the geographic space: This step breaks the modeled space into a finite set of distinct non-overlapping regions $Z = \{z_1, z_2, \ldots, z_s\}$ which we will use as symbols in an alphabet (see Figure 1). This preprocessing is carried out before the major algorithm is applied, just like most work in trajectory mining [11][4]. In our experiment we use equal-sized rectangular grids over the modeled space; this allows a straightforward mapping from actual spatial coordinates to the symbols; however, our method can be used with regions of any shape. It is worth noting that the actual number of regions with activity is typically much smaller than the possible number of grid cells due to the physical presence of roads. This is a desirable property allowing large geographic areas to be efficiently represented. For example, in our experiments, though we discretize the San Francisco Bay area into 10,000 cells, less than 20% of them see any activity.
(2) Build trajectory strings: In this step we build a trajec-

Figure 1: An example that shows the construction of a trajectory string from the given spatial grids (in black) and temporal resolution.
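The pipeline outlined in the text (grid discretization, trajectory strings, trie construction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the bounding box `BOUNDS`, the helper names (`cell_symbol`, `trajectory_string`, `build_trie`, `clusters_at_depth`), and the choice of integer symbols are all assumptions made for illustration, and the sketch assumes each trip has already been resampled to regular time steps.

```python
from dataclasses import dataclass, field

# Hypothetical bounding box (lat_min, lat_max, lon_min, lon_max); not from the paper.
BOUNDS = (37.0, 38.0, -123.0, -122.0)
GRID = 100  # a 100 x 100 grid gives 10,000 possible symbols


def cell_symbol(lat, lon, bounds=BOUNDS, grid=GRID):
    """Map a coordinate to an integer symbol in [0, grid * grid)."""
    lat_min, lat_max, lon_min, lon_max = bounds
    row = int((lat - lat_min) / (lat_max - lat_min) * grid)
    col = int((lon - lon_min) / (lon_max - lon_min) * grid)
    # Clamp so that points on or outside the border fall into an edge cell.
    row = max(0, min(row, grid - 1))
    col = max(0, min(col, grid - 1))
    return row * grid + col


def trajectory_string(points, bounds=BOUNDS, grid=GRID):
    """Convert (lat, lon) points, one per time step, into a tuple of symbols."""
    return tuple(cell_symbol(lat, lon, bounds, grid) for lat, lon in points)


@dataclass
class TrieNode:
    count: int = 0  # number of trips whose string starts with this prefix
    children: dict = field(default_factory=dict)


def insert_trip(root, symbols):
    """Insert one trajectory string; cost is linear in its length."""
    root.count += 1
    node = root
    for s in symbols:
        node = node.children.setdefault(s, TrieNode())
        node.count += 1


def build_trie(trips):
    """Build a trie from n strings of length at most l in O(n * l) steps."""
    root = TrieNode()
    for symbols in trips:
        insert_trip(root, symbols)
    return root


def clusters_at_depth(root, depth):
    """Return {prefix: trip count} for every prefix of the given length."""
    level = {(): root}
    for _ in range(depth):
        level = {
            prefix + (s,): child
            for prefix, node in level.items()
            for s, child in node.children.items()
        }
    return {prefix: node.count for prefix, node in level.items()}


if __name__ == "__main__":
    trips = [(1, 2, 3), (1, 2, 4), (1, 5, 5), (9, 9, 9)]
    trie = build_trie(trips)
    print(clusters_at_depth(trie, 1))
    print(clusters_at_depth(trie, 2))
```

Each trie level in this sketch groups trips by a shared prefix, so level i partitions the trips by where they were during the first i time steps, and each deeper level refines the one above it. New trips can be added with `insert_trip` without rebuilding the structure, which is the incremental-update property the text emphasizes.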
