The power of two min-hashes for similarity search among hierarchical data objects

Sreenivas Gollapudi, Microsoft Research, Mountain View, CA 94043 ([email protected])
Rina Panigrahy, Microsoft Research, Mountain View, CA 94043 ([email protected])

Abstract

In this study we propose sketching for computing similarities between hierarchical data. Specifically, we look at data objects that are represented using leaf-labeled trees denoting a set of elements at the leaves organized in a hierarchy. Such representations are richer alternatives to a set. For example, a document can be represented as a hierarchy of sets wherein chapters, sections, and paragraphs represent different levels in the hierarchy. Such a representation is richer than viewing the document simply as a set of words. We measure distance between trees using the best possible super-imposition that minimizes the number of mismatched leaf labels. Our distance measure is equivalent to an Earth Mover's Distance measure, since leaf-labeled trees of height one can be viewed as sets, and it extends recursively to trees of larger height by viewing them as sets of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sensitive hashing (LSH), where the probability of two sketches matching is high when two trees are similar and low when the two trees are far under the given distance measure. Specifically, we compute sketches of such trees by propagating min-hash computations up the tree. Furthermore, we show that propagating one min-hash results in poor sketch properties, while propagating two min-hashes results in good sketches.

1 Introduction

The proliferation of information on the internet creates a huge amount of data in different formats. In the absence of rigid standards, ad hoc methods are used by different entities to represent data. For instance, a book may be represented hierarchically as a collection of chapters, and these in turn as a collection of sections. Such an abstraction may be represented as a tree in the emerging XML standard. Similarly, directory structures may be viewed as hierarchical entities that can be communicated in different formats. It then becomes a challenge to compare two such information corpora. For example, the same information can be represented under different isomorphic permutations of the underlying directory structure. A database record could be presented with attributes in a different order. In such cases, rather than looking at the exact tree structure representing the objects, it may be more appropriate to check the similarity of the two tree structures under all possible isomorphic permutations. Again, to measure the similarity between two directory structures, it may be more appropriate to ignore the directory names and check whether there is an ordering of the directory structure that minimizes the mismatch between the file contents. However, computing the ordering that minimizes the mismatch on the tree data becomes intractable as the degree and height of the tree increase.

Sketching is a powerful tool for concise representation and comparison of complex and large data. For instance, strings and sets can be compressed into points in Hamming space. This typically yields sketches represented by bit-vectors. Such sketches, although not identical to the original data, preserve many of its properties. For instance, sketches can be used to measure containment and distance [2]. Given the plethora of applications for sketching, there have been several attempts to compute sketches of more complex data such as trees and graphs. Some examples include file system directories, XML DOM trees, and phylogenetic trees [19, 14, 1, 6, 18]. Other works have presented sketch-based measurement of similarity between trees [1, 5]. Another important application of sketching sets is the comparison of documents such as web pages, wherein a document is viewed as a set of words. However, we lose the structure of the document by viewing it as a set. A hierarchical structure is richer in capturing the context in which words occur, such as a paragraph, chapter, etc. In this work, we propose a sketch-based algorithm via an Earth Mover's Distance metric for trees with provable guarantees.

1.1 Related Work

Similarity among trees has been widely studied in the context of edit distance between trees, i.e., the number of edits required to transform one tree into the other [13, 17, 18], while others have approached this problem of matching via alignment of trees, effectively computing the best alignment that minimizes the objective function [12]. The distance function adopted in this work is similar in spirit and is based on the optimal super-imposition that minimizes the number of mismatches in the set of leaf labels between the two trees. A more recent work of Augsten et al. [1] uses pq-grams to match hierarchical data. A sketch of the tree is composed of a set of pq-gram profiles, and sketch similarity is computed using well-known set similarity measures. This approach, however, is not resistant to permutations in the leaf labels when q = 1. The other case of q > 1 generates mismatches that are amplified by the shingling approach adopted to generate the pq-grams, thus resulting in a poor match between trees even when they are similar.

1.2 Contributions of this study

We introduce the distance between arbitrary weighted trees as the best possible super-imposition that results in the maximum number of matches between the trees. Specifically, we look at data objects that are represented using leaf-labeled trees denoting a set of elements at the leaves organized in a hierarchy. Our distance measure is equivalent to an Earth Mover's Distance measure, since leaf-labeled trees of height one can be viewed as sets, and it can be recursively extended to trees of larger height by viewing them as sets of sets. For weighted trees, this is equivalent to recursive fractional weighted matching between the leaves according to the tree hierarchy. We propose sketching algorithms that generalize the concept of min-wise independent permutations to trees. We analyze the sketching algorithms in the framework of locality-sensitive hashing and show how our sketch functions can be used to perform nearest-neighbor search among trees.

Specifically, we use sketching algorithms on sets to compute their min-hashes and then propagate the min-hashes up the hierarchy. We show that propagating only one min-hash can result in poor similarity guarantees. For example, two trees that are far apart can produce very similar sketches using this scheme. Finally, we show that by propagating two min-hashes for each set of labels, we can compute similarity accurately between two trees in the context of locality-sensitive hashing. Although our provable guarantees hold only under the assumption that the trees are of uniform height and leaves are uniquely labeled, the algorithms presented are useful in more general practical scenarios.

Specifically, for our distance measures we show how to find an approximate nearest neighbor of a query tree. Given a database of N trees, we can construct a data structure of size N^{1+ρ} that can be used to compute an approximate nearest neighbor in time N^ρ, where ρ = O(log(1/(1−δ)) / log(1/ǫ)). Given a query point whose nearest neighbor is within distance δ, our search will return a neighbor at distance at most 1 − ǫ from the query point; essentially, our algorithm returns a (1−ǫ)/δ-approximate nearest neighbor when the query point has a δ-near neighbor.

2 Models and Definitions

In this section, we present the preliminaries for hierarchies as well as introduce notions of similarities between hierarchies. We classify trees as either labeled or unlabeled and ordered or unordered. In the labeled case, all nodes in the tree are labeled. A common example of such a hierarchy is a file system directory structure. Leaf-labeled trees, where only the leaves are labeled, have no labels on the internal nodes. For example, we could consider only the names and content of all files under a directory to compute the similarity between two directories; in such a case, we ignore the names of the directories themselves. An ordered tree has all leaves in a defined order. For a tree of height h (a path from the root to a leaf has h edges), we will say that the leaves are at height 0 and the root is at height h. For example, a string can be considered an ordered tree of height one.

Observe that in practice two trees are considered similar even if one is obtained by an isomorphic reordering of nodes. In this case, we assume the trees to be unordered. We also assume that leaves are uniquely labeled.

Before we analyze the similarity measures for leaf-labeled trees, we introduce well-studied notions of similarity between unordered entities such as sets, and how these notions have been successfully used in defining similarity measures between ordered sequences such as strings. Set similarity is computed using measures such as intersection and (symmetric) difference. String similarity is measured using edit distances. While there are many efficient algorithms for sketching sets, such as min-hash computation [3], obtaining efficient sketching algorithms for strings with strong guarantees is much harder. The best known algorithms for sketching strings work by embedding strings into L1, and the best known embedding has distortion 2^{O(√(log n log log n))} [15], which is far less efficient compared to sketching algorithms for sets, which produce almost no distortion.

Many of these algorithms are based on shingling [2], which essentially transforms an ordered sequence of characters into a set of substrings known as shingles. Then, by applying standard sketching algorithms on these sets, the similarity between the original strings is estimated.

In this work, we present algorithms for unordered leaf-labeled weighted trees. We believe our algorithms may be extended to ordered trees just as shingling-based methods extend algorithms on sets to algorithms on strings. Furthermore, we consider general trees of arbitrary shape with non-uniform degrees but of the same height. More generally, we allow the nodes to be weighted, where the total weight of all children under any given node is 1.

Before we analyze similarity measures for trees, we introduce some of the basic concepts we employ in our sketching algorithms. First is the concept of sketching sets using the technique of min-wise independent permutations, often referred to as min-hash [3]. The commonly used similarity measure for sets A and B is sim(A, B) = |A ∩ B| / |A ∪ B| [3]. A simple method for estimating the similarity between two sets (or bags) is the min-hash technique introduced by [10].

A weighted tree of height one can be viewed as a weighted set (a bag) instead of a set, with leaves at height 0. In this case, the similarity measure is computed via min-hash computations on bags. This extension can be easily introduced by replacing a bag A = {(a1, α1), ..., (an, αn)} with integral weights by a set Ã = {a_{1,1}, a_{1,2}, ..., a_{1,α1}, ..., a_{n,1}, a_{n,2}, ..., a_{n,αn}}, where a_{i,j} is an element obtained by concatenating the bag element ai with a frequency j. Note that this transformation can be extended to bags with real weights as well. Under this transformation, we can easily show |A ∩ B| = |Ã ∩ B̃| and |A ∪ B| = |Ã ∪ B̃| [10, 9], where the intersection (union) over bags is obtained by assigning to each element a weight equal to the minimum (maximum) of its weights in the two bags Ã and B̃. Thus, under a given permutation π : [|U|] → [|U|], we have Pr[MHπ(A) = MHπ(B)] = |A ∩ B| / |A ∪ B| = Pr[MHπ(Ã) = MHπ(B̃)] = |Ã ∩ B̃| / |Ã ∪ B̃| [10, 9].

Definition 1. Let U denote the universal set. Given a set A ⊆ U and a permutation π : [|U|] → [|U|], we define the min-hash MHπ(A) to be argmin_x{π(x) | x ∈ A}. Essentially, MHπ(A) is the element in A whose value in the permutation is the minimum. Alternately, let f(x), x ∈ U, be a real-valued (hash) function that maps elements from the universe U to a real number randomly and uniformly in the interval [0, 1]. Then MHf(A) = argmin_x{f(x) | x ∈ A}. Therefore, MHf(A) is the element in A whose hash value into the interval [0, 1] is minimum. Note that this definition of min-hash can be applied to weighted sets or bags as well.
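To make Definition 1 concrete, here is a minimal Python sketch (our own illustration, not code from the paper): a salted hash into [0, 1) plays the role of f, min_hash implements MHf, and expand_bag performs the bag-to-set transformation Ã described above.

```python
import hashlib

def hash01(x, salt):
    """Map element x to a pseudo-random value in [0, 1), playing the role of f in Definition 1."""
    digest = hashlib.sha256(f"{salt}|{x}".encode()).hexdigest()
    return int(digest, 16) / 16**64

def min_hash(elements, salt):
    """MH_f(A): the element of A whose hash value in [0, 1) is smallest."""
    return min(elements, key=lambda x: hash01(x, salt))

def expand_bag(bag):
    """Replace an integer-weighted bag {(a_i, alpha_i)} by the set of (element, frequency) pairs."""
    return {(a, j) for a, alpha in bag.items() for j in range(1, alpha + 1)}

if __name__ == "__main__":
    A = {"cat", "dog", "fish"}
    B = {"cat", "dog", "bird"}
    # Estimate |A ∩ B| / |A ∪ B| by the fraction of salts on which the min-hashes agree.
    salts = range(200)
    agree = sum(min_hash(A, s) == min_hash(B, s) for s in salts) / len(salts)
    print("estimated Jaccard:", agree, "exact:", len(A & B) / len(A | B))

    # Weighted bags: the same estimator applied to the expanded sets.
    bagA, bagB = {"x": 2, "y": 1}, {"x": 1, "y": 2}
    agree_bags = sum(min_hash(expand_bag(bagA), s) == min_hash(expand_bag(bagB), s)
                     for s in salts) / len(salts)
    print("estimated bag similarity:", agree_bags)
```

Averaging the agreement of min-hashes over independent salts estimates the set (or bag) similarity, which is the standard use of min-wise sketches.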
A common measure used to compute similarity between two sets of points, where the points are chosen from a metric space, is the Earth Mover's Distance. Here each set can be viewed as a distribution of weights over the metric space, where the sum of the weights adds up to 1. We now define the Earth Mover's Distance between two such distributions.

Definition 2 ([4]). Let (X, d) be a metric on a set X of weighted elements (xi, wi), 1 ≤ i ≤ n, where d(., .) is the underlying distance measure between elements. Let P(X) denote a distribution of non-negative weights u1, u2, ..., un on X such that Σi ui = 1. The Earth Mover's Distance is a measure between two such distributions P(X) = u1, u2, ..., un and Q(X) = v1, v2, ..., vn. Specifically, the Earth Mover's Distance defines the optimal cost of transforming P(X) into Q(X) and can be formally stated as

EMD(P(X), Q(X)) = min Σij fij d(i, j)
subject to  ∀i: Σj fij = ui,  ∀j: Σi fij = vj,  and  ∀i,j: fij ≥ 0.

In fact, the EMD corresponds to a fractional weighted matching in a bipartite graph, with the nodes on either side of an edge corresponding to elements of X with weights ui and vj respectively. Thus, the total weight of the matching edges at a node equals the weight of the point. This distance measure can be generalized to a distance measure on leaf-labeled trees of the same height. Observe that a tree of height one can be viewed as a weighted set (X, P(X)), where X is the set of leaf labels and P(X) denotes the weights on the leaves. The above EMD measure defines the distance between two such trees (X, P(X)) and (X, Q(X)) of height one. This can be extended to trees of larger height easily by treating X as the set of subtrees at height h − 1, with the weights being assigned to subtrees instead of leaves as in trees of height one (leaves are at height 0).
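As a small worked example of Definition 2 in the height-one case just described, the following sketch (again our own code) computes the EMD between two weight distributions over the same label set when the underlying distance is 0 for identical labels and 1 otherwise; in that special case the optimal flow leaves min(P(x), Q(x)) mass in place for each label, so EMD(P, Q) = 1 − Σx min(P(x), Q(x)).

```python
def emd_zero_one(P, Q):
    """EMD between two weight distributions over labels, assuming the base distance is
    d(x, y) = 0 if x == y and 1 otherwise, and sum(P.values()) == sum(Q.values()) == 1.
    The optimal flow keeps min(P[x], Q[x]) mass in place for every label; everything else
    must move at cost 1."""
    labels = set(P) | set(Q)
    overlap = sum(min(P.get(x, 0.0), Q.get(x, 0.0)) for x in labels)
    return 1.0 - overlap

if __name__ == "__main__":
    # Two height-one trees viewed as weighted leaf sets (X, P(X)) and (X, Q(X)).
    P = {"a": 0.5, "b": 0.25, "c": 0.25}
    Q = {"a": 0.25, "b": 0.25, "d": 0.5}
    print(emd_zero_one(P, Q))   # 0.5: half of the mass has to move to a different label
```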

Sketching functions can be used to estimate similarity between two objects by comparing their sketches. Given a sketching function that results in a low matching probability when the underlying objects are dissimilar and a higher probability when the objects are similar, we can construct efficient data structures for approximate nearest neighbor search on a database of objects. Such sketching functions are also called locality-sensitive hash (LSH) functions [7]. For a domain X of points with distance measure d, an LSH family of functions is defined as follows.

Definition 3. A family H = {h : X → U} is called (r1, r2, p1, p2)-sensitive for d if for any v, q ∈ X
• if d(v, q) ≤ r1, then Pr_H[h(q) = h(v)] ≥ p1.
• if d(v, q) > r2, then Pr_H[h(q) = h(v)] ≤ p2.

Indyk and Motwani [11] show how LSH can be used for nearest neighbor searches in high dimensions.

Theorem 1 ([7]). Suppose there is a (R, cR, p1, p2)-sensitive family H for a distance measure d. Given a query point with a nearest neighbor within distance R, there is an algorithm to compute a c-approximate nearest neighbor, i.e., a neighbor within distance cR, using an index of O(N^{1+ρ}) space and query time dominated by O(N^ρ) distance computations, where N is the size of the database of points and ρ = ln(1/p1) / ln(1/p2), for N > 1/p2.

An information-theoretic formulation of LSH resulting in linear-size data structures has been studied in [16].

3 Similarity measures for trees

In this study we focus on weighted leaf-labeled trees of arbitrary shape with a given height h. The notion of similarity between two such trees is measured by the best possible super-imposition that minimizes the number of mismatched leaf labels. Given two trees of height h, we view them as weighted sets of trees of height h − 1. Given that the total weight of all children under any given node is at most 1, we can recursively extend the distance measure to trees of height h as the Earth Mover's Distance between their weighted sets of trees of height h − 1.

Definition 4 (distance). For two trees T1 and T2 of height zero (trees are singleton leaves), the distance d(T1,T2) = 0 if the singleton leaves have the same label and 1 otherwise. Using this base case, we can recursively define the distance between two trees of height h by viewing such trees as weighted sets of trees of height h − 1. If X denotes the space of all leaf-labeled trees of height h − 1, then any such weighted set with total weight 1 can be represented by a distribution. So, for two trees T1 and T2 of height h, we can obtain the corresponding distributions P(X) and Q(X) over trees of height h − 1. We then define d(T1,T2) = EMD(P(X), Q(X)), where the underlying metric uses the distance between subtrees of height h − 1 recursively. More precisely, let T1 = {(a1, α1), ..., (an, αn)}, where ai is a subtree of height h − 1 with weight αi, and let T2 = {(b1, β1), ..., (bn, βn)}, with bj and βj correspondingly defined. Then d(T1,T2) = min Σij fij d(ai, bj), where ∀i: Σj fij = αi, ∀j: Σi fij = βj, and ∀i,j: fij ≥ 0.

Definition 5 (similarity). For trees T1 and T2 of height h, the similarity between them is defined as sim(T1,T2) = 1 − d(T1,T2). Alternately, similarity can also be defined recursively as the maximum weighted fractional matching between the weighted sets of trees of height h − 1.
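Definition 4 can be computed directly for small trees: recursively evaluate the pairwise subtree distances, then solve the small transportation linear program above on the child weights. The sketch below is a minimal illustration under the paper's assumptions (uniform height, child weights summing to 1); the tree representation (a leaf label, or a list of (subtree, weight) pairs) and the use of scipy.optimize.linprog are our own choices, not prescribed by the paper.

```python
from scipy.optimize import linprog

def tree_distance(T1, T2):
    """Definition 4: d = 0/1 on leaf labels, and EMD over child subtrees otherwise."""
    if isinstance(T1, str) or isinstance(T2, str):               # height-zero case (leaves)
        return 0.0 if T1 == T2 else 1.0
    n, m = len(T1), len(T2)
    cost = [tree_distance(a, b) for a, _ in T1 for b, _ in T2]   # flattened n*m cost vector
    # Equality constraints: sum_j f_ij = alpha_i and sum_i f_ij = beta_j.
    A_eq, b_eq = [], []
    for i in range(n):
        A_eq.append([1.0 if k // m == i else 0.0 for k in range(n * m)])
        b_eq.append(T1[i][1])
    for j in range(m):
        A_eq.append([1.0 if k % m == j else 0.0 for k in range(n * m)])
        b_eq.append(T2[j][1])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

def tree_similarity(T1, T2):
    """Definition 5: sim = 1 - d."""
    return 1.0 - tree_distance(T1, T2)

if __name__ == "__main__":
    # Two height-two trees whose subtrees appear in a different order.
    T1 = [([("a", 0.5), ("b", 0.5)], 0.5), ([("c", 0.5), ("d", 0.5)], 0.5)]
    T2 = [([("c", 0.5), ("d", 0.5)], 0.5), ([("a", 0.5), ("b", 0.5)], 0.5)]
    print(tree_distance(T1, T2))   # 0.0: the trees match under a re-ordering of subtrees
```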
We say that a tree T2 is a δ-near neighbor of a tree T1 if d(T1,T2) < δ. Similarly, a tree T2 is (1 − ǫ)-far from a tree T1 if sim(T1,T2) < ǫ.

Under our distance measure, the distance between any two trees lies in [0, 1]. Given two trees of height 1, let Ã and B̃ be their weighted sets of leaves (with total weight 1 each). Then the above similarity measure coincides with the bag intersection, i.e., sim(T1,T2) = |Ã ∩ B̃|. This implies |Ã ∩ B̃| = 1 − d(T1,T2) and |Ã ∪ B̃| = 1 + d(T1,T2). We also note that for complete unweighted trees of uniform degree, the distance measure d(., .) ∈ [0, 1] is indeed the fraction of unmatched leaf labels under the best possible super-imposition. For two unweighted trees of height 1 with the same number of leaves, the EMD measure is the same as the fraction of unmatched leaf labels. This is because for such trees the best fractional matching becomes an integral matching. The same reasoning applies recursively for larger heights for two complete trees of uniform degree.

Since the objects of interest in this study are trees, our goal is to study tree sketching algorithms in the LSH framework. We will refer to such a sketch of a tree T as the tree-hash TH(T). The framework studies the probability of the sketches matching under a gap in the distance between the trees. Using Theorem 1, this framework can be used to find a near neighbor of a query tree. Given a database of N trees, we can construct a data structure of size N^{1+ρ} that can be used to compute an approximate nearest neighbor in time N^ρ, where ρ = O(log(1/(1−δ)) / log(1/ǫ)). Given a query point whose nearest neighbor is within distance δ, our search algorithm will return a neighbor at distance at most 1 − ǫ from the query point; essentially, our algorithm returns a (1−ǫ)/δ-approximate nearest neighbor when the query point has a δ-near neighbor. Formally, we will lower bound the probability p1 of the sketches matching when d(T1,T2) < δ and upper bound the probability p2 when d(T1,T2) > 1 − ǫ (correspondingly, sim(T1,T2) ≤ ǫ). Given two trees T1 and T2 of the same uniform height, let sij denote sim(T1i, T2j), where Tki denotes the i-th sub-tree of tree Tk. We have,

Lemma 1. Σi sij ≤ 1 and Σj sij ≤ 1.

Proof. We will say that two trees are disjoint if their sets of leaf labels are disjoint. Since all the leaves of a tree are uniquely labeled, we observe that all the sub-trees of T1 of a given height are disjoint. The lemma follows if we show that for any tree U of height h and a collection of disjoint trees (leaf sets are disjoint) V1, V2, ..., Vn of the same height h, Σk sim(U, Vk) ≤ 1. We will prove this by induction on the height h. The base case when h = 0 is obvious, since for h = 0 each tree Vk is a singleton leaf with a distinct label. Therefore, Σk sim(U, Vk) ≤ 1, since U's label can match the label of at most one of the Vk's. We use induction on the height of the tree. Note that U can be viewed as a weighted set of trees ui of height h − 1 with weights αi. Therefore U = {(u1, α1), ..., (un, αn)}. Similarly, let vkj be the sub-trees of height h − 1 of Vk. Now, consider the fractional matching f^k_ij which corresponds to the similarity sim(U, Vk). Then,

sim(U, Vk) = Σij f^k_ij sim(ui, vkj)

But f^k_ij ≤ αi since Σj f^k_ij ≤ αi. So,

Σk sim(U, Vk) = Σk Σij f^k_ij sim(ui, vkj)
 ≤ Σi Σkj αi sim(ui, vkj)
 = Σi αi Σkj sim(ui, vkj)

Observe that the trees vkj are disjoint. So, by induction, Σkj sim(ui, vkj) ≤ 1, giving us Σk sim(U, Vk) ≤ Σi αi ≤ 1.

4 Sketching algorithms for trees of height two

Trees of height one are the same as sets and, therefore, sketching such trees is equivalent to sketching a set. We note that trees of height two are good abstractions for recursing the similarity computations from a set to a set of sets. Therefore, they present a good starting point for experimenting with different approaches. We will show a sketching algorithm for trees of height two where the sketches match with high probability if the trees are similar and with low probability if the trees are far: if the trees are δ-near, the probability that the sketches differ is at most O(δ); on the other hand, if the trees are (1 − ǫ)-far, the probability that the sketches match is at most O(ǫ log(1/ǫ)). In the following section, we contrast different algorithms, starting from naive extensions of sketching algorithms on sets to algorithms on sets of sets. We generalize many of the concepts introduced for trees of height two to trees of larger height in Section 5. We show how algorithms for sketching sets can be extended recursively to such trees. We start with a naive approach and show that it does not admit accurate similarity computations. We then show an approach that does produce good sketches resulting in effective similarity computations.

4.1 Propagating one min-hash does not work

A naive recursive extension is to compute the min-hash for each sub-tree of height one and then compute the min-hash of the resulting min-hashes at level one. Where applicable, a tree T of height one may also be viewed as a weighted set of its leaves. Furthermore, we use a different min-hash permutation at each level. We will show that this method does not work, as two completely different trees can result in the same min-hash with high probability.

Let π1 and π2 denote the min-hash permutations used on the leaves and the nodes at height one, respectively. Then, if T1, T2, ..., Tn are the height-one subtrees of T, the tree hash is TH(T) = MHπ2(MHπ1(T1), ..., MHπ1(Tn)). The following two propositions show that a random assignment of a given set of labels results in very different trees, but with the same tree-hash with significant probability. For this we will consider trees of height two that are unweighted and have degree n, with n² leaves.

Proposition 1. For given permutations π1 and π2 on the two levels of the tree and a set of n² labels, if the labels are randomly assigned without replacement to the n² leaves, there is some fixed label among these so that with at least constant probability Ω(1), the min-hash of the tree will result in that label.

Proof. Let l1, l2, ..., l_{n²} denote the labels according to the min-hash permutation order π1. We will show that with constant probability the tree-hash will be equal to MHπ2(l1, l2, ..., l_{n²}). Essentially, we will show that the set of min-hashes at level one has a big overlap with the set H = {l1, l2, ..., ln}. Precisely, if A is the set of min-hashes at level one, then E[|A ∩ H|] ≥ n/2. This is because the probability of l1 being present in A is 1; l2 is present in A if l1 is not in the same subtree, which happens with probability at least 1 − 1/n. Similarly, li (i ≤ n) is present in A with probability at least 1 − (i − 1)/n. By linearity of expectation, E[|A ∩ H|] ≥ 1 + (1 − 1/n) + ... + 0 ≥ n/2. From set similarity, it follows that Pr[MHπ2(A) = MHπ2(H)] = |A ∩ H| / |A ∪ H| > (n/2)/(2n) = 1/4.

Proposition 2. Given two different random assignments (without replacement) of a given set of n² labels, then w.h.p. the distance between the resulting trees is 1 − o(1).

Proof. Consider one set of leaves in a subtree. The expected number of leaves from this set present in any given subtree of the other tree is 1. By Chernoff bounds, w.h.p. the overlap between any two subtrees across the trees is O(log n). So, even under the best super-imposition, the distance is 1 − O(log n)/n.

Given two random trees of degree n and height two, they therefore have the same hash value with probability Ω(1) but have a large distance with high probability.
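The effect behind Propositions 1 and 2 is easy to observe empirically. The sketch below (our own illustration; salted hash-based min-hashes stand in for the random permutations π1 and π2) builds pairs of height-two trees from random assignments of the same n² labels, applies the naive single-min-hash propagation TH(T) = MHπ2(MHπ1(T1), ..., MHπ1(Tn)), and counts how often the sketches collide even though such trees are at distance 1 − o(1).

```python
import hashlib, random

def hash01(x, salt):
    d = hashlib.sha256(f"{salt}|{x}".encode()).hexdigest()
    return int(d, 16) / 16**64

def min_hash(elements, salt):
    return min(elements, key=lambda x: hash01(x, salt))

def naive_tree_hash(tree, salt_leaf, salt_root):
    """Naive scheme: one min-hash per height-one subtree, then a min-hash of those."""
    level_one = [min_hash(leaves, salt_leaf) for leaves in tree]
    return min_hash(level_one, salt_root)

def random_tree(labels, n):
    """Assign n*n labels to n subtrees of n leaves each, in random order."""
    shuffled = labels[:]
    random.shuffle(shuffled)
    return [shuffled[i * n:(i + 1) * n] for i in range(n)]

if __name__ == "__main__":
    n, trials, collisions = 20, 200, 0
    labels = [f"l{i}" for i in range(n * n)]
    for _ in range(trials):
        T1, T2 = random_tree(labels, n), random_tree(labels, n)
        if naive_tree_hash(T1, 0, 1) == naive_tree_hash(T2, 0, 1):
            collisions += 1
    # A constant fraction of collisions, even though such trees are at distance 1 - o(1).
    print(collisions / trials)
```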
4.2 Propagating multiple min-hashes at each height

We now consider the following modification to the naive algorithm, wherein we compute more than one min-hash at all nodes in the tree. Specifically, each height-one node computes c min-hashes of the set of leaves in its subtree. The compound min-hash, CMH(T), of the tree T rooted at the height-one node is computed by concatenating all the c min-hashes. Then the root node computes one min-hash from the compound min-hashes at the height-one nodes, as shown in Algorithm 1. We note that we keep the number of permutations c constant; it can hence be treated as an implicit parameter in all our analysis.

Algorithm 1 TH(T)
1: Π ← {π1,1, π2,1, ..., πc,1}, the permutations used by height-one nodes
2: πr,1 ← the permutation used by the root node; n ← degree of T
3: for all subtrees Ti do
4:   Li ← leaves(Ti)
5:   CMH(Ti) ← MHπ1,1(Li) • MHπ2,1(Li) • ... • MHπc,1(Li)
6: end for
7: return MHπr,1{CMH(T1), ..., CMH(Tn)}
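A direct transcription of Algorithm 1 might look as follows (a minimal Python sketch with our own naming; salted hash functions stand in for the permutations πi,1 and πr,1, and the default c = 2 reflects the choice analyzed below).

```python
import hashlib

def hash01(x, salt):
    d = hashlib.sha256(f"{salt}|{x}".encode()).hexdigest()
    return int(d, 16) / 16**64

def min_hash(elements, salt):
    return min(elements, key=lambda x: hash01(x, salt))

def compound_min_hash(leaves, c):
    """CMH(T_i): concatenation of c min-hashes of the leaf set, one per permutation."""
    return "\x1f".join(min_hash(leaves, ("level1", k)) for k in range(c))

def tree_hash(tree, c=2):
    """Algorithm 1: a min-hash, under the root permutation, of the compound min-hashes."""
    cmhs = [compound_min_hash(leaves, c) for leaves in tree]
    return min_hash(cmhs, "root")

if __name__ == "__main__":
    # Two height-two trees that differ only in the order of their subtrees.
    T1 = [["a", "b"], ["c", "d"], ["e", "f"]]
    T2 = [["e", "f"], ["a", "b"], ["c", "d"]]
    print(tree_hash(T1) == tree_hash(T2))   # True: the sketch ignores subtree order
```

Because the root min-hash is taken over the multiset of compound min-hashes, the sketch is invariant to the order of the subtrees, which is exactly the behavior the distance measure requires.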

We now present the bounds on the probability of the sketches matching for this case.

4.2.1 Lower bound p1 on the matching probability

We now prove a lower bound on Pr[TH(T1) = TH(T2)] given d(T1,T2) ≤ δ. Let T1 = {(a1, α1), ..., (an, αn)}, where ai is a subtree of height one with weight αi, and let T2 = {(b1, β1), ..., (bn, βn)}.

Lemma 2. Given d(T1,T2) ≤ δ, Pr[TH(T1) = TH(T2)] ≥ max{2^{−(c+1)}(1 − δ)^c, 1 − 4cδ}.

Proof. From the distance measure EMD(T1,T2), it follows that there is a fractional matching fij so that the distance between T1 and T2 is equal to Σij fij d(ai, bj) ≤ δ. Using this bound on the distance, we need to prove a lower bound on the probability that the two sketches match.

For a subtree of height one, ai from T1 and bj from T2, the probability that a min-hash computation will result in the same hash value for the two trees is |ai ∩ bj| / |ai ∪ bj| = (1 − d(ai,bj)) / (1 + d(ai,bj)). So the probability that all c min-hashes match for the two subtrees is at least ((1 − d(ai,bj)) / (1 + d(ai,bj)))^c. Let A and B denote the weighted sets of compound min-hashes in the tree-hash computation of T1 and T2 respectively. Then the probability that the final sketches match is E[|A∩B| / |A∪B|] = E[|A∩B| / (2 − |A∩B|)] ≥ E[|A∩B|] / (2 − E[|A∩B|]) (by concavity of x/(2 − x) in the range [0, 1]). But E[|A∩B|] ≥ Σij fij Pr[CMH(ai) = CMH(bj)] ≥ Σij fij ((1 − d(ai,bj)) / (1 + d(ai,bj)))^c. By concavity of ((1 − x)/(1 + x))^c and Jensen's inequality, this expectation is at least ((1 − Σij fij d(ai,bj)) / (1 + Σij fij d(ai,bj)))^c ≥ ((1 − δ)/(1 + δ))^c ≥ 1 − 2cδ. Since x/(2 − x) is an increasing function of x, E[|A∩B| / |A∪B|] ≥ (1 − 2cδ)/(2 − (1 − 2cδ)) = (1 − 2cδ)/(1 + 2cδ) ≥ 1 − 4cδ. Therefore the probability that the sketches match is at least 1 − 4cδ, giving us the proof for the second part. The proof for the first part follows by observing that E[|A∩B| / |A∪B|] ≥ E[|A∩B|]/2 ≥ (1/2)((1 − δ)/(1 + δ))^c ≥ 2^{−(c+1)}(1 − δ)^c.

Computing the upper bound depends on whether we consider the overlap between every pair of subtrees to be bounded by ǫ, or we use the more general case wherein the average overlap is bounded by ǫ. We consider both cases next.

4.2.2 Upper bound p2 on the matching probability

In this section, we will upper bound Pr[TH(T1) = TH(T2)] given d(T1,T2) > 1 − ǫ.

Lemma 3. Given two trees T1 and T2 of height two and similarity ǫ, the probability their sketches match is at most O(ǫ log(1/ǫ)).

Let dij denote the distance d(ai, bj). Then for all fij satisfying Σj fij ≤ αi and Σi fij ≤ βj, we have Σij fij dij ≥ 1 − ǫ. Let A and B denote the weighted sets of min-hashes at level one in the tree-hash computation of T1 and T2 respectively. Let sij = 1 − dij. Then E[|A ∩ B|] ≤ Σij min(αi, βj) Pr[CMH(Ai) = CMH(Bj)] = Σij min(αi, βj)(sij / (2 − sij))^c ≤ Σij θij sij^c, where θij = min(αi, βj).

We have the upper bound

p2 = E[|A∩B| / |A∪B|] ≤ E[|A ∩ B|] ≤ Σij θij sij^c.

Special case - To simplify, we first bound p2 assuming that the similarity sij between every pair of subtrees is bounded by ǫ and root nodes have degree n. We know that sij is at most ǫ, Σi sij ≤ 1, and Σj sij ≤ 1. The sum Σij sij^c is maximized when each sij is either ǫ or 0. Exactly n/ǫ of them will be non-zero, giving a maximum value of (n/ǫ)ǫ^c for the sum. So the probability is bounded above by ǫ^{c−1} (since each θij = 1/n). This shows that for c = 1 the algorithm works poorly to distinguish between very different trees. In fact, as shown in Proposition 2, if we take two trees with the same set of leaves appearing in random order, for c = 1 they will result in the same sketch with high probability. Next, we compute the upper bound p2 in the general case.

General case - We eliminate the assumption in the special case. In this case, all that we are given is that the distance between the two trees is at least 1 − ǫ. By our definition of tree distance, this means that for all fractional matchings fij, Σij fij sij ≤ ǫ. Thus, the maximum weight fractional matching between the two sets of subtrees is at most ǫ, where the weight is set to be the similarity measure. Now, we need to bound Σij θij sij^c.

To prove the existence of this bound, we represent the subtrees of height one of T1 and T2 as the left and right nodes of a weighted bipartite graph, respectively. The weights on the edges between the nodes denote the similarity between the sub-trees. This naturally leads to an optimization problem on a complete bipartite graph with a given maximum weight matching formulation.

Problem Definition 1. Given a complete bipartite graph G with edge weights wij and weights αi, βj on the left and right vertices, what is the maximum value of Σij θij wij^c, where θij = min(αi, βj), given

1. the weighted degree of any node in G is bounded by 1, i.e., Σi wij ≤ 1 and Σj wij ≤ 1,
2. Σi αi = 1 and Σj βj = 1,
3. the maximum weight fractional matching is at most ǫ, that is, ∀fij, if Σj fij ≤ αi and Σi fij ≤ βj then Σij fij wij ≤ ǫ.

The following lemma shows that for the above problem the maximum value of Σij θij wij^c is at most O(ǫ log(1/ǫ)).

Lemma 4. Given a complete weighted bipartite graph G = (V, E) on m left vertices and n right vertices with weight wij ∈ [0, 1] on each edge and weights αi, βj on the left and right vertices such that Σi αi = 1 and Σj βj = 1, and given that the maximum fractional weight matching of G is bounded by ǫ where ǫ > 0, then for any c ≥ 2, Σij θij wij^c ≤ O(ǫ log(1/ǫ)).

Proof. We will partition the edges into two sets: those with weights less than ǫ and those with weights at least ǫ. So, Σij θij wij^c = Σ_{ij: wij<ǫ} θij wij^c + Σ_{ij: wij≥ǫ} θij wij^c.

Part 1 - For the edges with wij < ǫ we have

Σij θij wij^c = Σij θij wij^{c−1} wij
 ≤ Σij θij wij ǫ^{c−1}
 ≤ ǫ^{c−1} Σij αi wij
 = ǫ^{c−1} Σi αi (Σj wij)
 ≤ ǫ^{c−1} Σi αi
 ≤ ǫ^{c−1}.

Part 2 - We now bound the contribution from the edges with weights in the range [ǫ, 1]. We round up all weights to the nearest power of 1/2 and group the edges by their weights. This gives us log(1/ǫ) groups, where all the edges in the k-th group have weight γk = (1/2)^k. We will show that the contribution from each group is O(ǫ).

Look at the sub-graph Gk obtained by considering all edges from the k-th group. Since the total weight at a node is at most 1 (in fact, it will be at most 2 after the rounding, but this does not affect our analysis), the degree of any node in this sub-graph is at most 1/γk. We know that such a sub-graph can be decomposed into at most 1/γk matchings. Therefore, it suffices to bound the contribution from each matching. For any matching M, Σ_{ij∈M} θij wij ≤ ǫ. To see this, we set fij = θij if (i, j) is in the matching and 0 otherwise. These values of fij are a valid fractional matching satisfying condition (3) in the problem definition. Summing over all the matchings in group Gk, we get Σ_{ij∈Gk} θij wij ≤ ǫ/γk. Now,

Σk Σ_{ij∈Gk} θij wij^c ≤ Σk Σ_{ij∈Gk} θij wij γk^{c−1}
 ≤ Σk Σ_{ij∈Gk} θij wij γk   (for c ≥ 2)
 ≤ Σk ǫ
 = ǫ log(1/ǫ)

Adding the bounds on the two parts completes the proof.

Observe that choosing any value of c > 2 does not improve the bound on p2; therefore c = 2 is a good choice for our algorithm. We summarize the result in Theorem 2 below. We are now ready to prove Lemma 3.

Proof of Lemma 3. Recall that the probability that the hash values of T1 and T2 match is at most Σij θij sij^c. We now map this to our problem on bipartite graphs by setting wij = sij. Condition (1) of our problem is satisfied by Lemma 1. Condition (2) follows from the definition of θij, and condition (3) follows from the fact that the similarity between the two trees is at most ǫ. So, Lemma 4 implies the bound on the probability.

4.3 Connections to Locality-Sensitive Hashing

Given that p1 = 1 − 4cδ and p2 = ǫ log(1/ǫ), for c = 2 the above probabilities give ρ = O(δ / log(1/ǫ)). From Theorem 1, we obtain the following result. Essentially, we have a sub-linear time algorithm to find the nearest neighbor in a setting where the nearest neighbor is much closer than all the other points.

Theorem 2. Given a database of N trees of height 2, we can construct a data structure of size N^{1+ρ} that can be used to compute an approximate nearest neighbor in time N^ρ, where ρ = O(δ / log(1/ǫ)). Given a query point whose nearest neighbor is within distance δ, our search algorithm will return a (1−ǫ)/δ-approximate nearest neighbor; essentially, our algorithm returns a neighbor of the query point within a distance at most 1 − ǫ.
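Theorem 2 follows by plugging the tree-hash into the standard LSH indexing construction of Theorem 1. The snippet below is only a schematic of that construction under assumed helper names (tree_hash_instance is a hypothetical factory returning an independently seeded tree-hash function such as Algorithm 1 or Algorithm 2): each of L tables is keyed by the concatenation of k independent sketches, and a query reports everything that collides with it in some table.

```python
import collections

def build_index(database, tree_hash_instance, k, L):
    """LSH index: L hash tables, each keyed by the concatenation of k independent sketches.
    `database` maps tree ids to trees; `tree_hash_instance(seed)` is assumed to return an
    independently seeded tree-hash function."""
    tables = []
    for table in range(L):
        hashers = [tree_hash_instance((table, i)) for i in range(k)]
        buckets = collections.defaultdict(list)
        for tree_id, tree in database.items():
            key = tuple(h(tree) for h in hashers)
            buckets[key].append(tree_id)
        tables.append((hashers, buckets))
    return tables

def query(index, tree):
    """Return candidate near neighbors: every tree colliding with the query in some table."""
    candidates = set()
    for hashers, buckets in index:
        key = tuple(h(tree) for h in hashers)
        candidates.update(buckets.get(key, []))
    return candidates

if __name__ == "__main__":
    import hashlib
    def tree_hash_instance(seed):
        # Toy stand-in: hashes the sorted leaf multiset; a real instance would use Algorithm 1 or 2.
        def h(tree):
            flat = ",".join(sorted(leaf for sub in tree for leaf in sub))
            return hashlib.sha256(f"{seed}|{flat}".encode()).hexdigest()[:8]
        return h
    db = {"t1": [["a", "b"], ["c", "d"]], "t2": [["e", "f"], ["g", "h"]]}
    index = build_index(db, tree_hash_instance, k=2, L=4)
    print(query(index, [["c", "d"], ["a", "b"]]))   # {'t1'}
```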
5 Sketching algorithms for trees of larger height

We continue the analysis for weighted trees of larger height with arbitrary shape, and present a generalization to compute the bounds for the matching probabilities between such trees.

We now generalize the algorithm for computing the tree-hash, TH(T^h), of a tree T^h of height h. If h = 1, then the tree-hash is equal to the min-hash of the leaf set. For h > 1, we consider c random instances of the tree-hash function with independent coin tosses. We use these functions to compute the compound tree-hash for every subtree at height h − 1. Thus, a compound tree-hash, CTH(.), is itself a concatenation of all the c tree-hash values obtained using the c functions. The tree-hash for a subtree at height h is computed from the n compound tree-hashes at level h − 1. In practice, to avoid min-hashes that are concatenations of many labels, we can simply use a hash value of the concatenation instead. However, for the analysis, we will conceptually use the concatenated value.

Remark 1. In the current definition of Algorithm 2, the number of min-hash computations at the leaf level becomes 2^h for c = 2 because of the recursion. This can be avoided by using the same set of c min-hashes at each level, resulting in 2h min-hash computations for c = 2. Under this assumption, the analysis of the lower bound can be easily extended, while the analysis of the upper bound remains an open question.

Algorithm 2 TH(T^h)
1: πr ← permutation used by the root node; h ← height(T^h); n ← degree(root(T^h))
2: if h > 1 then
3:   use c random instances of tree-hash functions, TH1, ..., THc, with independent coin tosses
4:   for all subtrees T_i^{h−1} at height h − 1 do
5:     CTH(T_i^{h−1}) ← TH1(T_i^{h−1}) • TH2(T_i^{h−1}) • ... • THc(T_i^{h−1})
6:   end for
7:   return MHπr{CTH(T_1^{h−1}), ..., CTH(T_n^{h−1})}
8: else if h = 1 then
9:   L ← leaves(T^h)
10:  return MHπr(L)
11: end if
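A recursive rendering of Algorithm 2 could look as follows (our own sketch; following Remark 1 it reuses the same c seeds at every level instead of drawing fresh instances inside each recursive call, and salted hashes stand in for the permutations).

```python
import hashlib

def hash01(x, salt):
    d = hashlib.sha256(f"{salt}|{x}".encode()).hexdigest()
    return int(d, 16) / 16**64

def min_hash(elements, salt):
    return min(elements, key=lambda x: hash01(x, salt))

def tree_hash(tree, height, instance=0, c=2):
    """Algorithm 2, TH(T^h). A tree of height 1 is a list of leaf labels; a tree of larger
    height is a list of subtrees. `instance` selects one of the c copies at this node."""
    if height == 1:
        return min_hash(tree, ("leaf", instance))
    # Compound tree-hash of each child: concatenation of the c tree-hash values.
    cths = ["\x1f".join(tree_hash(sub, height - 1, i, c) for i in range(c))
            for sub in tree]
    return min_hash(cths, ("root", height, instance))

if __name__ == "__main__":
    T1 = [[["a", "b"], ["c", "d"]], [["e", "f"], ["g", "h"]]]     # height 3
    T2 = [[["g", "h"], ["e", "f"]], [["c", "d"], ["a", "b"]]]     # same tree, reordered
    print(tree_hash(T1, 3) == tree_hash(T2, 3))                   # True
```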

Using the above recursive formulation for computing the tree-hash of a tree with height greater than 2, we compute the probability bounds p_{1,h} and p_{2,h}. We show inductively that the bounds hold for all levels in the tree. Since there is a super-exponential dependence on h, our analysis should be viewed in the context where h is a small constant.

Theorem 3 (Lower bound p1). Given two trees T1 and T2 of height h with distance δ and the probability that their tree-hashes match is p_{1,h}(δ), then for h ≥ 1 and c = 2,

p_{1,h}(δ) ≥ (1 − δ)^{2^{h−1}} / 2^{2^h − 1}    (1)

Proof. Let T1 = {(a1, α1), ..., (an, αn)} be a tree of height h, where ai is a subtree of height h − 1 with weight αi in T1, and similarly let T2 = {(b1, β1), ..., (bn, βn)} be a tree of height h with bi defined correspondingly. It follows from the definition of EMD(T1,T2) that there is a fractional matching fij so that the distance between T1 and T2 is equal to Σij fij d(ai, bj) ≤ δ. We want to lower bound Pr[TH(T1) = TH(T2)] given the distance bound d(T1,T2) ≤ δ. Let dij denote the distance between two sub-trees ai and bj. This gives the matching probability of the compound tree-hashes of ai and bj to be p_{1,h−1}(dij)^c. Let A and B denote the weighted sets of compound tree-hashes in the tree-hash computation of T1 and T2 respectively. We know that E[|A ∩ B|] ≥ Σij fij p_{1,h−1}(dij)^c; arguing as in the proof of Lemma 2, we have

p_{1,h}(δ) ≥ p_{1,h−1}(δ)² / 2 = (1 − δ)^{2^{h−1}} / 2^{2^h − 1}

Theorem 4 (Upper bound p2). Given two trees T1 and T2 of level h with similarity ǫ and the probability that their tree-hashes match is p_{2,h}, for h ≥ 2 and c = 2, there exist some constants c1 and c2 for which

p_{2,h} ≤ c1 ǫ log(1/ǫ) c2^{3h} (log(n/ǫ))^{2^{h−1} − 2}

Proof. Let A and B denote the set of compound tree-hashes for the child subtrees of T1 and T2 respectively. The probability that the sketches match is E[|A∩B| / |A∪B|] ≤ E[|A∩B|]. Let pij = Pr[TH(T_i^h) = TH(T_j^h)], i.e., the probability that the tree-hashes of two given subtrees match. Now, given pij, we want to bound p_{2,h} = Pr[TH(T_1^h) = TH(T_2^h)] recursively. Similar to two-level trees, p_{2,h} = E[|A ∩ B| / |A ∪ B|] ≤ E[|A ∩ B|]. Thus, we have

E[|A ∩ B|] ≤ Σij θij Pr[CTH(Ai) = CTH(Bj)] = Σij θij (Pr[TH(Ai) = TH(Bj)])² ≤ Σij θij pij²

Therefore, by induction,

E[|A ∩ B|] ≤ Σij θij (c1 sij log(1/sij) c2^{3h} (log(n/sij))^{2^{h−1} − 2})².

By concavity and Jensen's inequality, we have

E[|A ∩ B|] ≤ Σij θij c1² R² log²(1/R) c2^{6h} (log(n/R))^{2^h − 4},

where R = Σij θij sij / Σij θij. Viewing sij as weights in a weighted bipartite graph which satisfy the conditions of Lemma 4, we get

Σij θij sij² ≤ O(ǫ log(1/ǫ)),

since ǫ is the normalized maximum weight matching. Using c = 2 and choosing the constants so that c2 > c1 and c2 > 4, the remaining factors for small h can be absorbed into the bound. For the specified values of h, c2, and c1, we have

p_{2,h+1} ≤ c1 ǫ log(1/ǫ) c2^{3(h+1)} (log(n/ǫ))^{2^h − 2}

Thus, we have shown inductively that the bound for p_{2,h+1} holds given the bound for p_{2,h}.

Given the upper and lower bounds for the matching probabilities, we have, for trees of small height, ρ = log(1/p_{1,h}) / log(1/p_{2,h}). Clearly, for small values of h, p_{2,h} is dominated by O(ǫ log(1/ǫ)). This gives us a value of ρ = O(log(1/(1−δ)) / log(1/ǫ)), which gives us the following result as a consequence of Theorem 1.

Theorem 5. Given a database of N trees, we can construct a data structure of size N^{1+ρ} that can be used to compute an approximate nearest neighbor in time N^ρ, where ρ = O(log(1/(1−δ)) / log(1/ǫ)). Given a query point whose nearest neighbor is within distance δ, our search algorithm will return a (1−ǫ)/δ-approximate nearest neighbor; essentially, our algorithm returns a neighbor of the query point within a distance at most 1 − ǫ.

6 Conclusions

We analyze sketching algorithms to compute similarities between trees. Specifically, we study our algorithms using the framework of locality-sensitive hashing introduced in [8]. This allows us to find an approximate nearest neighbor among trees in time N^ρ, where N is the size of the database and δ and 1 − ǫ are well-separated lower and upper bounds on the distance in our algorithms.

References

[1] N. Augsten, M. Bohlen, and J. Gamper. Approximate matching of hierarchical data using pq-grams. In Proc. of the 31st VLDB Conference, pages 301–312, 2005.

[2] Andrei Z. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences, 1998.

[3] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.

[4] Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual ACM Symposium on Theory of Computing, pages 380–388, 2002.

[5] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 493–504, 1996.

[6] W. Chen. New algorithms for ordered tree-to-tree correction problem. Journal of Algorithms, 40(2):135–158, August 2001.

[7] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. of the 20th ACM Symposium on Computational Geometry, pages 253–262, 2004.

[8] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. of 25th International Conference on Very Large Data Bases, VLDB, pages 518–529, 1999.

[9] S. Gollapudi and R. Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In Proc. of 13th Conference on Information and Knowledge Management, 2006.

[10] Taher H. Haveliwala, Aristides Gionis, and Piotr Indyk. Scalable techniques for clustering the web. In WebDB (Informal Proceedings), pages 129–134, 2000.

[11] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of 30th ACM Symposium on Theory of Computing (STOC), pages 604–613, 1998.

[12] T. Jiang, L. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. In Proc. Intl Conference on Combinatorial Pattern Matching, pages 75–86, 1994.

[13] K. Kailing, H.-P. Kriegel, S. Schonauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc. 9th Intl Conference on Extending Database Technology, pages 676–693, 2004.

[14] T. Margush and F. R. McMorris. Consensus n-trees. Bulletin of Mathematical Biology, 3:239–244, 1981.

[15] Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, pages 218–224, 2005.

[16] Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proc. of 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pages 1186–1195, 2006.

[17] K. Zhang. A constrained editing distance between unordered labeled trees. Algorithmica, 15:205–222, 1996.

[18] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989.

[19] Li Zhang. On matching nodes between trees. Technical Report HPL-2003-67, HP Laboratories, Palo Alto, CA, April 2003.
