In Defense of MinHash Over SimHash


Anshumali Shrivastava
Department of Computer Science
Computing and Information Science
Cornell University, Ithaca, NY, USA

Ping Li
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University, Piscataway, NJ, USA

Abstract

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search.

The collision probability of MinHash is a function of resemblance similarity ($R$), while the collision probability of SimHash is a function of cosine similarity ($S$). To provide a common basis for comparison, we evaluate retrieval results in terms of $S$ for both MinHash and SimHash. This evaluation is valid because we can prove that MinHash is a valid LSH with respect to $S$, using the general inequality $S^2 \le R \le \frac{S}{2-S}$. Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high similarity region.

Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even on datasets where most of the data points are not too similar to each other. This is partly because, in practical data, $R \ge \frac{S}{z-S}$ often holds with $z$ only slightly larger than 2 (e.g., $z \le 2.1$). Our restricted worst-case analysis, which assumes $\frac{S}{z-S} \le R \le \frac{S}{2-S}$, shows that MinHash indeed significantly outperforms SimHash even in the low similarity region.

We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.

1 Introduction

The advent of the Internet has led to the generation of massive and inherently high-dimensional data. In many industrial applications, the size of the datasets has long exceeded the memory capacity of a single machine. In web domains, it is not difficult to find datasets with the number of instances and the number of dimensions going into billions [1, 6, 28].

The reality that web data are typically sparse and high-dimensional is due to the wide adoption of the "Bag of Words" (BoW) representation for documents and images. In BoW representations, it is known that word frequency within a document follows a power law. Most words occur rarely in a document, and most of the higher-order shingles in a document occur only once. It is often the case that just the presence or absence information suffices in practice [7, 14, 17, 23]. Leading search companies routinely use sparse binary representations in their large data systems [6].

Locality sensitive hashing (LSH) [16] is a general framework of indexing techniques devised for efficiently solving the approximate near neighbor search problem [11]. The performance of LSH largely depends on the particular underlying hashing method. Two popular hashing algorithms are MinHash [3] and SimHash (sign normal random projections) [8]. MinHash is an LSH for resemblance similarity, which is defined over binary vectors, while SimHash is an LSH for cosine similarity, which works for general real-valued data. With the abundance of binary data over the web, a practically important question arises: which LSH should be preferred for binary data? This question has not been adequately answered in the existing literature.
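For concreteness, here is a minimal sketch (ours, not from the paper) of the two hashing schemes in Python; the function names, the permutation-based MinHash implementation, and all parameters are illustrative choices. A single MinHash collides with probability equal to the resemblance $R$ [3]; a single SimHash bit collides with probability $1 - \theta/\pi$, where $\cos\theta = S$ [8].

```python
import numpy as np

def minhash_signature(nonzeros, num_hashes, D, seed=0):
    """MinHash of a set given as its nonzero locations (assumed nonempty).
    For each random permutation of {0, ..., D-1}, store the smallest
    permuted index in the set. With a shared seed, two signatures agree
    in any coordinate with probability R(W1, W2)."""
    rng = np.random.default_rng(seed)
    idx = np.fromiter(nonzeros, dtype=np.int64)
    sig = np.empty(num_hashes, dtype=np.int64)
    for i in range(num_hashes):
        perm = rng.permutation(D)
        sig[i] = perm[idx].min()
    return sig

def simhash_signature(x, num_hashes, seed=0):
    """SimHash (sign normal random projections) of a dense vector x.
    With a shared seed, two bit vectors agree in any coordinate with
    probability 1 - theta(x, y)/pi, a monotone function of S."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((num_hashes, len(x)))
    return (proj @ x >= 0).astype(np.int8)
```

The fraction of agreeing coordinates between two signatures then estimates $R$ (MinHash) or $1 - \theta/\pi$ (SimHash); an LSH index banks groups of these coordinates into hash tables.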
There were prior attempts to address this problem from various aspects. For example, the paper on Conditional Random Sampling (CRS) [19] showed that random projections can be very inaccurate, especially on binary data, for the task of inner product estimation (which is not the same as near neighbor search). A more recent paper [26] empirically demonstrated that b-bit minwise hashing [22] outperformed SimHash and spectral hashing [30].

(Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.)

Our contribution: Our paper provides an essentially conclusive answer that MinHash should be used for near neighbor search in binary data, both theoretically and empirically. To favor SimHash, our theoretical analysis and experiments evaluate the retrieval results of MinHash in terms of cosine similarity (instead of resemblance). This is possible because we are able to show that MinHash is an LSH for cosine similarity, by establishing an inequality which bounds resemblance purely by functions of cosine.

Figure 1: Upper bound $\frac{S}{2-S}$ (in red) and lower bound $S^2$ (in blue) from Theorem 1, which overlap in the high similarity region.

Because we evaluate MinHash (which was designed for resemblance) in terms of cosine, we will first illustrate the close connection between these two similarities.

While the high similarity region is often of interest, we must also handle data in the low similarity region, because in a realistic dataset the majority of the pairs are usually not similar. Interestingly, we observe that for the six datasets in Table 1, we often have $R = \frac{S}{z-S}$ with $z$ only slightly larger than 2; see Figure 2.

Table 1: Datasets

Dataset     # Query    # Train     # Dim
MNIST       10,000     60,000      784
NEWS20      2,000      18,000      1,355,191
NYTIMES     5,000      100,000     102,660
RCV1        5,000      100,000     47,236
URL         5,000      90,000      3,231,958
WEBSPAM     5,000      100,000     16,609,143

2 Cosine Versus Resemblance

We focus on binary data, which can be viewed as sets (locations of nonzeros). Consider two sets $W_1, W_2 \subseteq \Omega = \{1, 2, \ldots, D\}$. The cosine similarity ($S$) is

$$S = \frac{a}{\sqrt{f_1 f_2}}, \quad \text{where} \qquad (1)$$

$$f_1 = |W_1|, \quad f_2 = |W_2|, \quad a = |W_1 \cap W_2| \qquad (2)$$

The resemblance similarity, denoted by $R$, is

$$R = R(W_1, W_2) = \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|} = \frac{a}{f_1 + f_2 - a} \qquad (3)$$

Clearly these two similarities are closely related. To better illustrate the connection, we re-write $R$ as

$$R = \frac{a/\sqrt{f_1 f_2}}{\sqrt{f_1/f_2} + \sqrt{f_2/f_1} - a/\sqrt{f_1 f_2}} = \frac{S}{z - S} \qquad (4)$$

$$z = z(r) = \sqrt{r} + \frac{1}{\sqrt{r}} \ge 2 \qquad (5)$$

$$r = \frac{f_2}{f_1} = \frac{f_1 f_2}{f_1^2} \le \frac{f_1 f_2}{a^2} = \frac{1}{S^2} \qquad (6)$$

There are two degrees of freedom, $f_2/f_1$ and $a/f_2$, which makes the relationship inconvenient to analyze directly. Fortunately, in Theorem 1, we can bound $R$ purely by functions of $S$.

Theorem 1

$$S^2 \le R \le \frac{S}{2 - S} \qquad (7)$$

Tightness: Without making assumptions on the data, neither the lower bound $S^2$ nor the upper bound $\frac{S}{2-S}$ can be improved in the domain of continuous functions.

Data dependent bound: If the data satisfy $z \le z^*$, where $z$ is defined in (5), then

$$\frac{S}{z^* - S} \le R \le \frac{S}{2 - S} \qquad (8)$$

Proof: See Appendix A.
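As a quick numerical sanity check of (7) (our own sketch, not part of the paper), the following code computes $S$ and $R$ directly from Eqs. (1)-(3) for random pairs of sets and asserts the bounds:

```python
import numpy as np

def cosine_and_resemblance(W1, W2):
    """S and R for two sets of nonzero locations, per Eqs. (1)-(3)."""
    W1, W2 = set(W1), set(W2)
    a = len(W1 & W2)              # a  = |W1 ∩ W2|
    f1, f2 = len(W1), len(W2)     # f1 = |W1|, f2 = |W2|
    S = a / np.sqrt(f1 * f2)      # Eq. (1)
    R = a / (f1 + f2 - a)         # Eq. (3)
    return S, R

rng = np.random.default_rng(42)
for _ in range(10_000):
    W1 = rng.choice(1000, size=rng.integers(1, 200), replace=False)
    W2 = rng.choice(1000, size=rng.integers(1, 200), replace=False)
    S, R = cosine_and_resemblance(W1, W2)
    assert S**2 - 1e-12 <= R <= S / (2 - S) + 1e-12  # Theorem 1, Eq. (7)
```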
Figure 1 illustrates that in the high similarity region, the upper and lower bounds essentially overlap. Note that, in order to obtain $S \approx 1$, we need $f_1 \approx f_2$ (i.e., $z \approx 2$).

Figure 2: Frequencies of the $z$ values for all six datasets in Table 1 (MNIST, NEWS20, NYTIMES, RCV1, URL, WEBSPAM), where $z$ is defined in (5). We compute $z$ for every query-train pair of data points.

For each dataset, we compute both cosine and resemblance for every query-train pair (e.g., $10{,}000 \times 60{,}000$ pairs for the MNIST dataset). For each query point, we rank its similarities to all training points in descending order. We examine the top-1000 locations as in Figure 3. In the left panels, for every top location, we plot the median (among all query points) of the similarities, separately for cosine (dashed) and resemblance (solid), together with the lower and upper bounds of $R$ (dot-dashed). We can see that for NEWS20, NYTIMES, and RCV1, the data are not too similar. Interestingly, for all six datasets, $R$ matches fairly well with the upper bound $\frac{S}{2-S}$. In other words, the lower bound $S^2$ can be very conservative even in the low similarity region.
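Since $z$ depends only on the nonzero counts $f_1$ and $f_2$ (Eq. (5)), a Figure 2 style histogram needs nothing but the row sums of the binary data matrix. Below is a minimal sketch with hypothetical nonzero counts standing in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
f_query = rng.integers(20, 200, size=1_000)   # hypothetical # nonzeros per query point
f_train = rng.integers(20, 200, size=5_000)   # hypothetical # nonzeros per train point

# r = f2/f1 and z = sqrt(r) + 1/sqrt(r) for every query-train pair, Eq. (5).
r = f_train[None, :] / f_query[:, None]
z = np.sqrt(r) + 1.0 / np.sqrt(r)             # z >= 2, with equality iff f1 == f2
counts, edges = np.histogram(z, bins=100)
```

On the paper's six datasets, the analogous histograms concentrate only slightly above 2 (Figure 2), which is what makes the restricted bound (8) with $z^* \approx 2.1$ applicable in practice.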