Multi-User File System Search

by Stefan Büttcher

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

Waterloo, Ontario, Canada, 2007

© Stefan Büttcher, 2007

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

(Stefan Büttcher)

Abstract

Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when generating the results to a search query. The goal of this thesis is to close the gap between information retrieval research and the requirements exacted by these real-life applications.

The algorithms and data structures presented in this thesis can be used to implement a file system search engine that is able to react to changes in the file system by updating its index data in real time. File changes (insertions, deletions, or modifications) are reflected by the search results within a few seconds, even under a very high system workload. The search engine exhibits a low main memory consumption. By integrating security restrictions into the query processing logic, as opposed to applying them in a postprocessing step, it produces search results that are guaranteed to be consistent with the access permissions defined by the file system.

The techniques proposed in this thesis are evaluated theoretically, based on a Zipfian model of term distribution, and through a large number of experiments, involving text collections of non-trivial size, varying between a few gigabytes and a few hundred gigabytes.
Acknowledgements

First and foremost, I would like to thank my supervisor, Charlie Clarke, who did a tremendous job over the past few years, offering guidance whenever I needed it, providing advice whenever I sought it, and allowing me the academic freedom to pursue the research that I was most interested in. I owe you big-time.

Thanks to the members of the IR/PLG lab, who created a most enjoyable working atmosphere. In particular, I would like to thank Richard Bilson, Ashif Harji, Maheedhar Kolla, Roy Krischer, Ian MacKinnon, Brad Lushman, Thomas Lynam, and Peter Yeung.

Thanks to Gordon Cormack, who taught me by example that it is possible to have a successful career without ever growing up.

Thanks also to the School of Computer Science for supporting me through the David R. Cheriton graduate student scholarship, and to Microsoft for supplying us with little toys that ensured a certain level of distraction at all times.

Finally, thanks to the members of my committee, Charlie Clarke, Gordon Cormack, Alistair Moffat, Frank Tompa, and Olga Vechtomova, who provided valuable feedback that greatly helped improve this thesis.

Bibliographical Notes

Preliminary versions of the material presented in some parts of this thesis have appeared in the following publications:

• Stefan Büttcher and Charles L. A. Clarke. Indexing time vs. query-time trade-offs in dynamic information retrieval systems. In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005). Bremen, Germany, November 2005. (Chapter 6)

• Stefan Büttcher and Charles L. A. Clarke. A security model for full-text file system search in multi-user environments. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST 2005). San Francisco, USA, December 2005. (Chapter 5)

• Stefan Büttcher and Charles L. A. Clarke. A hybrid approach to index maintenance in dynamic text retrieval systems. In Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006). London, UK, April 2006. (Chapter 6)

• Stefan Büttcher and Charles L. A. Clarke. Adding full-text file system search to Linux. ;login: The USENIX Magazine, 31(3):28–33. Berkeley, USA, June 2006. (Chapter 7)

• Stefan Büttcher, Charles L. A. Clarke, and Brad Lushman. Hybrid index maintenance for growing text collections. In Proceedings of the 29th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006). Seattle, USA, August 2006. (Chapter 6)

I would like to thank my co-authors for their assistance and the anonymous reviewers for their helpful feedback on the issues discussed in these papers.

to my parents

Contents

1 Introduction
    1.1 What is File System Search?
    1.2 Thesis Organization

2 Background and Related Work
    2.1 Traditional File System Search in UNIX
    2.2 Search Queries
        2.2.1 Boolean Queries
        2.2.2 Ranked Queries
        2.2.3 Boolean vs. Ranked Retrieval in File System Search
        2.2.4 Structural Constraints
    2.3 Index Data Structures
    2.4 Inverted Indices
        2.4.1 Dictionary
        2.4.2 Posting Lists
        2.4.3 Index Construction
        2.4.4 Index Compression
    2.5 Index Maintenance
        2.5.1 Incremental Index Updates
        2.5.2 Document Deletions
        2.5.3 Document Modifications
    2.6 Applications

3 Evaluation Methodology
    3.1 Data Sets
    3.2 Hardware Configuration & Performance Measurements

4 Index Structures
    4.1 General Considerations
    4.2 Statistical Properties of Inverted Files
        4.2.1 Generalized Zipfian Distributions
        4.2.2 Long Lists and Short Lists
        4.2.3 Notation
    4.3 Index Construction
        4.3.1 In-Memory Index Construction: Extensible Posting Lists
        4.3.2 Merge-Based Index Construction: The Final Merge Operation
        4.3.3 Performance Baseline
    4.4 Memory Requirements
        4.4.1 Dictionary Compression
        4.4.2 Interleaving Posting Lists and Dictionary Entries
    4.5 Document-Centric vs. Schema-Independent Inverted Files
        4.5.1 Query Flexibility
        4.5.2 Query Processing Performance
    4.6 Summary

5 Secure File System Search
    5.1 The UNIX Security Model
        5.1.1 File Permissions
        5.1.2 Traditional File System Search in UNIX
    5.2 A File System Search Security Model
    5.3 Enforcing Security by Postprocessing
        5.3.1 Exploiting Relevance Scores
        5.3.2 Exploiting Ranking Results
        5.3.3 Exploiting Support for Structural Constraints
    5.4 Integrating Security Restrictions into the Query Processor
    5.5 Performance Evaluation
    5.6 Query Optimization
    5.7 Discussion

6 Real-Time Index Updates
    6.1 General Considerations and Terminology
    6.2 Merge-Based Update
        6.2.1 Index Partitioning
    6.3 In-Place Update
        6.3.1 File-System-Based Update
        6.3.2 Partial Flushing
        6.3.3 In-Place vs. Merge-Based Update
    6.4 Hybrid Index Maintenance
        6.4.1 Hybrid Index Maintenance with Contiguous Posting Lists
        6.4.2 Hybrid Index Maintenance with Non-Contiguous Posting Lists
        6.4.3 Implementing the In-Place Index
        6.4.4 Complexity Analysis
        6.4.5 Experiments
        6.4.6 Hybrid vs. Non-Hybrid Index Maintenance
    6.5 File Deletions
        6.5.1 Utilizing Security Restrictions to Delay Index Updates
        6.5.2 Collecting Garbage Postings
        6.5.3 Performance Trade-offs and Garbage Collection Policies
        6.5.4 Experiments
    6.6 File Modifications
        6.6.1 Append Indexing Strategies
        6.6.2 Experiments
    6.7 Meeting Real-Time Requirements
    6.8 Discussion

7 Interaction with the Operating System
    7.1 File System Notification
        7.1.1 Essential File System Events
        7.1.2 Event Notification in Linux: dnotify and inotify
        7.1.3 An Alternative to inotify: fschange
    7.2 Dealing with Different File Types
    7.3 Temporal Locality in File Systems
    7.4 Scheduling Index Updates

8 Conclusion

List of Tables

2.1 Space requirements: Dictionary vs. posting lists
2.2 Dictionary compression: Impact on dictionary size and lookup time
3.1 Overview of the collections used in the experiments
3.2 Overview of the SBFS collection
4.1 Locality axes in file system search
4.2 Predicting the vocabulary size via Zipf's law
4.3 Comparing realloc and grouping for in-memory posting lists
4.4 Merging compressed indices without decompressing posting lists
4.5 Index construction performance baseline
4.6 Compressing the in-memory dictionary
4.7 Memory requirements of incomplete dictionaries
4.8 Query processing performance baseline
6.1 Number of tokens seen vs. number of unique terms seen
6.2 Impact of memory resources on update performance
6.3 Effect of partial flushing for different threshold values
6.4 Impact of garbage collection on merge performance
