A Security Model for Full-Text File System Search in Multi-User Environments

Stefan Büttcher and Charles L. A. Clarke
School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada

USENIX Association, FAST '05: 4th USENIX Conference on File and Storage Technologies, p. 169

Abstract

Most desktop search systems maintain per-user indices to keep track of file contents. In a multi-user environment, this is not a viable solution, because the same file has to be indexed many times, once for every user that may access the file, causing both space and performance problems. Having a single system-wide index for all users, on the other hand, allows for efficient indexing but requires special security mechanisms to guarantee that the search results do not violate any file permissions.

We present a security model for full-text file system search, based on the UNIX security model, and discuss two possible implementations of the model. We show that the first implementation, based on a postprocessing approach, allows an arbitrary user to obtain information about the content of files for which he does not have read permission. The second implementation does not share this problem. We give an experimental performance evaluation for both implementations and point out query optimization opportunities for the second one.

1 Introduction and Overview

With the advent of desktop and file system search tools by Google, Microsoft, Apple, and others, efficient file system search is becoming an integral component of future operating systems. These search systems are able to deliver the response to a search query within a fraction of a second because they index the file system ahead of time and keep an index that, for every term that appears in the file system, contains a list of all files in which the term occurs and the exact positions within those files (called the term's posting list).

While indexing the file system has the obvious advantage that queries can be answered much faster from the index than by an exhaustive disk scan, it also has the obvious disadvantage that a full-text index requires significant disk space, sometimes more than what is available. Therefore, it is important to keep the disk space consumption of the indexing system as low as possible. In particular, for a computer system with many users, it is infeasible to have an individual index for every user in the system. In a typical UNIX environment, for example, it is not unusual that about half of the file system is readable by all users in the system. In such a situation, even a single chmod operation, making a previously private file readable by everybody, would trigger a large number of index update operations if per-user indices were used. Similarly, due to the lack of information sharing among the individual per-user indices, multiple copies of the index information about the same file would need to be stored on disk, leading to a disk space consumption that could easily exceed that of the original file.

We investigated different desktop search tools, by Google¹, Microsoft², Apple³, Yahoo⁴, and Copernic⁵, and found that all but Apple's Spotlight maintain a separate index for every user (Google's search tool uses a system-wide index, but this index may only be accessed by users with administrator rights, which makes the software unusable in multi-user environments). While this is an unsatisfactory solution because of the increased disk space consumption, it is very secure because all file access permissions are automatically respected. Since the indexing process has the same privileges as the user that it belongs to, security restrictions cannot be violated, and the index accurately resembles the user's view of the file system.

If a single system-wide index is used instead, this index contains information about all files in the file system. Thus, whenever the search system processes a search query, care has to be taken that the results are consistent with the user's view of the file system. A search result is obviously inconsistent with the user's view of the file system if it contains files for which the user does not have read permission. However, there are more subtle cases of inconsistency. In general, we say that the result to a search query is inconsistent with the user's view of the file system if some aspect of it (e.g., the order in which matching files are returned) depends on the content of files that cannot be read by the user. Examples of such inconsistencies are discussed in section 5.

An obvious way to address the consistency problem is the postprocessing approach: The same, system-wide index is used for all users, and every query is processed in the same way, regardless of which user submitted the query; after the query processor has computed the list of files matching the query, all file permissions are checked, and files that may not be searched by the user are removed from the final result. This approach, which is used by Apple's Spotlight search system (see the Apple Spotlight technology brief⁶ for details), works well for Boolean queries. However, pure Boolean queries are not always appropriate. If the number of files in a file system is large, the search system has to do some sort of relevance ranking in order to present the most likely relevant files first and help the user find the information he is looking for faster. Usually, a TF/IDF-based (term frequency / inverse document frequency) algorithm is used to perform this relevance ranking.

In this paper, we present a full-text search security model. We show that, if a TF/IDF-style ranking algorithm is used by the search system, an implementation of the security model must not follow the postprocessing approach. If it does, it produces search results that are inconsistent with the user's view of the file system. The inconsistencies can be exploited by the user in a systematical way and allow him to obtain information about the content of files which he is not allowed to search. While we do not know the exact ranking algorithm employed by Apple's Spotlight, we conjecture that it is at least in parts based on the TF/IDF paradigm (as TF/IDF-based algorithms are the most popular ranking techniques in information retrieval systems) and therefore amenable to the attacks described in this paper.

After discussing possible attacks on the postprocessing approach, we present a second approach to the inconsistency problem which guarantees that all search results are consistent with the user's view of the file system and which therefore does not allow a user to infer anything about the content of files which he may not search. This safe implementation of the file system search security model is part of the Wumpus⁷ file system search engine. The system is freely available under the terms of the GNU General Public License.

In the next two sections, we give a brief overview of previous work on security issues in multi-user environments (section 2) and an introduction to basic information [...] implementation of the security model are based.

In section 4, we present a general file system search security model and define what it means for a file to be searchable by a user. Section 5 discusses the first implementation of the security model, based on the postprocessing approach described above. We show how this implementation can be exploited in order to obtain the total number of files in the file system containing a certain term. This is done by systematically creating and deleting files, submitting search queries to the search system, and looking at either the relevance scores or the relative ranks of the files returned by the search engine.

In section 6, we present a second implementation of the security model. This implementation is immune against the attacks described in section 5. Its performance is evaluated experimentally in section 7 and compared to the performance of the postprocessing approach. Opportunities for query optimization are discussed in section 8, where we show that making an almost non-restrictive assumption about the independence of different files allows us to virtually nullify the overhead of the security mechanisms in the search system.

2 Related Work

While some research has been done in the area of high-performance dynamic indexing [BCC94] [LZW04], which is also very important for file system search, the security problems associated with full-text search in a multi-user environment have not yet been studied.

In his report on the major decisions in the design of Microsoft's Tripoli search engine, Peltonen [Pel97] demands that "full text indexing must never compromise operating or file system security". However, after this initial claim, the topic is not mentioned again in his paper. Turtle and Flood [TF95] touch the topic of text retrieval in multi-user environments, but only mention the special memory requirements, not the security requirements.

Griffiths and Wade [GW76] and Fagin [Fag78] were among the first who investigated security mechanisms and access control in relational database systems (System R). Both papers study discretionary access control with ownership-based administration, in some sense similar to the UNIX file system security model [RT74] [Rit78]. However, their work goes far beyond UNIX in some aspects. For example, in their model it is possible that a user grants the right to grant rights for file (table) access to other users, which is impossible in UNIX.
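
To make the postprocessing approach from the introduction concrete, the following is a minimal sketch, not the implementation evaluated in this paper. It builds a toy system-wide index of posting lists, ranks matches with a simple TF/IDF-style score computed over all files, and only afterwards removes files the user may not read. The class `Index`, the per-file `readers` set (a stand-in for UNIX permissions), the scoring formula, and all file paths are assumptions made for illustration.

```python
import math
from collections import defaultdict


class Index:
    """Toy system-wide inverted index: term -> {path: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        # Hypothetical permission model: path -> set of users allowed to read it.
        self.readers = {}

    def add_file(self, path, text, readers):
        self.readers[path] = set(readers)
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(path, []).append(pos)

    def search_postprocess(self, user, term):
        """Rank over ALL files first, then drop files the user cannot read."""
        plist = self.postings.get(term, {})
        if not plist:
            return []
        n_files = len(self.readers)
        # The document frequency counts every file containing the term,
        # including files that are unreadable for the querying user.
        idf = math.log(1 + n_files / len(plist))
        ranked = sorted(
            ((len(positions) * idf, path) for path, positions in plist.items()),
            reverse=True,
        )
        # Postprocessing step: the permission check happens only here.
        return [(path, score) for score, path in ranked if user in self.readers[path]]


def visible_score(hidden_text):
    """Results seen by alice for 'budget', given the text of a file only bob can read."""
    idx = Index()
    idx.add_file("/home/alice/notes.txt", "budget meeting budget", {"alice"})
    idx.add_file("/home/alice/todo.txt", "buy milk", {"alice"})
    idx.add_file("/home/bob/secret.txt", hidden_text, {"bob"})
    return idx.search_postprocess("alice", "budget")
```

Calling `visible_score("budget cuts")` and `visible_score("holiday plans")` returns the same single file for alice in both cases, but with different scores: the score of her own file depends on whether bob's unreadable file contains the query term. This is the kind of inconsistency with the user's view of the file system that the introduction describes.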
