Lucene in Action Second Edition

Covers Apache Lucene 3.0 IN ACTION SECOND EDITION Michael McCandless Erik Hatcher , Otis Gospodnetic FOREWORD BY DOUG CUTTING MANNING www.it-ebooks.info Praise for the First Edition This is definitely the book to have if you’re planning on using Lucene in your application, or are interested in what Lucene can do for you. —JavaLobby Search powers the information age. This book is a gateway to this invaluable resource...It suc- ceeds admirably in elucidating the application programming interface (API), with many code examples and cogent explanations, opening the door to a fine tool. —Computing Reviews A must-read for anyone who wants to learn about Lucene or is even considering embedding search into their applications or just wants to learn about information retrieval in general. Highly recommended! —TheServerSide.com Well thought-out...thoroughly edited...stands out clearly from the crowd....I enjoyed reading this book. If you have any text-searching needs, this book will be more than sufficient equipment to guide you to successful completion. Even, if you are just looking to download a pre-written search engine, then this book will provide a good background to the nature of information retrieval in general and text indexing and searching specifically. —Slashdot.org The book is more like a crystal ball than ink on pape--I run into solutions to my most pressing problems as I read through it. —Arman Anwar, Arman@Web Provides a detailed blueprint for using and customizing Lucene...a thorough introduction to the inner workings of what’s arguably the most popular open source search engine...loaded with code examples and emphasizes a hands-on approach to learning. —SearchEngineWatch.com Hatcher and Gospodnetic´ bring their experience as two of Lucene’s core committers to author this excellently written book. This book helps any developer not familiar with Lucene or development of a search engine to get up to speed within minutes on the project and domain....I would recom- mend this book to anyone who is new to Lucene, anyone who needs powerful indexing and searching capabilities in their application, or anyone who needs a great reference for Lucene. —Fort Worth Java Users Group www.it-ebooks.info Licensed to theresa smith <[email protected]> More Praise for the First Edition Outstanding...comprehensive and up-to-date ...grab this book and learn how to leverage Lucene’s potential. —Val’s blog ...the code examples are useful and reusable. —Scott Ganyo, Lucene Java Committer ...packed with examples and advice on how to effectively use this incredibly powerful tool. —Brian Goetz, Quiotix Corporation ...it unlocked for me the amazing power of Lucene. —Reece Wilton, Walt Disney Internet Group ...code samples as JUnit test cases are incredibly helpful. —Norman Richards, co-author XDoclet in Action A quick and easy guide to making Lucene work. —Books-On-Line A comprehensive guide...The authors of this book are experts in this field...they have unleashed the power of Lucene ...the best guide to Lucene available so far. —JavaReference.com www.it-ebooks.info Licensed to theresa smith <[email protected]> Lucene in Action Second Edition MICHAEL MCCANDLESS ERIK HATCHER OTIS GOSPODNETIĆ MANNING Greenwich (74° w. long.) www.it-ebooks.info Licensed to theresa smith <[email protected]> For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department Manning Publications Co. 180 Broad St. Suite 1323 Stamford, CT 06901 Email: [email protected] ©2010 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps. Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine Manning Publications Co. Development editor: Sebastian Stirling 180 Broad St. Copyeditor: Liz Welch Suite 1323 Typesetter: Dottie Marsico Stamford, CT 06901 Cover designer: Marija Tudor ISBN 978-1-933988-17-7 Printed in the United States of America 12345678910–MAL–151413121110 www.it-ebooks.info Licensed to theresa smith <[email protected]> brief contents PART 1CORE LUCENE ............................................................. 1 1 ■ Meet Lucene 3 2 ■ Building a search index 31 3 ■ Adding search to your application 74 4 ■ Lucene’s analysis process 110 5 ■ Advanced search techniques 152 6 ■ Extending search 204 PART 2APPLIED LUCENE...................................................... 233 7 ■ Extracting text with Tika 235 8 ■ Essential Lucene extensions 255 9 ■ Further Lucene extensions 288 10 ■ Using Lucene from other programming languages 325 11 ■ Lucene administration and performance tuning 345 PART 3CASE STUDIES........................................................... 381 12 ■ Case study 1: Krugle 383 13 ■ Case study 2: SIREn 394 14 ■ Case study 3: LinkedIn 409 v www.it-ebooks.info Licensed to theresa smith <[email protected]> www.it-ebooks.info Licensed to theresa smith <[email protected]> contents foreword xvii preface xix preface to the first edition xx acknowledgments xxiii about this book xxvi JUnit primer xxxiv about the authors xxxvii PART 1CORE LUCENE......................................................1 Meet Lucene 3 1 1.1 Dealing with information explosion 4 1.2 What is Lucene? 6 What Lucene can do 7 ■ History of Lucene 7 1.3 Lucene and the components of a search application 9 Components for indexing 11 ■ Components for searching 14 The rest of the search application 16 ■ Where Lucene fits into your application 18 1.4 Lucene in action: a sample application 19 Creating an index 19 ■ Searching an index 23 1.5 Understanding the core indexing classes 25 IndexWriter 26 ■ Directory 26 ■ Analyzer 26 Document 27 ■ Field 27 vii www.it-ebooks.info Licensed to theresa smith <[email protected]> viii CONTENTS 1.6 Understanding the core searching classes 28 IndexSearcher 28 ■ Term 28 ■ Query 29 ■ TermQuery 29 TopDocs 29 1.7 Summary 29 Building a search index 31 2 2.1 How Lucene models content 32 Documents and fields 32 ■ Flexible schema 33 Denormalization 34 2.2 Understanding the indexing process 34 Extracting text and creating the document 34 Analysis 35 ■ Adding to the index 35 2.3 Basic index operations 36 Adding documents to an index 37 ■ Deleting documents from an index 39 ■ Updating documents in the index 41 2.4 Field options 43 Field options for indexing 43 ■ Field options for storing fields 44 Field options for term vectors 44 ■ Reader, TokenStream, and byte[] field values 45 ■ Field option combinations 46 ■ Field options for sorting 46 ■ Multivalued fields 47 2.5 Boosting documents and fields 48 Boosting documents 48 ■ Boosting fields 49 ■ Norms 50 2.6 Indexing numbers, dates, and times 51 Indexing numbers 51 ■ Indexing dates and times 52 2.7 Field truncation 53 2.8 Near-real-time search 54 2.9 Optimizing an index 54 2.10 Other directory implementations 56 2.11 Concurrency, thread safety, and locking issues 58 Thread and multi-JVM safety 58 ■ Accessing an index over a remote file system 59 ■ Index locking 61 2.12 Debugging indexing 63 2.13 Advanced indexing concepts 64 Deleting documents with IndexReader 65 ■ Reclaiming disk space used by deleted documents 66 ■ Buffering and flushing 66 Index commits 67 ■ ACID transactions and index consistency 69 ■ Merging 70 2.14 Summary 72 www.it-ebooks.info Licensed to theresa smith <[email protected]> CONTENTS ix Adding search to your application 74 3 3.1 Implementing a simple search feature 76 Searching for a specific term 76 ■ Parsing a user-entered query expression: QueryParser 77 3.2 Using IndexSearcher 80 Creating an IndexSearcher 81 ■ Performing searches 82 Working with TopDocs 82 ■ Paging through results 84 Near-real-time search 84 3.3 Understanding Lucene scoring 86 How Lucene scores 86 ■ Using explain() to understand hit scoring 88 3.4 Lucene’s diverse queries 90 Searching by term: TermQuery 90 ■ Searching within a term range: TermRangeQuery 91 ■ Searching within a numeric range: NumericRangeQuery 92 ■ Searching on a string: PrefixQuery 93 ■ Combining queries: BooleanQuery 94 Searching by phrase: PhraseQuery 96 ■ Searching by wildcard: WildcardQuery 99 ■ Searching for similar terms: FuzzyQuery 100 ■ Matching all documents: MatchAllDocsQuery 101 3.5 Parsing query expressions: QueryParser 101 Query.toString 102 ■ TermQuery 103 ■ Term range searches 103 ■ Numeric and date range searches 104 Prefix and wildcard queries 104 ■ Boolean operators 105 Phrase queries 105 ■ Fuzzy queries 106 MatchAllDocsQuery 107 ■ Grouping 107 ■ Field selection 107 ■ Setting the boost for a subquery 108 To QueryParse or not to QueryParse? 108 3.6 Summary 109 Lucene’s analysis process 110 4 4.1 Using analyzers 111 Indexing analysis 113 ■ QueryParser analysis 114

Lucene in Action Second Edition

Apache Lucene Searching the Web and Everything Else

Magnify Search Security and Administration Release 8.2 Version 04

Enterprise Search Technology Using Solr and Cloud Padmavathy Ravikumar Governors State University

Unravel Data Systems Version 4.5

Associate Enablement Perspectives

Instagram’S Success, Networking the Old Way

IBM Cloudant: Database As a Service Fundamentals

Final Report CS 5604: Information Storage and Retrieval

Enterprise 2.0 in Europe”, Produced by Tech4i2, IDC and Headshift for the European Commission

Apache Lucene™ Integration Reference Guide

Apache Lucene - Overview

Open Source and Third Party Documentation