Apache Solr 3 1 Cookbook.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
Apache Solr 3.1 Cookbook Over 100 recipes to discover new ways to work with Apache's Enterprise Search Server Rafał Kuć BIRMINGHAM - MUMBAI Apache Solr 3.1 Cookbook Copyright © 2011 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: July 2011 Production Reference: 1140711 Published by Packt Publishing Ltd. 32 Lincoln Road Olton Birmingham, B27 6PA, UK. ISBN 978-1-849512-18-3 www.packtpub.com Cover Image by Rakesh Shejwal ([email protected]) Credits Author Project Coordinator Rafał Kuć Srimoyee Ghoshal Reviewers Proofreader Ravindra Bharathi Samantha Lyon Juan Grande Indexer Acquisition Editor Hemangini Bari Sarah Cullington Production Coordinators Development Editor Adline Swetha Jesuthas Meeta Rajani Kruthika Bangera Technical Editors Cover Work Manasi Poonthottam Kruthika Bangera Sakina Kaydawala Copy Editors Kriti Sharma Leonard D'Silva About the Author Rafał Kuć is a born team leader and software developer. Right now, he holds the position of Software Architect, and Solr and Lucene specialist. He has eight years of experience in various software branches, from banking software to the e-commerce products. He is mainly focused on Java, but is open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site where he tries to share his knowledge and help people with their problems with Solr and Lucene. Rafał began his journey with Lucene in 2002 and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and that was it. From then, Rafał has concentrated on search technologies and data analysis. Right now, Lucene and Solr are his main points of interest. Many Polish e-commerce sites use Solr deployments that were made with the guidance of Rafał and with the use of tools developed by him. The writing of this book was a big thing for me. It took almost all of my free time for the last six months. Of course, this was my sacrifice and if I were to make that decision today, it would be the same. However, the ones that suffered from this the most was my family—my wife Agnes and our son Philip. Without their patience for the always busy husband and father, the writing of this book wouldn't have been possible. Furthermore, the last drafts and corrections were made, when we were waiting for our daughter to be born, so it was even harder for them. I would like to thank all the people involved in creating, developing, and maintaining Lucene and Solr projects for their work and passion. Without them, this book wouldn't be written. Last, but not least, I would like to thank all the people who made the writing of this book possible, and who listened to my endless doubts and thoughts, especially Marek Rogoziński. Your point-of-view helped me a lot. I'm not able to mention you all, but you all know who you are. Once again, thank you. About the Reviewers Ravindra Bharathi holds a master's degree in Structural Engineering and started his career in the construction industry. He took up software development in 2001 and has since worked in domains such as education, digital media marketing/advertising, enterprise search, and energy management systems. He has a keen interest in search- based applications that involve data visualization, mashups, and dashboards. He blogs at http://ravindrabharathi.blogspot.com. Juan Grande is an Informatics Engineering student at Universidad de Buenos Aires, Argentina. He has worked with Java-related technologies since 2005, when he started as web applications developer, gaining experience in technologies such as Struts, Hibernate, and Spring. Since 2010, he has worked as search consultant for Plug Tree. He also works as an undergraduate teaching assistant in programming-related subjects since 2006. His other topic of interest is experimental physics, where he has a paper published. I want to acknowledge Plug Tree for letting me get involved in this project, and specially Tomás Fernández Löbbe for introducing me to the fascinating world of Solr. www.PacktPub.com Support files, eBooks, discount offers, and more You might want to visit www.PacktPub.com for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. http://PacktLib.PacktPub.com Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. Why Subscribe? f Fully searchable across every book published by Packt f Copy and paste, print and bookmark content f On demand and accessible via web browser Free Access for Packt account holders If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access. Table of Contents Preface 1 Chapter 1: Apache Solr Configuration 5 Introduction 5 Running Solr on Jetty 6 Running Solr on Apache Tomcat 9 Using the Suggester component 11 Handling multiple languages in a single index 15 Indexing fields in a dynamic way 17 Making multilingual data searchable with multicore deployment 19 Solr cache configuration 23 How to fetch and index web pages 28 Getting the most relevant results with early query termination 31 How to set up Extracting Request Handler 33 Chapter 2: Indexing your Data 35 Introduction 35 Indexing data in CSV format 36 Indexing data in XML format 38 Indexing data in JSON format 40 Indexing PDF files 43 Indexing Microsoft Office files 45 Extracting metadata from binary files 47 How to properly configure Data Import Handler with JDBC 49 Indexing data from a database using Data Import Handler 52 How to import data using Data Import Handler and delta query 55 How to use Data Import Handler with URL Data Source 57 How to modify data while importing with Data Import Handler 60 Table of Contents Chapter 3: Analyzing your Text Data 63 Introduction 64 Storing additional information using payloads 64 Eliminating XML and HTML tags from the text 66 Copying the contents of one field to another 68 Changing words to other words 70 Splitting text by camel case 72 Splitting text by whitespace only 74 Making plural words singular, but without stemming 75 Lowercasing the whole string 77 Storing geographical points in the index 78 Stemming your data 81 Preparing text to do efficient trailing wildcard search 83 Splitting text by numbers and non-white space characters 86 Chapter 4: Solr Administration 89 Introduction 89 Monitoring Solr via JMX 90 How to check the cache status 93 How to check how the data type or field behave 95 How to check Solr query handler usage 98 How to check Solr update handler usage 100 How to change Solr instance logging configuration 102 How to check the Java based replication status 104 How to check the script based replication status 106 Setting up a Java based index replication 108 Setting up script based replication 110 How to manage Java based replication status using HTTP commands 113 How to analyze your index structure 116 Chapter 5: Querying Solr 121 Introduction 121 Asking for a particular field value 122 Sorting results by a field value 123 Choosing a different query parser 125 How to search for a phrase, not a single word 126 Boosting phrases over words 128 Positioning some documents over others on a query 131 Positioning documents with words closer to each other first 136 Sorting results by a distance from a point 138 Getting documents with only a partial match 141 Affecting scoring with function 143 ii Table of Contents Nesting queries 147 Chapter 6: Using Faceting Mechanism 151 Introduction 151 Getting the number of documents with the same field value 152 Getting the number of documents with the same date range 155 Getting the number of documents with the same value range 158 Getting the number of documents matching the query and sub query 161 How to remove filters from faceting results 164 How to name different faceting results 167 How to sort faceting results in an alphabetical order 170 How to implement the autosuggest feature using faceting 173 How to get the number of documents that