Apache Solr 4 Cookbook.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
www.it-ebooks.info Apache Solr 4 Cookbook Over 100 recipes to make Apache Solr faster, more reliable, and return better results Rafał Kuć BIRMINGHAM - MUMBAI www.it-ebooks.info Apache Solr 4 Cookbook Copyright © 2013 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: July 2011 Second edition: January 2013 Production Reference: 1150113 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78216-132-5 www.packtpub.com Cover Image by J. Blaminsky ([email protected]) www.it-ebooks.info Credits Author Project Coordinator Rafał Kuć Anurag Banerjee Reviewers Proofreaders Ravindra Bharathi Maria Gould Marcelo Ochoa Aaron Nash Vijayakumar Ramdoss Indexer Acquisition Editor Tejal Soni Andrew Duckworth Production Coordinators Lead Technical Editor Manu Joseph Arun Nadar Nitesh Thakur Technical Editors Cover Work Jalasha D'costa Nitesh Thakur Charmaine Pereira Lubna Shaikh www.it-ebooks.info About the Author Rafał Kuć is a born team leader and software developer. Currently working as a Consultant and a Software Engineer at Sematext Inc, where he concentrates on open source technologies such as Apache Lucene and Solr, ElasticSearch, and Hadoop stack. He has more than 10 years of experience in various software branches, from banking software to e-commerce products. He is mainly focused on Java, but open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with their problems with Solr and Lucene. He is also a speaker for various conferences around the world such as Lucene Eurocon, Berlin Buzzwords, and ApacheCon. Rafał began his journey with Lucene in 2002 and it wasn't love at first sight. When he came back to Lucene later in 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and that was it. From then on, Rafał has concentrated on search technologies and data analysis. Right now Lucene, Solr, and ElasticSearch are his main points of interest. www.it-ebooks.info Acknowledgement This book is an update to the first cookbook for Solr that was released almost two year ago now. What was at the beginning an update turned out to be a rewrite of almost all the recipes in the book, because we wanted to not only bring you an update to the already existing recipes, but also give you whole new recipes that will help you with common situations when using Apache Solr 4.0. I hope that the book you are holding in your hands (or reading on a computer or reader screen) will be useful to you. Although I would go the same way if I could get back in time, the time of writing this book was not easy for my family. Among the ones who suffered the most were my wife Agnes and our two great kids, our son Philip and daughter Susanna. Without their patience and understanding, the writing of this book wouldn't have been possible. I would also like to thank my parents and Agnes' parents for their support and help. I would like to thank all the people involved in creating, developing, and maintaining Lucene and Solr projects for their work and passion. Without them this book wouldn't have been written. Once again, thank you. www.it-ebooks.info About the Reviewers Ravindra Bharathi has worked in the software industry for over a decade in various domains such as education, digital media marketing/advertising, enterprise search, and energy management systems. He has a keen interest in search-based applications that involve data visualization, mashups, and dashboards. He blogs at http://ravindrabharathi.blogspot.com. Marcelo Ochoa works at the System Laboratory of Facultad de Ciencias Exactas of the Universidad Nacional del Centro de la Provincia de Buenos Aires, and is the CTO at Scotas. com, a company specialized in near real time search solutions using Apache Solr and Oracle. He divides his time between University jobs and external projects related to Oracle, and big data technologies. He has worked in several Oracle related projects such as translation of Oracle manuals and multimedia CBTs. His background is in database, network, web, and Java technologies. In the XML world, he is known as the developer of the DB Generator for the Apache Cocoon project, the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration by using Oracle JVM Directory implementation, and the Restlet.org project – the Oracle XDB Restlet Adapter, an alternative to writing native REST web services inside the database resident JVM. Since 2006, he has been a part of the Oracle ACE program. Oracle ACEs are known for their strong credentials as Oracle community enthusiasts and advocates, with candidates nominated by ACEs in the Oracle Technology and Applications communities. He is the author of Chapter 17 of the book Oracle Database Programming using Java and Web Services, Kuassi Mensah, Digital Press and Chapter 21 of the book Professional XML Databases, Kevin Williams, Wrox Press. www.it-ebooks.info www.PacktPub.com Support files, eBooks, discount offers and more You might want to visit www.PacktPub.com for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version atwww.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. http://PacktLib.PacktPub.com Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. Why Subscribe? f Fully searchable across every book published by Packt f Copy and paste, print and bookmark content f On demand and accessible via web browser Free Access for Packt account holders If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access. www.it-ebooks.info www.it-ebooks.info Table of Contents Preface 1 Chapter 1: Apache Solr Configuration 5 Introduction 5 Running Solr on Jetty 6 Running Solr on Apache Tomcat 10 Installing a standalone ZooKeeper 14 Clustering your data 15 Choosing the right directory implementation 17 Configuring spellchecker to not use its own index 19 Solr cache configuration 22 How to fetch and index web pages 27 How to set up the extracting request handler 30 Changing the default similarity implementation 32 Chapter 2: Indexing Your Data 35 Introduction 35 Indexing PDF files 36 Generating unique fields automatically 38 Extracting metadata from binary files 40 How to properly configure Data Import Handler with JDBC 42 Indexing data from a database using Data Import Handler 45 How to import data using Data Import Handler and delta query 48 How to use Data Import Handler with the URL data source 50 How to modify data while importing with Data Import Handler 53 Updating a single field of your document 56 Handling multiple currencies 59 Detecting the document's language 62 Optimizing your primary key field indexing 67 www.it-ebooks.info Table of Contents Chapter 3: Analyzing Your Text Data 69 Introduction 70 Storing additional information using payloads 70 Eliminating XML and HTML tags from text 73 Copying the contents of one field to another 75 Changing words to other words 77 Splitting text by CamelCase 80 Splitting text by whitespace only 82 Making plural words singular without stemming 84 Lowercasing the whole string 87 Storing geographical points in the index 88 Stemming your data 91 Preparing text to perform an efficient trailing wildcard search 93 Splitting text by numbers and non-whitespace characters 96 Using Hunspell as a stemmer 99 Using your own stemming dictionary 101 Protecting words from being stemmed 103 Chapter 4: Querying Solr 107 Introduction 108 Asking for a particular field value 108 Sorting results by a field value 109 How to search for a phrase, not a single word 111 Boosting phrases over words 114 Positioning some documents over others in a query 117 Positioning documents with words closer to each other first 122 Sorting results by the distance from a point 125 Getting documents with only a partial match 128 Affecting scoring with functions 130 Nesting queries 134 Modifying returned documents 136 Using parent-child relationships