IN ACTION

Chris A. Mattmann Jukka L. Zitting

FOREWORD BY JÉRÔME CHARRON

MANNING

Tika in Action


CHRIS A. MATTMANN JUKKA L. ZITTING

MANNING SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: [email protected]

©2012 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Cynthia Kane
Copyeditor: Benjamin Berg
Proofreader: Katie Tennant
Typesetter: Dottie Marsico
Cover designer: Marija Tudor

ISBN 9781935182856
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 16 15 14 13 12 11

To my lovely wife Lisa and my son Christian —CM

To my lovely wife Kirsi-Marja and our happy cats —JZ

brief contents

PART 1 GETTING STARTED
1 ■ The case for the digital Babel fish
2 ■ Getting started with Tika
3 ■ The information landscape

PART 2 TIKA IN DETAIL
4 ■ Document type detection
5 ■ Content extraction
6 ■ Understanding metadata
7 ■ Language detection
8 ■ What’s in a file?

PART 3 INTEGRATION AND ADVANCED USE
9 ■ The big picture
10 ■ Tika and the Lucene search stack
11 ■ Extending Tika



PART 4 CASE STUDIES
12 ■ Powering NASA science data systems
13 ■ Content management with Apache Jackrabbit
14 ■ Curating cancer research data with Tika
15 ■ The classic search engine example

contents

foreword
preface
acknowledgments
about this book
about the authors
about the cover illustration

PART 1 GETTING STARTED

1 The case for the digital Babel fish
  1.1 Understanding digital documents
      A taxonomy of file formats ■ Parser libraries ■ Structured text as the universal language ■ Universal metadata ■ The program that understands everything
  1.2 What is Apache Tika?
      A bit of history ■ Key design goals ■ When and where to use Tika
  1.3 Summary

2 Getting started with Tika
  2.1 Working with Tika source code
      Getting the source code ■ The Maven build ■ Including Tika in Ant projects



  2.2 The Tika application
      Drag-and-drop text extraction: the Tika GUI ■ Tika on the command line
  2.3 Tika as an embedded library
      Using the Tika facade ■ Managing dependencies
  2.4 Summary

3 The information landscape
  3.1 Measuring information overload
      Scale and growth ■ Complexity
  3.2 I’m feeling lucky—searching the information landscape
      Just click it: the modern search engine ■ Tika’s role in search
  3.3 Beyond lucky: machine learning
      Your likes and dislikes ■ Real-world machine learning
  3.4 Summary

PART 2 TIKA IN DETAIL

4 Document type detection
  4.1 Internet media types
      The parlance of media type names ■ Categories of media types ■ IANA and other type registries
  4.2 Media types in Tika
      The shared MIME-info database ■ The MediaType class ■ The MediaTypeRegistry class ■ Type hierarchies
  4.3 File format diagnostics
      Filename globs ■ Content type hints ■ Magic bytes ■ Character encodings ■ Other mechanisms
  4.4 Tika, the type inspector
  4.5 Summary

5 Content extraction
  5.1 Full-text extraction
      Abstracting the parsing process ■ Full-text indexing ■ Incremental parsing


  5.2 The Parser interface
      Who knew parsing could be so easy? ■ The parse() method ■ Parser implementations ■ Parser selection
  5.3 Document input stream
      Standardizing input to Tika ■ The TikaInputStream class
  5.4 Structured XHTML output
      Semantic structure of text ■ Structured output via SAX events ■ Marking up structure with XHTML
  5.5 Context-sensitive parsing
      Environment settings ■ Custom document handling
  5.6 Summary

6 Understanding metadata
  6.1 The standards of metadata
      Metadata models ■ General metadata standards ■ Content-specific metadata standards
  6.2 Metadata quality
      Challenges/Problems ■ Unifying heterogeneous standards
  6.3 Metadata in Tika
      Keys and multiple values ■ Transformations and views
  6.4 Practical uses of metadata
      Common metadata for the Lucene indexer ■ Give me my metadata in my schema!
  6.5 Summary

7 Language detection
  7.1 The most translated document in the world
  7.2 Sounds Greek to me—theory of language detection
      Language profiles ■ Profiling algorithms ■ The N-gram algorithm ■ Advanced profiling algorithms
  7.3 Language detection in Tika
      Incremental language detection ■ Putting it all together
  7.4 Summary


8 What’s in a file?
  8.1 Types of content
      HDF: a format for scientific data ■ Really Simple Syndication: a format for rapidly changing content
  8.2 How Tika extracts content
      Organization of content ■ File header and naming conventions ■ Storage affects extraction
  8.3 Summary

PART 3 INTEGRATION AND ADVANCED USE

9 The big picture
  9.1 Tika in search engines
      The search use case ■ The anatomy of a search index
  9.2 Managing and mining information
      Document management systems ■ Text mining
  9.3 Buzzword compliance
      Modularity, Spring, and OSGi ■ Large-scale computing
  9.4 Summary

10 Tika and the Lucene search stack
  10.1 Load-bearing walls
      ManifoldCF ■ Open Relevance
  10.2 The steel frame
      Lucene Core ■ Solr
  10.3 The finishing touches
      Nutch ■ Droids ■ Mahout
  10.4 Summary

11 Extending Tika
  11.1 Adding type information
      Custom media type configuration


  11.2 Custom type detection
      The Detector interface ■ Building a custom type detector ■ Plugging in new detectors
  11.3 Customized parsing
      Customizing existing parsers ■ Writing a new parser ■ Plugging in new parsers ■ Overriding existing parsers
  11.4 Summary

PART 4 CASE STUDIES

12 Powering NASA science data systems
  12.1 NASA’s Planetary Data System
      PDS data model ■ The PDS search redesign
  12.2 NASA’s Earth Science Enterprise
      Leveraging Tika in NASA Earth Science SIPS ■ Using Tika within the ground data systems
  12.3 Summary

13 Content management with Apache Jackrabbit
  13.1 Introducing Apache Jackrabbit
  13.2 The text extraction pool
  13.3 Content-aware WebDAV
  13.4 Summary

14 Curating cancer research data with Tika
  14.1 The NCI Early Detection Research Network
      The EDRN data model ■ Scientific data curation
  14.2 Integrating Tika
      Metadata extraction ■ MIME type identification and classification
  14.3 Summary


15 The classic search engine example
  15.1 The Public Terabyte Dataset Project
  15.2 The Bixo
      Parsing fetched documents ■ Validating Tika’s charset detection
  15.3 Summary

appendix A Tika quick reference
appendix B Supported metadata keys
index

foreword

I’m a big fan of search engines, so early in the year 2004 I was looking for a good Java-based open source project on search engines. I quickly discovered Nutch. Nutch is an open source search engine project from the Apache Software Foundation. It was initiated by Doug Cutting, the well-known father of Lucene. With my new toy on my laptop, I tested and tried to evaluate it. Even if Nutch was in its early stages, it was a promising project—exactly what I was looking for. I proposed my first patches to Nutch relating to language identification in early 2005. Then, in the middle of 2005 I became a Nutch committer and increased my number of contributions relating to language identification, content-type guessing, and document analysis.

Looking more deeply at Lucene, I discovered a wide set of projects around it: Nutch, Solr, and what would eventually become Mahout. Lucene provides its own analysis tools, as do Nutch and Solr, and each one employs some “proprietary” interfaces to deal with analysis engines. So I consulted with Chris Mattmann, another Nutch committer with whom I had worked, about the potential for refactoring all these disparate tools in a common and standardized project. The concept of Tika was born.

Chris began to advocate for Tika as a standalone project in 2006. Then Jukka Zitting came into the picture and took the lead on the Tika project; after a lot of refactoring and enhancements, Tika became a Lucene top-level project. At that point in time, Tika was being used in Nutch, Droids (an Incubator project that you’ll hear about in chapter 10), and many non-Lucene projects—the activity on Tika mailing lists was indicative of this. The next promising steps for the project involved plugging Tika into top-level Lucene projects, such as Lucene itself or Solr.



That amounted to a big challenge, as it required Tika to provide a flexible and robust set of interfaces that could be used in any programming context where metadata analysis was needed. Luckily, Tika got there.

With this book, written by Tika’s two main creators and maintainers, Chris and Jukka, you’ll understand the problems of document analysis and document information extraction. They first explain to the reader why developers have such a need for Tika. Today, content handling and analysis are basic building blocks of all major modern services: search engines, content management systems, data mining, and other areas. If you’re a software developer, you’ve no doubt needed, on many occasions, to guess the encoding, formatting, and language of a file, and then to extract its metadata (title, author, and so on) and content. And you’ve probably noticed that this is a pain. That’s what Tika does for you. It provides a robust toolkit to easily handle any data format and to simplify this painful process.

Chris and Jukka explain many details and examples of the Tika API and toolkit, including the Tika command-line interface and its graphical user interface (GUI) that you can use to extract information about any type of file handled by Tika. They show how you can use the Tika Application Programming Interface (API) to integrate Tika commodities directly with your own projects. You’ll discover that Tika is both simple to use and powerful. Tika has been carefully designed by Chris and Jukka and, despite the internal complexity of this type of library, Tika’s API and tools are simple and easy to understand and to use.

Finally, Chris and Jukka show many real-life use cases of Tika. The most noticeable real-life projects are Tika powering the NASA Science Data Systems, Tika curating cancer research data at the National Cancer Institute’s Early Detection Research Network, and the use of Tika for content management within the Apache Jackrabbit project. Tika is already used in many projects.

I’m proud to have helped launch Tika. And I’m extremely grateful to Chris and Jukka for bringing Tika to this level and knowing that the long nights I spent writing code for automatic language identification for the MIME type repository weren’t in vain. To now make (even) a small contribution, for example, to assist in research in the fight against cancer, goes straight to my heart. Thank you both for all your work, and thank you for this book.

JÉRÔME CHARRON
CHIEF TECHNICAL OFFICER
WEBPULSE

preface

While studying information retrieval and search engines at the University of Southern California in the summer of 2005, I became interested in the Apache Nutch project. My professor, Dr. Ellis Horowitz, had recently discovered Nutch and thought it a good platform for the students in the course to get real-world experience during the final project phase of his “CS599: Seminar on Search Engines” course.

After poking around Nutch and digging into its innards, I decided on a final project. It was a Really Simple Syndication (RSS) plugin described in detail in NUTCH-30.¹ The plugin read an RSS file, extracted its outgoing web links and text, and fed that information back into the Nutch crawler for later indexing and retrieval. Seemingly innocuous, the class taught me a great deal about search engines, and helped pinpoint the area of search I was interested in—content detection and extraction.

Fast forward to 2007: after I eventually became a Nutch committer, and focused in on more parsing-related issues (updates to the Nutch parser factory, metadata representation updates, and so on), my Nutch mentor Jérôme Charron and I decided that there was enough critical mass of code in Nutch related to parsing (parsing, language identification, extraction, and representation) that it warranted its own project. Other projects were doing it—rumblings of what would eventually become Hadoop were afoot—which led us to believe that the time was ripe for our own project. Since naming projects after children’s stuffed animals was popular at the time, we felt we could do the same, and Tika was born (named after Jérôme’s daughter’s stuffed animal).

1 https://issues.apache.org/jira/browse/NUTCH-30



It wasn’t as simple as we thought. After getting little interest from the broader Lucene community (Nutch was a Lucene subproject and thus the project we were proposing had to go through the Lucene PMC), and with Jérôme and I both taking on further responsibility that took time away from direct Nutch development, what would eventually be known as Tika began to fizzle away.

That’s where the other author of this book comes in. Jukka Zitting, bless him, was keenly interested in a technology, separate from the behemoth Nutch codebase, that would perform the types of things that we had carved off as Tika core capabilities: parsing, text extraction, metadata extraction, MIME detection, and more. Jukka was a seasoned Apache veteran, so he knew what to do. Jukka became a real leader of the original Tika proposal, took it to the Apache Incubator, and helped turn Tika into a real Apache project. After working with Jukka for a year or so in the Incubator community, we took our show on the road back to Lucene as a subproject when Tika graduated. Over a period of two years, we made seven Tika releases, infected several popular Apache projects (including Lucene, Solr, Nutch, and Jackrabbit), and gained enough critical mass to grow into a full-fledged Apache Top Level Project (TLP). But we weren’t done there.

I don’t remember the exact time during the Christmas season in 2009 when I decided it was time to write a book, but it matters little. When I get an idea in my head, it’s hard to get it out. This book was happening. Tika in Action was happening. I approached Jukka and asked him how he felt. In characteristic fashion, he was up for the challenge.

We sure didn’t know what we were getting ourselves into! We didn’t know that the rabbit hole went this deep. That said, I can safely say I don’t think we could’ve taken any other path that would’ve been as fulfilling, exciting, and rewarding. We really put our hearts and souls into creating this book. We sincerely hope you enjoy it. I think I speak for both of us in saying, I know we did!

CHRIS MATTMANN

acknowledgments

No book is born without great sacrifice by many people. The team who worked on this book means a lot to both of us. We’ll enumerate them here.

Together, we’d like to thank our development editor at Manning, Cynthia Kane, for spending tireless hours working with us to make this book the best possible, and the clearest book to date on Apache Tika. Furthermore, her help with simplifying difficult concepts, creating direct and meaningful illustrations, and with conveying complex information to the reader is something that both of us will leverage and use well beyond this book and into the future.

Of course, the entire team at Manning, from Marjan Bace on down, was a tremendous help in the book’s development and publication. We’d like to thank Nicholas Chase specifically for his help navigating the infrastructure and tools to put this book together. Christina Rudloff was a tremendous help in getting the initial book deal set up and we are very appreciative. The production team of Benjamin Berg, Katie Tennant, Dottie Marsico, and Mary Piergies worked hard to turn our manuscript into the book you are now reading, and Alex Ott did a thorough technical review of the final manuscript during production and helped clarify numerous code issues and details.

We’d also like to thank the following reviewers who went through three time-crunched review cycles and significantly improved the quality of this book with their thoughtful comments: Deepak Vohra, John Griffin, Dean Farrell, Ken Krugler, John Guthrie, Richard Johannesson, Andreas Kemkes, Julien Nioche, Rick Wagner, Andrew F. Hart, Nick Burch, and Sean Kelly.



Finally, we’d like to acknowledge and thank Ken Krugler and Chris Schneider of Bixo Labs, for contributing the bulk of chapter 15 and for showing us a real-world example of where Tika shines. Thanks, guys!

CHRIS—I would like to thank my wife Lisa for her tremendous support. I originally promised her that my PhD dissertation would be the last book that I wrote, and after four years of sleepless nights (and many sleepless nights before that trying to make ends meet), that I would make time to enjoy life and slow down. That worked for about two years, until this opportunity came along. Thanks for the support again, honey: I couldn’t have made it here without you. I can promise a few more years of slowdown now that the book is done!

JUKKA—I would like to thank my wife Kirsi-Marja for the encouragement to take on new challenges and for understanding the long evenings that meeting these challenges sometimes requires. Our two cats, Juuso and Nöpö, also deserve special thanks for their insistence on taking over the keyboard whenever a break from writing was needed.

about this book

We wrote Tika in Action to be a hands-on guide for developers working with search engines, content management systems, and other similar applications who want to exploit the information locked in digital documents. The book introduces you to the world of mining text and binary documents and other information sources like internet media types and Dublin Core metadata. Then it shows where Tika fits within this landscape and how you can use Tika to build and extend applications. Case studies present real-world experience from domains ranging from search engines to digital asset management and scientific data processing. In addition to the architectural overviews, you will find more detailed information in the later chapters that focus on advanced features like XMP metadata processing, automatic language detection, and custom parser extensions. The book also describes common file formats like MS Word, PDF, HTML, and Zip, and open source libraries used to process files in these formats. The included code examples are designed to support hands-on experimentation. No previous knowledge of Tika or text mining techniques is required. The book will be most valuable to readers with a working knowledge of Java.

Roadmap

Chapter 1 gives the reader a contextual overview of Tika, including its history, its core capabilities, and some basic use cases where Tika is most helpful. Tika includes abilities for file type identification, text extraction, integration of existing parsing libraries, and language identification.



Chapter 2 jumps right into using Tika, including instructions for downloading it, building it as a software library, and using Tika in a downstream Maven or Ant project. Quick tips for getting Tika up and running rapidly are present throughout the chapter.

Chapter 3 introduces the reader to the information landscape and identifies where and how information is fed into the Tika framework. The reader will be introduced to the principles of the World Wide Web (WWW), its architecture, and how the web and Tika synergistically complement one another.

Chapter 4 takes the reader on a deep dive into MIME type identification, covering topics ranging from the MIME hierarchy of the web, to identifying the unique byte pattern signatures present in every file, to other means (such as regular expressions and file extensions) of identifying files.

Chapter 5 introduces the reader to content extraction with Tika. It starts with a simple full-text extraction and indexing example using the Tika facade, and continues with a tour of the core Parser interface and how Tika uses it for content extraction. The reader will learn useful techniques for things such as extracting all links from a document or processing Zip archives and other composite documents.

Chapter 6 covers metadata. The chapter begins with a discussion of what metadata means in the context of Tika, along with a short classification of the existing metadata models that Tika supports. Tika’s metadata API is discussed in detail, including how it helps to normalize and validate metadata instances. The chapter describes how to supercharge the LuceneIndexer from chapter 5 and turn it into an RSS-based file notification service in a few simple lines of code.

Chapter 7 introduces the topic of language identification. The language a document is written in is a highly useful piece of metadata, and the chapter describes mechanisms for automatically identifying written languages. The reader will encounter the most translated document in the world and see how Tika can correctly identify the language used in many of the translations.

Chapter 8 gives the reader an in-depth overview of how files represent information, in terms of their content organization, their storage representation, and the way that metadata is codified, all the while showing how Tika hides this complexity and pulls information from these files. The reader takes an in-depth look at Tika’s RSS and HDF5 parser classes, and learns how Tika’s parsers codify the heterogeneity of files, and how you can develop your own parsers using similar methodologies.

Chapter 9 reviews the best places to leverage Tika in your information management software, including pointing out key use cases where Tika can solely (or with a little glue code) implement many of the high-end features of the system. Document record archives, text mining, and search engines are all topics covered.

Chapter 10 educates the reader in the vocabulary of the Lucene ecosystem. Mahout, ManifoldCF, Lucene, Solr, Nutch, Droids—all of these will roll off the tongue by the time you’re done surveying Lucene’s rich and vibrant community.

Lucene was the birthplace of Tika, specifically within the Apache Nutch project, and this chapter takes the opportunity to show you how Tika has grown up over the years into the load-bearing walls of the entire Lucene ecosystem.

Chapter 11 explains what to do when stock Tika out of the box doesn’t handle your file type identification, extraction, and representation needs. Read: you don’t have to pick another whiz-bang technology—you simply extend Tika. We show you how in this chapter, taking you start-to-end through an example of a prescription file type that you may exchange with a doctor.

Chapter 12 is the first case study of the book, and it’s high-visibility. We show you how NASA and its planetary and Earth science communities are using Tika to search planetary images, to extract data and metadata from Earth science files, and to identify content for dissemination and acquisition.

Chapter 13 shows you how the Apache Jackrabbit content repository, a key component in many content and document management systems, uses Tika to implement full-text search and WebDAV integration.

Chapter 14 presents how Tika is used at the National Cancer Institute, helping to power data systems for the Early Detection Research Network (EDRN). We show you how Tika is an integral component of another Apache technology, OODT, the data system infrastructure used to power many national-scale data systems. Tika helps to detect file types, and helps to organize cancer information as it’s catalogued, archived, and made available to the broader scientific community.

For chapter 15, we interviewed Ken Krugler and Chris Schneider of Bixo Labs about how they used Tika to classify and identify content from the Public Terabyte Dataset project, an ambitious endeavor to make available a traditional web-scale dataset for public use. Using Tika, Ken and his team demonstrate a classic search engine example, and identify several areas of improvement and future work in Tika including language identification and charset detection.

The book contains two appendixes. The first is a Tika quick reference. Think of it as a cheat-sheet for using Tika, its commands, and a compact form of some of Tika’s documentation. The second appendix is a description of Tika’s relevant metadata keys, giving the reader an idea of how and when to use them in a custom parser, in any of the existing Parser classes that ship with Tika, or in any downstream program or analysis desired.

Code conventions and downloads

All source code in the book is in a fixed-width font like this, which sets it off from the surrounding text. In many listings, the code is annotated to point out key concepts, and numbered bullets are sometimes used in the text to provide additional information about the code.

The source code for the examples in the book is available for download from the publisher’s website at www.manning.com/TikainAction. The code is organized by chapter and contains special markers that link individual code snippets to specific sections in the book.


See the respective chapters for details about the dependencies required to compile and run the examples.

All the example source code has been written for and tested with Tika version 1.0 and should work with any future Tika 1.x release. Visit http://tika.apache.org/ to download the latest Tika release. See chapter 2 for more details on how to get started.

Author Online

The purchase of Tika in Action includes free access to a public forum run by Manning Publications. The Tika in Action Author Online forum allows readers of the book to log on, write comments, interact with the authors, and discuss the book. Please feel free to jump on and share your thoughts! You will find the Author Online link on the publisher’s website at www.manning.com/TikainAction.

about the authors

CHRIS MATTMANN has a wealth of experience in software design and in the construction of large-scale data-intensive systems. His work has infected a broad set of communities, ranging from helping NASA unlock data from its next generation of Earth science system satellites, to assisting graduate students at the University of Southern California (his alma mater) in the study of software architecture, all the way to helping industry and open source as a member of the Apache Software Foundation. When he’s not busy being busy, he’s spending time with his lovely wife and son braving the mean streets of Southern California.

JUKKA ZITTING is a core Tika developer with more than a decade of experience with open source content management. Jukka works as a senior developer for Adobe Systems in Basel, Switzerland. His work involves building systems for managing ever-larger and more-complex volumes of digital content. Much of this work is contributed as open source to the Apache Software Foundation.


about the cover illustration

The figure on the cover of Tika in Action is captioned “Habit of Turkish Courtesan in 1568” and is taken from the four-volume Collection of the Dresses of Different Nations by Thomas Jefferys, published in London between 1757 and 1772. The collection, which includes beautiful hand-colored copperplate engravings of costumes from around the world, has influenced theatrical costume design since its publication. The diversity of the drawings in the Collection of the Dresses of Different Nations speaks vividly of the richness of the costumes presented on the London stage over 200 years ago. The costumes, both historical and contemporaneous, offered a glimpse into the dress customs of people living in different times and in different countries, making them come alive for London theater audiences.

Dress codes have changed in the last century and the diversity by region, so rich in the past, has faded away. It’s now often hard to tell the inhabitant of one continent from another. Perhaps, trying to view it optimistically, we’ve traded a cultural and visual diversity for a more varied personal life. Or a more varied and interesting intellectual and technical life.

We at Manning celebrate the inventiveness, the initiative, and the fun of the computer business with book covers based on the rich diversity of regional and theatrical life of two centuries ago, brought back to life by the pictures from this collection.


Part 1 Getting started

“The Babel fish,” said The Hitchhiker’s Guide to the Galaxy quietly, “is small, yellow and leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy not from its carrier but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centers of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language.”
—Douglas Adams, The Hitchhiker’s Guide to the Galaxy

This first part of the book will familiarize you with the necessity of being able to rapidly process, integrate, compare, and most importantly understand the variety of content available in the digital world. Likely you’ve encountered only a subset of the thousands of media types that exist (PDF, Word, Excel, HTML, just to name a few), and you likely need dozens of applications to read each type, edit and add text to it, view the text, copy and paste between documents, and include that information in your software programs (if you’re a programmer geek like us). We’ll try to help you tackle this problem by introducing you to Apache Tika—a software framework focused on automatic media type identification, text extraction, and metadata extraction. Our goal for this part of the book is to equip you with historical knowledge (Tika’s motivation, history, and inception), practical knowledge (how to download and install it and leverage Tika in your application), and the steps required to start using Tika to deal with the proliferation of files available at your fingertips.

The case for the digital Babel fish

This chapter covers
■ Understanding documents
■ Parsing documents
■ Introducing Apache Tika

The Babel fish in Douglas Adams’ book The Hitchhiker’s Guide to the Galaxy is a universal translator that allows you to understand all the languages in the world. It feeds on data that would otherwise be incomprehensible, and produces an understandable translation. This is essentially what Apache Tika, a nascent technology available from the Apache Software Foundation, does for digital documents. Just like the protagonist Arthur Dent, who after inserting a Babel fish in his ear could understand Vogon poetry, a computer program that uses Tika can extract text and objects from Microsoft Word documents and all sorts of other files. Our goal in this book is to equip you with enough understanding of Tika’s architecture, implementation, extension points, and philosophy that the process of making your programs file-agnostic is equally simple.

In the remainder of this chapter, we’ll familiarize you with the importance of understanding the vast array of content that has sprung up as a result of the information age.



PDF files, Microsoft Office files (including Word, Excel, PowerPoint, and so on), images, text, binary formats, and more are a part of today’s digital lingua franca, as are the applications tasked to handle such formats. We’ll discuss this issue and modern attempts to classify and understand these file formats (such as those from the Internet Assigned Numbers Authority, IANA) and the relationships of those frameworks to Tika. After motivating Tika, we’ll discuss its core Parser interface and its use in obtaining text for processing. Beyond the nuts and bolts of this discussion, we’ll provide a brief history of Tika, along with an overview of its architecture, when and where to use Tika, and a brief example of Tika’s utility. In the next section, we’ll introduce you to the existing work by IANA on classifying all the file formats out there and how Tika makes use of this classification to easily understand those formats.

1.1 Understanding digital documents

The world of digital documents and their file formats is like a universe where everyone speaks a different language. Most programs only understand their own file formats or a small set of related formats, as depicted in figure 1.1. Translators such as import modules or display plugins are usually required when one program needs to understand documents produced by another program.

There are literally thousands of different file formats in use, and most of those formats come in various different versions and dialects. For example, the widely used PDF format has evolved through eight incremental versions and various extensions over the past 18 years. Even the adoption of generic file formats such as XML has done little to unify the world of data. Both the Office Open XML format used by recent versions of Microsoft Office and the OpenDocument format used by OpenOffice.org are XML-based formats for office documents, but programs written to work with one of these formats still need special converters to understand the other format.

Luckily most programs never need to worry about this proliferation of file formats. Just like you only need to understand the language used by the people you speak with, a program only needs to understand the formats of the files it works with. The trouble begins when you’re trying to build an application that’s supposed to understand most of the widely used file formats. For example, suppose you’ve been asked to implement a search engine that can find any document on a shared network drive based on the file contents. You browse around and find Excel sheets, PDF and Word documents, text files, images and audio in a dozen different formats, PowerPoint presentations, some OpenOffice files, HTML and Flash videos, and a bunch of Zip archives that contain more documents inside them. You probably have all the programs you need for accessing each one of these file formats, but when there are thousands or perhaps millions of files, it’s not feasible for you to manually open them all and copy-paste the contained text to the search engine for indexing. You need a program that can do this for you, but how would you write such a program?


[Figure: Adobe Photoshop reads pdf, psd, jpg, gif, and png; Microsoft Office reads xls, doc, docx, xlsx, and vsd; Firefox, IE, Safari, and other browsers read html, xhtml, xml, rdf, rss, and so on. No single program understands them all.]

Figure 1.1 Computer programs usually specialize in reading and interpreting only one file format (or family of formats). To deal with .pdf files, .psd files, and the like, you’d purchase Adobe products. If you needed to deal with Microsoft Office files (.doc, .xls, and so on), you’d turn to Microsoft products or other office programs that support these Microsoft formats. Few programs can understand all of these formats.

The first step in developing such a program is to understand the properties of the proliferation of file formats that exist. To do this we’ll leverage the taxonomy of file formats specified in the Multipurpose Internet Mail Extensions (MIME) standard and maintained by the IANA.

1.1.1 A taxonomy of file formats

In order to write the aforementioned search engine, you must understand the various file formats and the methodologies that they employ for storing text and information. The first step is being able to identify and differentiate between the various file types. Most of us understand commonly used terms like spreadsheet or web page, but such terms aren’t accurate enough for use by computer programs. Traditionally extra information in the form of filename suffixes such as .xls or .html, resource forks in Mac OS, and other mechanisms have been used to identify the format of a file.


Unfortunately these mechanisms are often tied to specific operating systems or installed applications, which makes them difficult to use reliably in network environments such as the internet.

The MIME standard was published in late 1996 as the Request for Comment (RFC) documents 2045–2049. A key concept of this standard is the notion of media types¹ that uniquely name different types of data so that receiving applications can “deal with the data in an appropriate manner.” Section 5 of RFC 2045 specifies that a media type consists of a type/subtype identifier and a set of optional attribute=value parameters. For example, the default media type text/plain; charset=us-ascii identifies a plain-text document in the US-ASCII character set. RFC 2046 defines a set of common media types and their parameters, and since no single specification can list all past and future document types, RFC 2048 and the update in RFC 4288 specify a registration procedure by which new media types can be registered at IANA. As of early 2010, the official registry² contained more than a thousand media types such as text/html, image/jpeg, and application/msword, organized under the eight top-level types shown in figure 1.2. Thousands of unregistered media types such as image/x-icon and application/vnd.amazon.ebook are also being used.

Given that thousands of media types have already been classified by IANA and others, programmers need the ability to automatically incorporate this knowledge into their software applications (imagine building a collection of utensils in your kitchen without knowing that pots were used to cook sauces, or that kettles brewed tea, or that knives cut meat!). Luckily Tika provides state-of-the-art facilities in automatic media type detection. Tika takes a multipronged approach to automatic detection of media types, as shown in table 1.1; a short code sketch of this detection in action follows the table.

We’re only scratching the surface of Tika’s MIME detection patterns here. For more information on automatic media type detection, jump to chapter 4. Now that you’re familiar with differentiating between different file types, how do you make use of a file once you’ve identified it? In the next section, we’ll describe parser libraries, used to extract information from the underlying file types. There are a number of these parser libraries, and as it turns out, Tika excels (no pun intended) at abstracting away their heterogeneity, making them easy to incorporate and use in your application.

¹ Though often referred to as MIME type, the MIME standard in reality was focused on extending email to support different extensions, including non-text attachments and multipart requests. The use of MIME has since grown to cover the array of media types, including PDF, Office, and non-email-centric extensions. Although the use of MIME type is ubiquitous, in this book, we use MIME type and the more historically correct media type interchangeably.
² The official MIME media type registry is available at http://www.iana.org/assignments/media-types/.


[Figure: a tree of the top-level media types. The composite types message (rfc822, partial, external-body) and multipart (mixed, alternative, digest, parallel) and the discrete types application (octet-stream, postscript), video (mpeg), audio (basic), image (jpeg), and text (plain) each branch into further subtypes.]

Figure 1.2 Seven top-level MIME types hierarchy as defined by IANA’s RFC 2046 (the eighth type, model, was added later in RFC 2077). Top-level types can have subtypes (children), and so on, as new media types are defined over the years. The use of the multiplicity denotes that multiple children may be present at the same level in the hierarchy, and the ellipses indicate that the remainder of the hierarchy has been elided in favor of brevity.


Table 1.1 Tika’s main methods of media type detection. These techniques can be performed in isolation or combined together to formulate a powerful and comprehensive automatic file detection mechanism.

Detection mechanism and description:

File extension, filename, or alias: Each media type in Tika has a glob pattern associated with it, which can be a Java regular expression or a simple file extension, such as *.pdf or *.doc (see http://mng.bz/pNgw).

Magic bytes: Most files belonging to a media type family have a unique signature associated with them in the form of a set of control bytes in the file header. Each media type in Tika defines different sequences of these control bytes, as well as offsets used to define scanning patterns to locate these bytes within the file.

XML root characters: XML files, unique as they are, include hints that suggest their true media type. Outer XML tags (called root elements), namespaces, and referenced schemas are some of the clues that Tika uses to determine an XML file’s real type (RDF, RSS, and so on).

Parent and children media types: By leveraging the hierarchy shown in figure 1.2, Tika can determine the most accurate and precise media type for a piece of content, and fall back on parent types if the precise child isn’t detectable.
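The following short sketch shows this multipronged detection in action. It’s our own illustration rather than a listing from this chapter, built on the Tika facade class that chapter 2 introduces; the PDF filename passed to the name-based call is made up for the example.

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;

public class DetectionSketch {
    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();
        // Name-based detection uses the glob patterns described in table 1.1
        System.out.println(tika.detect("quarterly-report.pdf"));
        // Detection from an actual file can also consult magic bytes and
        // XML root elements when the name alone is ambiguous
        for (String path : args) {
            System.out.println(path + ": " + tika.detect(new File(path)));
        }
    }
}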

1.1.2 Parser libraries

To be able to extract information from a digital document, you need to understand the document format. Such understanding is built into the applications designed to work with specific kinds of documents. For example, the Microsoft Office suite is used for reading and writing Word documents, whereas Adobe Acrobat and Acrobat Reader do the same for PDF documents. These applications are normally designed for human interaction and usually don’t allow other programs to easily access document content. And even if programmatic access is possible, these applications typically can’t be run in server environments.

An alternative approach is to implement or use a parser library for the document format. A parser library is a reusable piece of software designed to enable applications to read and often also write documents in a specific format (as will be shown in figure 1.3, it’s the software that allows text and other information to be extracted from files). The library abstracts the document format to an API that’s easier to understand and use than raw byte patterns. For example, instead of having to deal with things such as CRC checksums, compression methods, and various other details, an application that uses the java.util.zip parser library package included in the standard Java class library can simply use concepts such as ZipFile and ZipEntry, as shown in the following example that outputs the names of all of the entries within a Zip file:

import java.io.IOException;
import java.util.Collections;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public static void listZipEntries(String path) throws IOException {
    ZipFile zip = new ZipFile(path);
    // Print the name of every entry in the archive
    for (ZipEntry entry : Collections.list(zip.entries())) {
        System.out.println(entry.getName());
    }
}

In addition to Zip file support, the standard Java class library and official extensions include support for many file formats, ranging from XML to various image, audio, video, and message formats.


Other advanced programming languages and platforms have similar built-in capabilities. But most document formats aren’t supported, and even for the supported formats the built-in support is often designed for specific use cases and fails to cover the full range of features required by many applications. Many open source and commercial libraries are available to address the needs of such applications. For example, the widely used Apache PDFBox (http://pdfbox.apache.org/) and POI (http://poi.apache.org/) libraries implement comprehensive support for PDF and Microsoft Office documents.
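To make the contrast concrete, here’s a small sketch of what extracting text through one such format-specific library looks like. It’s our own example and assumes the PDFBox 1.x API (PDDocument and PDFTextStripper); an equivalent program built on POI for Word documents would look entirely different.

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfTextSketch {
    public static String extractPdfText(File file) throws IOException {
        // PDF-specific API: load the document and strip its text
        PDDocument document = PDDocument.load(file);
        try {
            return new PDFTextStripper().getText(document);
        } finally {
            document.close();
        }
    }
}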

THE WONDERFUL WORLD OF APIS
APIs, or application programming interfaces, are interfaces that applications use to communicate with each other. In object-oriented frameworks and libraries, APIs are typically the recommended means of providing functionality that clients of those frameworks can consume. For example, if you’re writing code in Java to read and/or process a file, you’re likely using java.io.* and its set of objects (such as java.io.File) and its associated sets of methods (canWrite, for example), that together make up Java’s IO API.

Thanks to parser libraries, building an application that can understand multiple different file formats is no longer an insurmountable task. But lots of complexity is still to be covered, starting with understanding the variety of licensing and patent constraints on the use of different libraries and document formats. The other big problem with the myriad available parser libraries is that they all have their own APIs designed for each individual document format. Writing an application that uses more than a few such libraries requires a lot of effort learning how to best use each library. What’s needed is a unified parsing API to which all the various parser APIs could be adapted. Such an API would essentially be a universal language of digital documents.

In the ensuing section, we’ll make a case for that universal language of digital documents, describing the lowest common denominator in that vocabulary: structured text.
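Before we get there, here’s a small taste of what such a unified API means in practice. The sketch below is our own illustration using the Tika facade class covered in chapter 2: the same call returns the text of a PDF, Word, HTML, or Zip document, with format detection and parser selection handled behind the scenes.

import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class AnyDocumentToText {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        for (String path : args) {
            // One call, any supported format: Tika detects the type and
            // delegates to the matching parser library internally.
            System.out.println(tika.parseToString(new File(path)));
        }
    }
}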

1.1.3 Structured text as the universal language

Though the number of multimedia documents is rising, most of the interesting information in digital documents is still numeric or textual. These are also the forms of data that current computers and computing algorithms are best equipped to handle. The known search, classification, analysis, and many other automated processing tools for numeric and textual data are far beyond our current best understanding of how to process audio, image, or video data. Since numbers are also easy to express as text, being able to access any document as a stream of text is probably the most useful abstraction that a unified parser API could offer. Though plain text is obviously close to a least common denominator as a document abstraction, it still enables a lot of useful applications to be built on top of it. For example, a search engine or a semantic classification tool only needs access to the text content of a document.


A plain text stream, as useful as it is, falls short of satisfying the requirements of many use cases that would benefit from a bit of extra information. For example, all the modern internet search engines leverage not only the text content of the documents they find on the net but also the links between those documents. Many modern document formats express such information as hyperlinks that connect a specific word, phrase, image or other part of a document to another document. It’d be useful to be able to accurately express such information in a uniform way for all documents. Other useful pieces of information are things such as paragraph boundaries, headings, and emphasized words and sentences in a document. Most document formats express such structural information in one way or another (an example is shown in figure 1.3), even if it’s only encoded as instructions like “insert extra vertical space between these pieces of text” or “use a larger font for that sentence.” When such information is available, being able to annotate the plain text stream with semantically meaningful structure would be a clear improvement. For example, a web page such as ESPN.com typically codifies its major news categories using instructions encoded via HTML list (<li>) tags, along with Cascading Style Sheets (CSS) classes to indicate their importance as top-level news categories.

Such structural annotations should ideally be well known and easy to understand, and it should be easy for applications that don’t need or care about the extra information to focus on just the unstructured stream of text. XML and HTML are the best-known and most widely used document formats that satisfy all these requirements. Both support annotating plain text with structural information, and whereas XML offers a well-defined and easy-to-automate processing model, HTML defines a set of semantic document elements that almost everyone in the computing industry knows and understands. The XHTML standard combines these advantages, and thus provides an ideal basis for a universal document language that can express most of the interesting information from a majority of the currently used document formats. XHTML is what Tika leverages to represent structured text extracted from documents.
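As a rough illustration of what that XHTML view looks like to an application, here’s a sketch of our own using classes from Tika’s parser and SAX packages (AutoDetectParser and ToXMLContentHandler); chapter 5 discusses structured XHTML output in depth, and the filename here is made up.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

public class XhtmlViewSketch {
    public static void main(String[] args) throws Exception {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        InputStream stream = new FileInputStream("front-page.html");
        try {
            // Whatever the input format, the handler receives the same kind
            // of XHTML structure: <p>, <h1>, <a href="...">, and so on.
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        } finally {
            stream.close();
        }
        System.out.println(handler.toString());
    }
}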

1.1.4 Universal metadata

Metadata, or “data about data,” as it’s commonly defined, provides information that can aid in understanding documents independent of their media type. Metadata includes information that’s often pre-extracted, and stored either together with the particular file, or stored in some registry available externally to it (when the file has an entry associated with it in some external registry). Since metadata is almost always less voluminous than the data itself (by orders of magnitude in most cases), it’s a preferable asset in making decisions about what to do with files during analysis. The actionable information in metadata can range from the mundane (file size, location, checksum, data provider, original location, version) to the sophisticated (start/end data range, start/end time boundaries, algorithm used to process the data, and so forth), and the richness of the metadata is typically dictated by the media type and its choice of metadata model(s) that it employs.
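To ground this, here’s a small sketch of our own (using the parser and metadata classes that chapters 5 and 6 describe) that prints whatever metadata Tika can pull from a file alongside parsing it; the filename is made up.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataSketch {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        InputStream stream = new FileInputStream("sea-surface-temps.hdf");
        try {
            // The parser fills in the Metadata object as a side effect of parsing
            new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
        } finally {
            stream.close();
        }
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}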


    ...