Tika-In-Action.Pdf

Total Page:16

File Type:pdf, Size:1020Kb

Tika-In-Action.Pdf IN ACTION Chris A. Mattmann Jukka L. Zitting FOREWORD BY JÉRÔME CHARRON MANNING www.it-ebooks.info Tika in Action www.it-ebooks.info www.it-ebooks.info Tika in Action CHRIS A. MATTMANN JUKKA L. ZITTING MANNING SHELTER ISLAND www.it-ebooks.info For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department Manning Publications Co. 20 Baldwin Road PO Box 261 Shelter Island, NY 11964 Email: [email protected] ©2012 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps. Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine. Manning Publications Co. Development editor: Cynthia Kane 20 Baldwin Road Copyeditor: Benjamin Berg PO Box 261 Proofreader: Katie Tennant Shelter Island, NY 11964 Typesetter: Dottie Marsico Cover designer: Marija Tudor ISBN 9781935182856 Printed in the United States of America 1 2 3 4 5 6 7 8 9 10 – MAL – 16 15 14 13 12 11 www.it-ebooks.info To my lovely wife Lisa and my son Christian —CM To my lovely wife Kirsi-Marja and our happy cats —JZ www.it-ebooks.info www.it-ebooks.info brief contents PART 1 GETTING STARTED ........................................................1 1 ■ The case for the digital Babel fish 3 2 ■ Getting started with Tika 24 3 ■ The information landscape 38 PART 2 TIKA IN DETAIL ...........................................................53 4 ■ Document type detection 55 5 ■ Content extraction 73 6 ■ Understanding metadata 94 7 ■ Language detection 113 8 ■ What’s in a file? 123 PART 3 INTEGRATION AND ADVANCED USE .............................143 9 ■ The big picture 145 10 ■ Tika and the Lucene search stack 154 11 ■ Extending Tika 167 vii www.it-ebooks.info viii BRIEF CONTENTS PART 4 CASE STUDIES............................................................179 12 ■ Powering NASA science data systems 181 13 ■ Content management with Apache Jackrabbit 191 14 ■ Curating cancer research data with Tika 196 15 ■ The classic search engine example 204 www.it-ebooks.info contents foreword xv preface xvii acknowledgments xix about this book xxi about the authors xxv about the cover illustration xxvi PART 1 GETTING STARTED............................................ 1 The case for the digital Babel fish 3 1 1.1 Understanding digital documents 4 A taxonomy of file formats 5 ■ Parser libraries 6 Structured text as the universal language 9 ■ Universal metadata 10 ■ The program that understands everything 13 1.2 What is Apache Tika? 15 A bit of history 15 ■ Key design goals 17 ■ When and where to use Tika 21 1.3 Summary 22 Getting started with Tika 24 2 2.1 Working with Tika source code 25 Getting the source code 25 ■ The Maven build 26 Including Tika in Ant projects 26 ix www.it-ebooks.info x CONTENTS 2.2 The Tika application 27 Drag-and-drop text extraction: the Tika GUI 29 ■ Tika on the command line 30 2.3 Tika as an embedded library 32 Using the Tika facade 32 ■ Managing dependencies 34 2.4 Summary 36 The information landscape 38 3 3.1 Measuring information overload 40 Scale and growth 40 ■ Complexity 42 3.2 I’m feeling lucky—searching the information landscape 44 Just click it: the modern search engine 44 ■ Tika’s role in search 46 3.3 Beyond lucky: machine learning 47 Your likes and dislikes 48 ■ Real-world machine learning 50 3.4 Summary 52 PART 2 TIKA IN DETAIL................................................53 Document type detection 55 4 4.1 Internet media types 56 The parlance of media type names 58 ■ Categories of media types 58 ■ IANA and other type registries 60 4.2 Media types in Tika 60 The shared MIME-info database 61 ■ The MediaType class 62 The MediaTypeRegistry class 63 ■ Type hierarchies 64 4.3 File format diagnostics 65 Filename globs 66 ■ Content type hints 68 ■ Magic bytes 68 Character encodings 69 ■ Other mechanisms 70 4.4 Tika, the type inspector 71 4.5 Summary 72 Content extraction 73 5 5.1 Full-text extraction 74 Abstracting the parsing process 74 ■ Full-text indexing 75 Incremental parsing 77 www.it-ebooks.info CONTENTS xi 5.2 The Parser interface 78 Who knew parsing could be so easy? 78 ■ The parse() method 79 Parser implementations 80 ■ Parser selection 82 5.3 Document input stream 84 Standardizing input to Tika 84 ■ The TikaInputStream class 85 5.4 Structured XHTML output 87 Semantic structure of text 87 ■ Structured output via SAX events 88 ■ Marking up structure with XHTML 89 5.5 Context-sensitive parsing 91 Environment settings 91 ■ Custom document handling 92 5.6 Summary 93 Understanding metadata 94 6 6.1 The standards of metadata 96 Metadata models 96 ■ General metadata standards 99 Content-specific metadata standards 99 6.2 Metadata quality 101 Challenges/Problems 101 ■ Unifying heterogeneous standards 103 6.3 Metadata in Tika 104 Keys and multiple values 105 ■ Transformations and views 106 6.4 Practical uses of metadata 107 Common metadata for the Lucene indexer 108 ■ Give me my metadata in my schema! 109 6.5 Summary 111 Language detection 113 7 7.1 The most translated document in the world 114 7.2 Sounds Greek to me—theory of language detection 115 Language profiles 116 ■ Profiling algorithms 117 The N-gram algorithm 118 ■ Advanced profiling algorithms 119 7.3 Language detection in Tika 119 Incremental language detection 120 ■ Putting it all together 121 7.4 Summary 122 www.it-ebooks.info xii CONTENTS What’s in a file? 123 8 8.1 Types of content 124 HDF: a format for scientific data 125 ■ Really Simple Syndication: a format for rapidly changing content 126 8.2 How Tika extracts content 127 Organization of content 128 ■ File header and naming conventions 133 ■ Storage affects extraction 139 8.3 Summary 141 PART 3 INTEGRATION AND ADVANCED USE .................143 The big picture 145 9 9.1 Tika in search engines 146 The search use case 146 ■ The anatomy of a search index 146 9.2 Managing and mining information 147 Document management systems 148 ■ Text mining 149 9.3 Buzzword compliance 150 Modularity, Spring, and OSGi 150 ■ Large-scale computing 151 9.4 Summary 153 Tika and the Lucene search stack 154 10 10.1 Load-bearing walls 155 ManifoldCF 156 ■ Open Relevance 157 10.2 The steel frame 159 Lucene Core 159 ■ Solr 161 10.3 The finishing touches 162 Nutch 162 ■ Droids 164 ■ Mahout 165 10.4 Summary 166 Extending Tika 167 11 11.1 Adding type information 168 Custom media type configuration 169 www.it-ebooks.info CONTENTS xiii 11.2 Custom type detection 169 The Detector interface 170 ■ Building a custom type detector 170 ■ Plugging in new detectors 172 11.3 Customized parsing 172 Customizing existing parsers 173 ■ Writing a new parser 174 ■ Plugging in new parsers 175 Overriding existing parsers 176 11.4 Summary 176 PART 4 CASE STUDIES ................................................179 Powering NASA science data systems 181 12 12.1 NASA’s Planetary Data System 182 PDS data model 182 ■ The PDS search redesign 184 12.2 NASA’s Earth Science Enterprise 186 Leveraging Tika in NASA Earth Science SIPS 187 Using Tika within the ground data systems 188 12.3 Summary 190 Content management with Apache Jackrabbit 191 13 13.1 Introducing Apache Jackrabbit 192 13.2 The text extraction pool 192 13.3 Content-aware WebDAV 194 13.4 Summary 195 Curating cancer research data with Tika 196 14 14.1 The NCI Early Detection Research Network 197 The EDRN data model 197 ■ Scientific data curation 198 14.2 Integrating Tika 198 Metadata extraction 199 ■ MIME type identification and classification 201 14.3 Summary 203 www.it-ebooks.info xiv CONTENTS The classic search engine example 204 15 15.1 The Public Terabyte Dataset Project 205 15.2 The Bixo web crawler 206 Parsing fetched documents 207 ■ Validating Tika’s charset detection 209 15.3 Summary 210 appendix A Tika quick reference 211 appendix B Supported metadata keys 214 index 219 www.it-ebooks.info foreword I’m a big fan of search engines and Java, so early in the year 2004 I was looking for a good Java-based open source project on search engines. I quickly discovered Nutch. Nutch is an open source search engine project from the Apache Software Foundation. It was initiated by Doug Cutting, the well-known father of Lucene. With my new toy on my laptop, I tested and tried to evaluate it. Even if Nutch was in its early stages, it was a promising project—exactly what I was looking for. I pro- posed my first patches to Nutch relating to language identification in early 2005. Then, in the middle of 2005 I become a Nutch committer and increased my number of contributions relating to language identification, content-type guessing, and docu- ment analysis. Looking more deeply at Lucene, I discovered a wide set of projects around it: Nutch, Solr, and what would eventually become Mahout.
Recommended publications
  • Security Log Analysis Using Hadoop Harikrishna Annangi Harikrishna Annangi, [email protected]
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by St. Cloud State University St. Cloud State University theRepository at St. Cloud State Culminating Projects in Information Assurance Department of Information Systems 3-2017 Security Log Analysis Using Hadoop Harikrishna Annangi Harikrishna Annangi, [email protected] Follow this and additional works at: https://repository.stcloudstate.edu/msia_etds Recommended Citation Annangi, Harikrishna, "Security Log Analysis Using Hadoop" (2017). Culminating Projects in Information Assurance. 19. https://repository.stcloudstate.edu/msia_etds/19 This Starred Paper is brought to you for free and open access by the Department of Information Systems at theRepository at St. Cloud State. It has been accepted for inclusion in Culminating Projects in Information Assurance by an authorized administrator of theRepository at St. Cloud State. For more information, please contact [email protected]. Security Log Analysis Using Hadoop by Harikrishna Annangi A Starred Paper Submitted to the Graduate Faculty of St. Cloud State University in Partial Fulfillment of the Requirements for the Degree of Master of Science in Information Assurance April, 2016 Starred Paper Committee: Dr. Dennis Guster, Chairperson Dr. Susantha Herath Dr. Sneh Kalia 2 Abstract Hadoop is used as a general-purpose storage and analysis platform for big data by industries. Commercial Hadoop support is available from large enterprises, like EMC, IBM, Microsoft and Oracle and Hadoop companies like Cloudera, Hortonworks, and Map Reduce. Hadoop is a scheme written in Java that allows distributed processes of large data sets across clusters of computers using programming models. A Hadoop frame work application works in an environment that provides storage and computation across clusters of computers.
    [Show full text]
  • About:Config .Init About:Me About:Presentation Web 2.0 User
    about:config .init about:me about:presentation Web 2.0 User View Technical View Simple Overview Picture Complex Overview Picture Main Problems Statistics I Statistics II .next Targets Of Attack Targets Methods Kinds Of Session Hijacking SQL Injection Introduction Examples SQL Injection Picture Analysis SQL Escaping SQL Escaping #2 SQL Parameter Binding XSS Introduction What Can It Do? Main Problem Types Of XSS Components Involved In XSS Reflected XSS Picture Reflected XSS Analysis (Server Side) Stored XSS Picture Server Side Stored XSS (Local) DOM XSS Example Picture local DOM XSS CSRF Introduction Example Picture CSRF Session Riding Analysis Complex Example Hijack Via DNS + XSS Picture DNS+XSS Combo Cookie Policy Analysis Variant Components .next Misplaced Trust 3rd Party Script Picture Trust 3rd Party Script Analysis Misplaced Trust In Middleware Misplaced Trust In Server­Local Data Picture Local Scripts Analysis Same Origin Policy Frame Policy UI Redressing Introduction Clickjacking Picture Clickjacking Analysis BREAK .next Summary of Defense Strategies "Best Effort" vs. "Best Security" Protection against Hijacking Session Theft Riding, Fixation, Prediction Separate by Trust Validation Why Input Validation at Server Check Origin and Target of Request Validation of Form Fields Validation of File Upload Validation Before Forwarding Validation of Server Output Validation of Target in Client Validation of Origin in Client Validation of Input in Client Normalization What's That? Normalizing HTML Normalizing XHTML Normalizing Image, Audio,
    [Show full text]
  • Building a Scalable Index and a Web Search Engine for Music on the Internet Using Open Source Software
    Department of Information Science and Technology Building a Scalable Index and a Web Search Engine for Music on the Internet using Open Source software André Parreira Ricardo Thesis submitted in partial fulfillment of the requirements for the degree of Master in Computer Science and Business Management Advisor: Professor Carlos Serrão, Assistant Professor, ISCTE-IUL September, 2010 Acknowledgments I should say that I feel grateful for doing a thesis linked to music, an art which I love and esteem so much. Therefore, I would like to take a moment to thank all the persons who made my accomplishment possible and hence this is also part of their deed too. To my family, first for having instigated in me the curiosity to read, to know, to think and go further. And secondly for allowing me to continue my studies, providing the environment and the financial means to make it possible. To my classmate André Guerreiro, I would like to thank the invaluable brainstorming, the patience and the help through our college years. To my friend Isabel Silva, who gave me a precious help in the final revision of this document. Everyone in ADETTI-IUL for the time and the attention they gave me. Especially the people over Caixa Mágica, because I truly value the expertise transmitted, which was useful to my thesis and I am sure will also help me during my professional course. To my teacher and MSc. advisor, Professor Carlos Serrão, for embracing my will to master in this area and for being always available to help me when I needed some advice.
    [Show full text]
  • Natural Language Processing Technique for Information Extraction and Analysis
    International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 2, Issue 8, August 2015, PP 32-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Natural Language Processing Technique for Information Extraction and Analysis T. Sri Sravya1, T. Sudha2, M. Soumya Harika3 1 M.Tech (C.S.E) Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] 2 Head (I/C) of C.S.E & IT Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] 3 M. Tech C.S.E, Assistant Professor, Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] Abstract: In the current internet era, there are a large number of systems and sensors which generate data continuously and inform users about their status and the status of devices and surroundings they monitor. Examples include web cameras at traffic intersections, key government installations etc., seismic activity measurement sensors, tsunami early warning systems and many others. A Natural Language Processing based activity, the current project is aimed at extracting entities from data collected from various sources such as social media, internet news articles and other websites and integrating this data into contextual information, providing visualization of this data on a map and further performing co-reference analysis to establish linkage amongst the entities. Keywords: Apache Nutch, Solr, crawling, indexing 1. INTRODUCTION In today’s harsh global business arena, the pace of events has increased rapidly, with technological innovations occurring at ever-increasing speed and considerably shorter life cycles.
    [Show full text]
  • Database Globalization Support Guide
    Oracle® Database Database Globalization Support Guide 19c E96349-05 May 2021 Oracle Database Database Globalization Support Guide, 19c E96349-05 Copyright © 2007, 2021, Oracle and/or its affiliates. Primary Author: Rajesh Bhatiya Contributors: Dan Chiba, Winson Chu, Claire Ho, Gary Hua, Simon Law, Geoff Lee, Peter Linsley, Qianrong Ma, Keni Matsuda, Meghna Mehta, Valarie Moore, Cathy Shea, Shige Takeda, Linus Tanaka, Makoto Tozawa, Barry Trute, Ying Wu, Peter Wallack, Chao Wang, Huaqing Wang, Sergiusz Wolicki, Simon Wong, Michael Yau, Jianping Yang, Qin Yu, Tim Yu, Weiran Zhang, Yan Zhu This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs (including any operating system, integrated software, any programs embedded, installed or activated on delivered hardware, and modifications of such programs) and Oracle computer documentation or other Oracle data delivered to or accessed by U.S.
    [Show full text]
  • Study of the Utility of Text Classification Based Software
    Study of the Utility of Text Classification Based Software Architecture Recovery Method RELAX for Maintenance Daniel Link Kamonphop Srisopha Barry Boehm University of Southern California University of Southern California University of Southern California Los Angeles, California, USA Los Angeles, California, USA Los Angeles, California, USA [email protected] [email protected] [email protected] ABSTRACT ACM Reference Format: Background. The software architecture recovery method RELAX Daniel Link, Kamonphop Srisopha, and Barry Boehm. 2021. Study of the Utility of Text Classification Based Software Architecture Recovery Method produces a concern-based architectural view of a software sys- RELAX for Maintenance. In ACM / IEEE International Symposium on Em- tem graphically and textually from that system’s source code. The pirical Software Engineering and Measurement (ESEM) (ESEM ’21), Octo- method has been implemented in software which can recover the ber 11–15, 2021, Bari, Italy. ACM, New York, NY, USA, 6 pages. https: architecture of systems whose source code is written in Java. //doi.org/10.1145/3475716.3484194 Aims. Our aim was to find out whether the availability of archi- tectural views produced by RELAX can help maintainers who are 1 INTRODUCTION new to a project in becoming productive with development tasks While several definitions of what a software architecture is exist sooner, and how they felt about working in such an environment. [11], e.g., the set of design decisions about a software system [13], Method. We conducted a user study with nine participants. They they all refer to the structure of a software system and the reasoning were subjected to a controlled experiment in which maintenance process that led to that structure.
    [Show full text]
  • Return of Organization Exempt from Income
    OMB No. 1545-0047 Return of Organization Exempt From Income Tax Form 990 Under section 501(c), 527, or 4947(a)(1) of the Internal Revenue Code (except black lung benefit trust or private foundation) Open to Public Department of the Treasury Internal Revenue Service The organization may have to use a copy of this return to satisfy state reporting requirements. Inspection A For the 2011 calendar year, or tax year beginning 5/1/2011 , and ending 4/30/2012 B Check if applicable: C Name of organization The Apache Software Foundation D Employer identification number Address change Doing Business As 47-0825376 Name change Number and street (or P.O. box if mail is not delivered to street address) Room/suite E Telephone number Initial return 1901 Munsey Drive (909) 374-9776 Terminated City or town, state or country, and ZIP + 4 Amended return Forest Hill MD 21050-2747 G Gross receipts $ 554,439 Application pending F Name and address of principal officer: H(a) Is this a group return for affiliates? Yes X No Jim Jagielski 1901 Munsey Drive, Forest Hill, MD 21050-2747 H(b) Are all affiliates included? Yes No I Tax-exempt status: X 501(c)(3) 501(c) ( ) (insert no.) 4947(a)(1) or 527 If "No," attach a list. (see instructions) J Website: http://www.apache.org/ H(c) Group exemption number K Form of organization: X Corporation Trust Association Other L Year of formation: 1999 M State of legal domicile: MD Part I Summary 1 Briefly describe the organization's mission or most significant activities: to provide open source software to the public that we sponsor free of charge 2 Check this box if the organization discontinued its operations or disposed of more than 25% of its net assets.
    [Show full text]
  • Bae Systems Information and Electronic Systems Integration Inc
    BAE SYSTEMS INFORMATION AND ELECTRONIC SYSTEMS INTEGRATION INC. GEOSPATIAL EXPLOITATION PRODUCTS® PLATFORM SOFTWARE LICENSE AGREEMENT THIS SOFTWARE LICENSE AGREEMENT (“AGREEMENT”) APPLIES TO ANY SOFTWARE PRODUCT(S) THAT MAY BE PROVIDED BY BAE SYSTEMS INFORMATION AND ELECTRONIC SYSTEMS INTEGRATION INC. (“LICENSOR”) TO YOU (“LICENSEE”), INCLUDING BUT NOT LIMITED TO GXP XPLORER®, SOCET GXP®, SOCET SET®, GXP WEBVIEW®, GXP INMOTION™, GXP INMOTION™ SERVER, GXP JPIP SERVER, GXP OPSVIEW™ AND IF APPLICABLE (AS SPECIFICALLY IDENTIFIED IN THE APPLICABLE ORDERING DOCUMENT OR CONTRACT) THE GXP XPLORER SERVER TO DIB CONNECTOR (EACH SEPARATELY REFERRED TO BELOW AS “GXP SOFTWARE”). READ THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE (1) OPENING THE PACKAGE OR DOWNLOADING THE FILE CONTAINING THE GXP SOFTWARE, OR (2) CLICKING THE “I ACCEPT” BUTTON. THE GXP SOFTWARE AND THE ACCOMPANYING USER DOCUMENTATION (EACH REFERRED TO AS THE “PROGRAM”) ARE COPYRIGHTED AND LICENSED - NOT SOLD. BY OPENING THE SOFTWARE PACKAGE, OR CLICKING “I ACCEPT”, YOU ARE ACCEPTING AND AGREEING TO THE TERMS OF THIS AGREEMENT. IF YOU DO NOT ACCEPT THE TERMS AND CONDITIONS OF THIS AGREEMENT PROMPTLY RETURN THE UNOPENED PACKAGE TO THE PARTY FROM WHOM IT WAS ACQUIRED, CANCEL THE DOWNLOAD OR CANCEL THE INSTALLATION. IF YOU ARE A UNITED STATES (“U.S”) GOVERNMENT CUSTOMER, ACCEPTANCE OF THESE TERMS ARE EFFECTUATED BY ACCEPTANCE OF A PROPOSAL, QUOTE, OR OTHER ORDERING DOCUMENT OR CONTRACT INCORPORATING THIS AGREEMENT BY REFERENCE OR OTHERWISE OR BY CONTRACTING OFFICER EXECUTION OF THIS AGREEMENT. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT CONCERNING THE LICENSING OF THE PROGRAM BETWEEN LICENSEE AND LICENSOR, AND IT SUPERSEDES AND REPLACES IN ITS ENTIRETY ANY PRIOR PROPOSAL, REPRESENTATION, OR UNDERSTANDING BETWEEN THE PARTIES.
    [Show full text]
  • An Unobtrusive, Scalable Approach to Large Scale Software License Analysis
    DRAT: An Unobtrusive, Scalable Approach to Large Scale Software License Analysis Chris A. Mattmann1,2, Ji-Hyun Oh1,2, Tyler Palsulich1*, Lewis John McGibbney1, Yolanda Gil2,3, Varun Ratnakar3 1Jet Propulsion Laboratory 2Computer Science Department 3USC Information Sciences Institute California Institute of Technology University of Southern California University of Southern California Pasadena, CA 91109 USA Los Angeles, CA 90089 USA Marina Del Rey, CA [email protected] {mattmann,jihyuno}@usc.edu {gil,varunr}@ isi.edu Abstract— The Apache Release Audit Tool (RAT) performs (OSI) for complying with open source software open source license auditing and checking, however definition, however, there exist slight differences among these RAT fails to successfully audit today's large code bases. Being a licenses [2]. For instance, GPL is a “copyleft” license that natural language processing (NLP) tool and a crawler, RAT only allows derivative works under the original license, marches through a code base, but uses rudimentary black lists whereas MIT license is a “permissive” license that grants the and white lists to navigate source code repositories, and often does a poor job of identifying source code versus binary files. In right to sublicense the code under any kind of license [2]. This addition RAT produces no incremental output and thus on code difference could affect architectural design of the software. bases that themselves are "Big Data", RAT could run for e.g., a Furthermore, circumstances are more complicated when month and still not provide any status report. We introduce people publish software under the multiple licenses. Distributed "RAT" or the Distributed Release Audit Tool Therefore, an automated tool for verifying software licenses in (DRAT).
    [Show full text]
  • Building a Search Engine for the Cuban Web
    Building a Search Engine for the Cuban Web Jorge Luis Betancourt Search/Crawl Engineer NOVEMBER 16-18, 2016 • SEVILLE, SPAIN Who am I 01 Jorge Luis Betancourt González Search/Crawl Engineer Apache Nutch Committer & PMC Apache Solr/ES enthusiast 2 Agenda • Introduction & motivation • Technologies used • Customizations • Conclusions and future work 3 Introduction / Motivation Cuba Internet Intranet Global search engines can’t access documents hosted the Cuban Intranet 4 Writing your own web search engine from scratch? or … 5 Common search engine features 1 Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2 Image search (size, format, color, objects) • thumbnails • show metadata • filters (facets) • match text with images 3 News search (alerting, notifications) • near real time • email, push, SMS 6 How to fulfill these requirements? store query At the core a search engine: stores some information a retrieve this information when a question is received 7 Open Source to the rescue … crawler 1 Index Server 2 web interface 3 8 Apache Nutch Nutch is a well matured, production ready “ Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing. 9 Apache Nutch • Highly scalable • Highly extensible • Pluggable parsing protocols, storage, indexing, scoring, • Active community • Apache License 10 Apache Solr TOTAL DOWNLOADS 8M+ MONTHLY DOWNLOADS 250,000+ • Apache License • Great community • Highly modular • Stability / Scalability
    [Show full text]
  • Plain Text & Character Encoding
    Journal of eScience Librarianship Volume 10 Issue 3 Data Curation in Practice Article 12 2021-08-11 Plain Text & Character Encoding: A Primer for Data Curators Seth Erickson Pennsylvania State University Let us know how access to this document benefits ou.y Follow this and additional works at: https://escholarship.umassmed.edu/jeslib Part of the Scholarly Communication Commons, and the Scholarly Publishing Commons Repository Citation Erickson S. Plain Text & Character Encoding: A Primer for Data Curators. Journal of eScience Librarianship 2021;10(3): e1211. https://doi.org/10.7191/jeslib.2021.1211. Retrieved from https://escholarship.umassmed.edu/jeslib/vol10/iss3/12 Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 License. This material is brought to you by eScholarship@UMMS. It has been accepted for inclusion in Journal of eScience Librarianship by an authorized administrator of eScholarship@UMMS. For more information, please contact [email protected]. ISSN 2161-3974 JeSLIB 2021; 10(3): e1211 https://doi.org/10.7191/jeslib.2021.1211 Full-Length Paper Plain Text & Character Encoding: A Primer for Data Curators Seth Erickson The Pennsylvania State University, University Park, PA, USA Abstract Plain text data consists of a sequence of encoded characters or “code points” from a given standard such as the Unicode Standard. Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards. Plain text representations of digital data are often preferred because plain text formats are relatively stable, and they facilitate reuse and interoperability.
    [Show full text]
  • Avaliando a Dívida Técnica Em Produtos De Código Aberto Por Meio De Estudos Experimentais
    UNIVERSIDADE FEDERAL DE GOIÁS INSTITUTO DE INFORMÁTICA IGOR RODRIGUES VIEIRA Avaliando a dívida técnica em produtos de código aberto por meio de estudos experimentais Goiânia 2014 IGOR RODRIGUES VIEIRA Avaliando a dívida técnica em produtos de código aberto por meio de estudos experimentais Dissertação apresentada ao Programa de Pós–Graduação do Instituto de Informática da Universidade Federal de Goiás, como requisito parcial para obtenção do título de Mestre em Ciência da Computação. Área de concentração: Ciência da Computação. Orientador: Prof. Dr. Auri Marcelo Rizzo Vincenzi Goiânia 2014 Ficha catalográfica elaborada automaticamente com os dados fornecidos pelo(a) autor(a), sob orientação do Sibi/UFG. Vieira, Igor Rodrigues Avaliando a dívida técnica em produtos de código aberto por meio de estudos experimentais [manuscrito] / Igor Rodrigues Vieira. - 2014. 100 f.: il. Orientador: Prof. Dr. Auri Marcelo Rizzo Vincenzi. Dissertação (Mestrado) - Universidade Federal de Goiás, Instituto de Informática (INF) , Programa de Pós-Graduação em Ciência da Computação, Goiânia, 2014. Bibliografia. Apêndice. Inclui algoritmos, lista de figuras, lista de tabelas. 1. Dívida técnica. 2. Qualidade de software. 3. Análise estática. 4. Produto de código aberto. 5. Estudo experimental. I. Vincenzi, Auri Marcelo Rizzo, orient. II. Título. Todos os direitos reservados. É proibida a reprodução total ou parcial do trabalho sem autorização da universidade, do autor e do orientador(a). Igor Rodrigues Vieira Graduado em Sistemas de Informação, pela Universidade Estadual de Goiás – UEG, com pós-graduação lato sensu em Desenvolvimento de Aplicações Web com Interfaces Ricas, pela Universidade Federal de Goiás – UFG. Foi Coordenador da Ouvidoria da UFG e, atualmente, é Analista de Tecnologia da Informação do Centro de Recursos Computacionais – CERCOMP/UFG.
    [Show full text]