Hortonworks Data Platform Apache Solr Search Installation (August 16, 2019)

Total Page:16

File Type:pdf, Size:1020Kb

Hortonworks Data Platform Apache Solr Search Installation (August 16, 2019) Hortonworks Data Platform Apache Solr Search Installation (August 16, 2019) docs.cloudera.com Hortonworks Data Platform August 16, 2019 Hortonworks Data Platform: Apache Solr Search Installation Copyright © 2012-2019 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process and installation and configuration tools have also been included. Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs. Except where otherwise noted, this document is licensed under Creative Commons Attribution ShareAlike 4.0 License. http://creativecommons.org/licenses/by-sa/4.0/legalcode ii Hortonworks Data Platform August 16, 2019 Table of Contents 1. Introduction ................................................................................................................. 1 2. HDP Search 5.0 ............................................................................................................ 2 2.1. HDP Search 5.0 Release Notes ........................................................................... 2 2.1.1. HDP Search 5.0 New Features ................................................................ 2 2.1.2. HDP Search 5.0 Behavioral Changes ........................................................ 2 2.1.3. HDP Search 5.0 Apache Solr Version Information .................................... 2 2.1.4. HDP Search 5.0 Known Issues ................................................................. 4 2.2. HDP Search 5.0 - Getting Ready ........................................................................ 5 2.2.1. HDP Search 5.0 Minimum System Requirements ..................................... 5 2.2.2. Using a Local Repository ........................................................................ 6 2.3. Installing HDP Search 5.0 Management Pack ..................................................... 8 2.4. Installing HDP Search 5.0 Manually ................................................................... 9 3. HDP Search 4.0 .......................................................................................................... 10 3.1. HDP Search 4.0 Release Notes ......................................................................... 10 3.1.1. HDP Search 4.0 New Features .............................................................. 10 3.1.2. HDP Search 4.0 Behavioral Changes ...................................................... 10 3.1.3. HDP Search 4.0 Apache Solr Version Information .................................. 11 3.1.4. HDP Search 4.0 Known Issues ............................................................... 11 3.2. HDP Search 4.0 - Getting Ready ...................................................................... 11 3.2.1. HDP Search 4.0 Minimum System Requirements .................................... 12 3.2.2. Using a Local Repository ....................................................................... 12 3.3. Installing HDP Search 4.0 Management Pack ................................................... 15 3.4. Installing HDP Search 4.0 Manually .................................................................. 16 4. HDP Search 3.0 .......................................................................................................... 17 4.1. HDP Search 3.0 Release Notes ......................................................................... 17 4.1.1. HDP Search 3.0 New Features .............................................................. 17 4.1.2. HDP Search 3.0 Behavioral Changes ...................................................... 17 4.1.3. HDP Search 3.0 Apache Solr Version Information .................................. 18 4.1.4. HDP Search 3.0 Known Issues ............................................................... 18 4.2. HDP Search 3.0 - Getting Ready ...................................................................... 18 4.2.1. HDP Search 3.0 Minimum System Requirements .................................... 18 4.2.2. Using a Local Repository ....................................................................... 19 4.3. Installing HDP Search 3.0 Management Pack ................................................... 22 4.4. Installing HDP Search 3.0 Manually .................................................................. 23 5. Upgrading HDP Search ............................................................................................... 26 6. Applying Minor Upgrades to HDP Search ................................................................... 27 iii Hortonworks Data Platform August 16, 2019 List of Tables 3.1. HDP Search ............................................................................................................. 10 3.2. HDP Search ............................................................................................................. 10 4.1. HDP Search ............................................................................................................. 17 4.2. HDP Search ............................................................................................................. 17 iv Hortonworks Data Platform August 16, 2019 1. Introduction HDP Search is a full-text search server, designed for enterprise-level performance, flexibility, scalability, and fault-tolerance. HDP Search exposes REST-like HTTP/XML and JSON APIs for use with a wide range of programming languages. HDP Search includes: • Apache Solr 7.4.0 (for HDP Search 5.0) • Banana 1.6.12 • JARs for integration with Hadoop, Spark, Hive and Pig. Note HDP Search is a separate product, not packaged with the HDP platform. The high-level steps for using HDP search are as follows: 1. Install and deploy HDP Search, either manually or by using Ambari. 2. Ingest documents from sources such as HDFS. 3. Index the data. Documents, and updates to documents, will be available for search almost immediately after being indexed. 4. Perform a wide range of basic and advanced operations on the indexed documents. Resources: • This document describes software requirements for HDP Search, followed by installation instructions for specific operating systems. • For information about using Solr, see the Solr Reference Guide and the Solr tutorial. 1 Hortonworks Data Platform August 16, 2019 2. HDP Search 5.0 2.1. HDP Search 5.0 Release Notes The HDP Search 5.0 Release Notes summarize and describe the following information released in HDP Search 5.0: • HDP Search 5.0 New Features [2] • HDP Search 5.0 Behavioral Changes [2] • HDP Search 5.0 Apache Solr Version Information [2] • HDP Search 5.0 Known Issues [4] Note HDP Search is a separate product and does NOT come with the HDP platform. 2.1.1. HDP Search 5.0 New Features HDP 5.0 includes no significant new features. 2.1.2. HDP Search 5.0 Behavioral Changes HDP Search 5.0 introduces no significant behavioral changes. 2.1.3. HDP Search 5.0 Apache Solr Version Information HDP Search 5.0 is based on Apache Solr 7.4 (Apache Solr Release Notes). The following patches are applied on top of Apache Solr 7.4. Apache Patch Link to the Jira SOLR-12343 https://issues.apache.org/jira/browse/SOLR-12343 JSON Field Facet refinement can return incorrect counts/stats for sorted buckets. SOLR-12516 https://issues.apache.org/jira/browse/SOLR-12516 JSON range facets can incorrectly refine subfacets for buckets SOLR-12541 https://issues.apache.org/jira/browse/SOLR-12541 Metrics handler throws an error if there are transient cores. SOLR-12594 https://issues.apache.org/jira/browse/SOLR-12594 MetricsHistoryHandler.getOverseerLeader fails when hostname contains hyphen. SOLR-12683 https://issues.apache.org/jira/browse/SOLR-12683 HashQuery will throw an exception if more than 4 partitionKeys is specified. SOLR-12704 https://issues.apache.org/jira/browse/SOLR-12704 Guard AddSchemaFieldsUpdateProcessorFactory against null field names and field values. SOLR-12750 https://issues.apache.org/jira/browse/SOLR-12750 Migrate API should lock the collection instead of shared. 2 Hortonworks Data Platform August 16, 2019 Apache Patch Link to the Jira SOLR-12765 https://issues.apache.org/jira/browse/SOLR-12765 Incorrect format of JMX cache stats. SOLR-12836 https://issues.apache.org/jira/browse/SOLR-12836 ZkController creates a cloud solr client with no connection or read timeouts. SOLR-12615 https://issues.apache.org/jira/browse/SOLR-12615 HashQParserPlugin won't throw an NPE for string hash key and documents with empty
Recommended publications
  • Enterprise Search Technology Using Solr and Cloud Padmavathy Ravikumar Governors State University
    Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Enterprise Search Technology Using Solr and Cloud Padmavathy Ravikumar Governors State University Follow this and additional works at: http://opus.govst.edu/capstones Part of the Databases and Information Systems Commons Recommended Citation Ravikumar, Padmavathy, "Enterprise Search Technology Using Solr and Cloud" (2015). All Capstone Projects. 91. http://opus.govst.edu/capstones/91 For more information about the academic degree, extended learning, and certificate programs of Governors State University, go to http://www.govst.edu/Academics/Degree_Programs_and_Certifications/ Visit the Governors State Computer Science Department This Project Summary is brought to you for free and open access by the Student Capstone Projects at OPUS Open Portal to University Scholarship. It has been accepted for inclusion in All Capstone Projects by an authorized administrator of OPUS Open Portal to University Scholarship. For more information, please contact [email protected]. ENTERPRISE SEARCH TECHNOLOGY USING SOLR AND CLOUD By Padmavathy Ravikumar Masters Project Submitted in partial fulfillment of the requirements For the Degree of Master of Science, With a Major in Computer Science Governors State University University Park, IL 60484 Fall 2014 ENTERPRISE SEARCH TECHNOLOGY USING SOLR AND CLOUD 2 Abstract Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database in9tegration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more.
    [Show full text]
  • Splitting the Load How Separating Compute from Storage Can Transform the Flexibility, Scalability and Maintainability of Big Data Analytics Platforms
    IBM Analytics Engine White paper Splitting the load How separating compute from storage can transform the flexibility, scalability and maintainability of big data analytics platforms 2 Splitting the load Contents Executive summary 2 Executive summary Hadoop is the dominant big data processing system in use today. It is, however, a technology that has been around for 10 3 Challenges of Hadoop design years, and the world of big data has changed dramatically over 4 Limitations of traditional Hadoop clusters that time. 5 How cloud has changed the game 5 Introducing IBM Analytics Engine Hadoop started with a specific focus – a bunch of engineers 6 Overcoming the limitations wanted a way to store and analyze copious amounts of web logs. They knew how to write Java and how to set 7 Exploring the IBM Analytics Engine architecture up infrastructure, and they were hands-on with systems 8 Use cases for IBM Analytics Engine programming. All they really needed was a cost-effective file 10 Benefits of migrating to IBM Analytics Engine system (HDFS) and an execution paradigm (MapReduce)—the 11 Conclusion rest, they could code for themselves. Companies like Google, 11 About the author Facebook and Yahoo built many products and business models using just these two pieces of technology. 11 For more information Today, however, we’re seeing a big shift in the way big data applications are being programmed and deployed in production. Many different user personas, from data scientists and data engineers to business analysts and app developers need access to data. Each of these personas needs to access the data through a different tool and on a different schedule.
    [Show full text]
  • Final Report CS 5604: Information Storage and Retrieval
    Final Report CS 5604: Information Storage and Retrieval Solr Team Abhinav Kumar, Anand Bangad, Jeff Robertson, Mohit Garg, Shreyas Ramesh, Siyu Mi, Xinyue Wang, Yu Wang January 16, 2018 Instructed by Professor Edward A. Fox Virginia Polytechnic Institute and State University Blacksburg, VA 24061 1 Abstract The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects [6]. We are using a 21 node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data beyond what previously had been collected. Another important part in this project is optimizing the current system to analyze and allow access to the new data. To accomplish these goals, this project is separated into 6 parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. This report describes the work completed by the SOLR team which improves the current searching and storage system. We include the general architecture and an overview of the current system. We present the part that Solr plays within the whole system with more detail. We talk about our goals, procedures, and conclusions on the improvements we made to the current Solr system. This report also describes how we coordinate with other teams to accomplish the project at a higher level. Additionally, we provide manuals for future readers who might need to replicate our experiments.
    [Show full text]
  • Apache Hadoop Today & Tomorrow
    Apache Hadoop Today & Tomorrow Eric Baldeschwieler, CEO Hortonworks, Inc. twitter: @jeric14 (@hortonworks) www.hortonworks.com © Hortonworks, Inc. All Rights Reserved. Agenda Brief Overview of Apache Hadoop Where Apache Hadoop is Used Apache Hadoop Core Hadoop Distributed File System (HDFS) Map/Reduce Where Apache Hadoop Is Going Q&A © Hortonworks, Inc. All Rights Reserved. 2 What is Apache Hadoop? A set of open source projects owned by the Apache Foundation that transforms commodity computers and network into a distributed service •HDFS – Stores petabytes of data reliably •Map-Reduce – Allows huge distributed computations Key Attributes •Reliable and Redundant – Doesn’t slow down or loose data even as hardware fails •Simple and Flexible APIs – Our rocket scientists use it directly! •Very powerful – Harnesses huge clusters, supports best of breed analytics •Batch processing centric – Hence its great simplicity and speed, not a fit for all use cases © Hortonworks, Inc. All Rights Reserved. 3 What is it used for? Internet scale data Web logs – Years of logs at many TB/day Web Search – All the web pages on earth Social data – All message traffic on facebook Cutting edge analytics Machine learning, data mining… Enterprise apps Network instrumentation, Mobil logs Video and Audio processing Text mining And lots more! © Hortonworks, Inc. All Rights Reserved. 4 Apache Hadoop Projects Programming Pig Hive (Data Flow) (SQL) Languages MapReduce Computation (Distributed Programing Framework) HMS (Management) HBase (Coordination) Zookeeper Zookeeper HCatalog Table Storage (Meta Data) (Columnar Storage) HDFS Object Storage (Hadoop Distributed File System) Core Apache Hadoop Related Apache Projects © Hortonworks, Inc. All Rights Reserved. 5 Where Hadoop is Used © Hortonworks, Inc.
    [Show full text]
  • Hot Technologies” Within the O*NET® System
    Identification of “Hot Technologies” within the O*NET® System Phil Lewis National Center for O*NET Development Jennifer Norton North Carolina State University Prepared for U.S. Department of Labor Employment and Training Administration Office of Workforce Investment Division of National Programs, Tools, & Technical Assistance Washington, DC April 4, 2016 www.onetcenter.org National Center for O*NET Development, Post Office Box 27625, Raleigh, NC 27611 Table of Contents Background ......................................................................................................................... 2 Hot Technologies Identification Procedure ...................................................................... 3 Mine data to collect the top technology related terms ................................................ 3 Convert the data-mined technology terms into O*NET technologies ......................... 3 Organize the hot technologies within the O*NET Tools & Technology Taxonomy ..... 4 Link the hot technologies to O*NET-SOC occupations .............................................. 4 Determine the display of occupations linked to a hot technology ............................... 4 Summary ............................................................................................................................. 5 Figure 1: O*NET Hot Technology Icon .............................................................................. 6 Appendix A: Hot Technologies Identified During the Initial Implementation ................ 7 National Center
    [Show full text]
  • Big Business Value from Big Data and Hadoop
    Big Business Value from Big Data and Hadoop © Hortonworks Inc. 2012 Page 1 Topics • The Big Data Explosion: Hype or Reality • Introduction to Apache Hadoop • The Business Case for Big Data • Hortonworks Overview & Product Demo Page 2 © Hortonworks Inc. 2012 Big Data: Hype or Reality? © Hortonworks Inc. 2012 Page 3 What is Big Data? What is Big Data? Page 4 © Hortonworks Inc. 2012 Big Data: Changing The Game for Organizations Transactions + Interactions Mobile Web + Observations Petabytes BIG DATA Sentiment SMS/MMS = BIG DATA User Click Stream Speech to Text Social Interactions & Feeds Terabytes WEB Web logs Spatial & GPS Coordinates A/B testing Sensors / RFID / Devices Behavioral Targeting Gigabytes CRM Business Data Feeds Dynamic Pricing Segmentation External Demographics Search Marketing ERP Customer Touches User Generated Content Megabytes Affiliate Networks Purchase detail Support Contacts HD Video, Audio, Images Dynamic Funnels Purchase record Product/Service Logs Payment record Offer details Offer history Increasing Data Variety and Complexity Page 5 © Hortonworks Inc. 2012 Next Generation Data Platform Drivers Organizations will need to become more data driven to compete Business • Enable new business models & drive faster growth (20%+) Drivers • Find insights for competitive advantage & optimal returns Technical • Data continues to grow exponentially • Data is increasingly everywhere and in many formats Drivers • Legacy solutions unfit for new requirements growth Financial • Cost of data systems, as % of IT spend, continues to
    [Show full text]
  • Apache Lucene - a Library Retrieving Data for Millions of Users
    Apache Lucene - a library retrieving data for millions of users Simon Willnauer Apache Lucene Core Committer & PMC Chair [email protected] / [email protected] Friday, October 14, 2011 About me? • Lucene Core Committer • Project Management Committee Chair (PMC) • Apache Member • BerlinBuzzwords Co-Founder • Addicted to OpenSource 2 Friday, October 14, 2011 Apache Lucene - a library retrieving data for .... Agenda ‣ Apache Lucene a historical introduction ‣ (Small) Features Overview ‣ The Lucene Eco-System ‣ Upcoming features in Lucene 4.0 ‣ Maintaining superior quality in Lucene (backup slides) ‣ Questions 3 Friday, October 14, 2011 Apache Lucene - a brief introduction • A fulltext search library entirely written in Java • An ASF Project since 2001 (happy birthday Lucene) • Founded by Doug Cutting • Grown up - being the de-facto standard in OpenSource search • Starting point for a other well known projects • Apache 2.0 License 4 Friday, October 14, 2011 Where are we now? • Current Version 3.4 (frequent minor releases every 2 - 4 month) • Strong Backwards compatibility guarantees within major releases • Solid Inverted-Index implementation • large committer base from various companies • well established community • Upcoming Major Release is Lucene 4.0 (more about this later) 5 Friday, October 14, 2011 (Small) Features Overview • Fulltext search • Boolean-, Range-, Prefix-, Wildcard-, RegExp-, Fuzzy-, Phase-, & SpanQueries • Faceting, Result Grouping, Sorting, Customizable Scoring • Large set of Language / Text-Processing
    [Show full text]
  • View Whitepaper
    INFRAREPORT Top M&A Trends in Infrastructure Software EXECUTIVE SUMMARY 4 1 EVOLUTION OF CLOUD INFRASTRUCTURE 7 1.1 Size of the Prize 7 1.2 The Evolution of the Infrastructure (Public) Cloud Market and Technology 7 1.2.1 Original 2006 Public Cloud - Hardware as a Service 8 1.2.2 2016 - 2010 - Platform as a Service 9 1.2.3 2016 - 2019 - Containers as a Service 10 1.2.4 Container Orchestration 11 1.2.5 Standardization of Container Orchestration 11 1.2.6 Hybrid Cloud & Multi-Cloud 12 1.2.7 Edge Computing and 5G 12 1.2.8 APIs, Cloud Components and AI 13 1.2.9 Service Mesh 14 1.2.10 Serverless 15 1.2.11 Zero Code 15 1.2.12 Cloud as a Service 16 2 STATE OF THE MARKET 18 2.1 Investment Trend Summary -Summary of Funding Activity in Cloud Infrastructure 18 3 MARKET FOCUS – TRENDS & COMPANIES 20 3.1 Cloud Providers Provide Enhanced Security, Including AI/ML and Zero Trust Security 20 3.2 Cloud Management and Cost Containment Becomes a Challenge for Customers 21 3.3 The Container Market is Just Starting to Heat Up 23 3.4 Kubernetes 24 3.5 APIs Have Become the Dominant Information Sharing Paradigm 27 3.6 DevOps is the Answer to Increasing Competition From Emerging Digital Disruptors. 30 3.7 Serverless 32 3.8 Zero Code 38 3.9 Hybrid, Multi and Edge Clouds 43 4 LARGE PUBLIC/PRIVATE ACQUIRERS 57 4.1 Amazon Web Services | Private Company Profile 57 4.2 Cloudera (NYS: CLDR) | Public Company Profile 59 4.3 Hortonworks | Private Company Profile 61 Infrastructure Software Report l Woodside Capital Partners l Confidential l October 2020 Page | 2 INFRAREPORT
    [Show full text]
  • Final HDP with IBM Spectrum Scale
    Hortonworks HDP with IBM Spectrum Scale Chih-Feng Ku ([email protected]) Sr Manager, Solution Engineering APAC Par Hettinga ( [email protected]) Program Director, GloBal SDI EnaBlement Challenges with the Big Data Storage Models IBM Storage & SDI It’s not just one ! type of data – file & oBject Key Business processes now It’s not just ! depend on the analytics ! one type of analytics . Ingest data at Move data to the Perform Repeat! various end points analytics engine analytics More data sources than ever ! It takes hours or days Can’t just throw away data due to Before, not just data you own, ! to move the data! ! regulations or Business requirement But puBlic or rented data 2 Modernizing and Integrating Data Lake Infrastructure IBM Storage & SDI Insurance Company Use Case Value of IBM Power Systems Big SQL Business unit - Performance and Business Business scalability unit unit - High memory, I/O bandwidth - Optimized server for analytics and integration of accelerators ETL-processes DB2 for Warehouse H PD Data a Policy D d Value of IBM Spectrum Scale M Data D o M Customer D o - Separation of compute Data M p and storagepPart - In-place analytics (Posix conform) - Integration of objects and Files Global Namespace - HA / DR solutions - Storage tiering Spectrum Scale (inkl. tape integration) 8 16 9 17 EXP3524 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 System x3650 M4 - Flexibility of data movement 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 System x3650 M4 8 16 9 17 EXP3524 - Future: Using of common data formats (f.e.
    [Show full text]
  • Hortonworks Data Platform Release Notes (October 30, 2017)
    Hortonworks Data Platform Release Notes (October 30, 2017) docs.cloudera.com Hortonworks Data Platform October 30, 2017 Hortonworks Data Platform: Release Notes Copyright © 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Software Foundation projects that focus on the storage and processing of Big Data, along with operations, security, and governance for the resulting system. This includes Apache Hadoop -- which includes MapReduce, Hadoop Distributed File System (HDFS), and Yet Another Resource Negotiator (YARN) -- along with Ambari, Falcon, Flume, HBase, Hive, Kafka, Knox, Oozie, Phoenix, Pig, Ranger, Slider, Spark, Sqoop, Storm, Tez, and ZooKeeper. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process and installation and configuration tools have also been included. Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page.
    [Show full text]
  • JATE 2.0: Java Automatic Term Extraction with Apache Solr
    JATE 2.0: Java Automatic Term Extraction with Apache Solr Ziqi Zhang, Jie Gao, Fabio Ciravegna Regent Court, 211 Portobello, Sheffield, UK, S1 4DP ziqi.zhang@sheffield.ac.uk, j.gao@sheffield.ac.uk, f.ciravegna@sheffield.ac.uk Abstract Automatic Term Extraction (ATE) or Recognition (ATR) is a fundamental processing step preceding many complex knowledge engineering tasks. However, few methods have been implemented as public tools and in particular, available as open-source freeware. Further, little effort is made to develop an adaptable and scalable framework that enables customization, development, and comparison of algorithms under a uniform environment. This paper introduces JATE 2.0, a complete remake of the free Java Automatic Term Extraction Toolkit (Zhang et al., 2008) delivering new features including: (1) highly modular, adaptable and scalable ATE thanks to integration with Apache Solr, the open source free-text indexing and search platform; (2) an extended collection of state-of-the-art algorithms. We carry out experiments on two well-known benchmarking datasets and compare the algorithms along the dimensions of effectiveness (precision) and efficiency (speed and memory consumption). To the best of our knowledge, this is by far the only free ATE library offering a flexible architecture and the most comprehensive collection of algorithms. Keywords: term extraction, term recognition, NLP, text mining, Solr, search, indexing 1. Introduction by completely re-designing and re-implementing JATE to Automatic Term Extraction (or Recognition) is an impor- fulfill three goals: adaptability, scalability, and extended tant Natural Language Processing (NLP) task that deals collections of algorithms. The new library, named JATE 3 4 with the extraction of terminologies from domain-specific 2.0 , is built on the Apache Solr free-text indexing and textual corpora.
    [Show full text]
  • Hortonworks Data Platform Apache Solr Search Installation (July 12, 2018)
    Hortonworks Data Platform Apache Solr Search Installation (July 12, 2018) docs.cloudera.com Hortonworks Data Platform July 12, 2018 Hortonworks Data Platform: Apache Solr Search Installation Copyright © 2012-2018 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process and installation and configuration tools have also been included. Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs. Except where otherwise noted, this document is licensed under Creative Commons Attribution ShareAlike 4.0 License. http://creativecommons.org/licenses/by-sa/4.0/legalcode ii Hortonworks Data Platform July 12, 2018 Table of Contents 1.
    [Show full text]