Solr Properties Schema Index
Total Page:16
File Type:pdf, Size:1020Kb
Recommended publications
-
Enterprise Search Technology Using Solr and Cloud Padmavathy Ravikumar Governors State University
Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Enterprise Search Technology Using Solr and Cloud Padmavathy Ravikumar Governors State University Follow this and additional works at: http://opus.govst.edu/capstones Part of the Databases and Information Systems Commons Recommended Citation Ravikumar, Padmavathy, "Enterprise Search Technology Using Solr and Cloud" (2015). All Capstone Projects. 91. http://opus.govst.edu/capstones/91 For more information about the academic degree, extended learning, and certificate programs of Governors State University, go to http://www.govst.edu/Academics/Degree_Programs_and_Certifications/ Visit the Governors State Computer Science Department This Project Summary is brought to you for free and open access by the Student Capstone Projects at OPUS Open Portal to University Scholarship. It has been accepted for inclusion in All Capstone Projects by an authorized administrator of OPUS Open Portal to University Scholarship. For more information, please contact [email protected]. ENTERPRISE SEARCH TECHNOLOGY USING SOLR AND CLOUD By Padmavathy Ravikumar Masters Project Submitted in partial fulfillment of the requirements For the Degree of Master of Science, With a Major in Computer Science Governors State University University Park, IL 60484 Fall 2014 Abstract Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. -
The Construction of Open Data Portal Using DKAN for Integrate to Multiple Japanese Local Government Open Data *Toshikazu Seto 1 , Yoshihide Sekimoto 2
Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings Volume 16 Bonn, Germany Article 17 2016 The Construction of Open Data Portal using DKAN for Integrate to Multiple Japanese Local Government Open Data Toshikazu Seto Center for Spatial Information Science, the University of Tokyo Yoshihide Sekimoto Institute of Industrial Science, the University of Tokyo Follow this and additional works at: https://scholarworks.umass.edu/foss4g Part of the Computer and Systems Architecture Commons, and the Geographic Information Sciences Commons Recommended Citation Seto, Toshikazu and Sekimoto, Yoshihide (2016) "The Construction of Open Data Portal using DKAN for Integrate to Multiple Japanese Local Government Open Data," Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings: Vol. 16 , Article 17. DOI: https://doi.org/10.7275/R5W957B0 Available at: https://scholarworks.umass.edu/foss4g/vol16/iss1/17 This Poster is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings by an authorized editor of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. Center for Spatial Information Science at The University of Tokyo The Construction of Open Data Portal using DKAN for Integrate to Multiple Japanese Local Government Open Data *Toshikazu Seto 1 , Yoshihide Sekimoto 2 *1: Center for Spatial Information Science, the University of Tokyo, 4-6-1, Komaba, Meguro-ku, Tokyo 153-8505, Japan Email: [email protected] 2: Institute of Industrial Science, the University of Tokyo 3. -
Linkedpipes DCAT-AP Viewer: a Native DCAT-AP Data Catalog⋆
LinkedPipes DCAT-AP Viewer: A Native DCAT-AP Data Catalog⋆ Jakub Klímek[0000−0001−7234−3051] and Petr Škoda[0000−0002−2732−9370] Charles University, Faculty of Mathematics and Physics Malostranské nám. 25, 118 00 Praha 1, Czech Republic [email protected] Abstract. In this demonstration we present LinkedPipes DCAT-AP Viewer (LP-DAV), a data catalog built to support DCAT-AP, the European standard for representation of metadata in data portals, and an application profile of the DCAT W3C Recommendation. We present its architecture and data loading process and on the example of the Czech National Open Data portal we show its main advantages compared to other data catalog solutions such as CKAN. These include the support for Named Authority Lists in EU Vocabularies (EU NALs), controlled vocabularies mandatory in DCAT-AP, and the support for bulk loading of DCAT-AP RDF dumps using LinkedPipes ETL. Keywords: catalog · DCAT · DCAT-AP · linked data 1 Introduction Currently, two worlds exist in the area of data catalogs on the web. In the first one there are a few well established data catalog implementations such as CKAN or DKAN, each with their data model and a JSON-based API for accessing and writing the metadata. In the second one, there is the Linked Data and RDF based DCAT W3C Recommendation [1] and its application profiles, such as the European DCAT-AP, which are de facto standards for representation of metadata in data portals. The problem is that CKAN has been around for a while now, and is better developed, whereas DCAT is still quite new, with insufficient tooling support, nevertheless it is the standard. -
Final Report CS 5604: Information Storage and Retrieval
Final Report CS 5604: Information Storage and Retrieval Solr Team Abhinav Kumar, Anand Bangad, Jeff Robertson, Mohit Garg, Shreyas Ramesh, Siyu Mi, Xinyue Wang, Yu Wang January 16, 2018 Instructed by Professor Edward A. Fox Virginia Polytechnic Institute and State University Blacksburg, VA 24061 1 Abstract The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects [6]. We are using a 21 node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data beyond what previously had been collected. Another important part in this project is optimizing the current system to analyze and allow access to the new data. To accomplish these goals, this project is separated into 6 parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. This report describes the work completed by the SOLR team which improves the current searching and storage system. We include the general architecture and an overview of the current system. We present the part that Solr plays within the whole system with more detail. We talk about our goals, procedures, and conclusions on the improvements we made to the current Solr system. This report also describes how we coordinate with other teams to accomplish the project at a higher level. Additionally, we provide manuals for future readers who might need to replicate our experiments. -
The Envidat Concept for an Institutional Environmental Data
Iosifescu Enescu, I, et al. 2018. The EnviDat Concept for an Institutional Environmental Data Portal. Data Science Journal, 17: 28, pp. 1–17. DOI: https://doi.org/10.5334/dsj-2018-028 RESEARCH PAPER The EnviDat Concept for an Institutional Environmental Data Portal Ionuț Iosifescu Enescu1, Gian-Kasper Plattner1, Lucia Espona Pernas1, Dominik Haas-Artho1, Sandro Bischof1, Michael Lehning2,3 and Konrad Steffen1,3,4 1 Swiss Federal Institute for Forest, Snow and Landscape WSL, CH 2 WSL Institute for Snow and Avalanche Research SLF, CH 3 School of Architecture, Civil and Environmental Engineering, EPFL, CH 4 ETH Zurich, CH Corresponding author: Ionuț Iosifescu Enescu ([email protected]) EnviDat is the environmental data portal developed by the Swiss Federal Institute for Forest, Snow and Landscape Research WSL. The strategic initiative EnviDat highlights the importance WSL lays on Research Data Management (RDM) at the institutional level and demonstrates the commitment to accessible research data in order to advance environmental science. EnviDat focuses on registering and publishing environmental data sets and provides unified and efficient access to the WSL’s comprehensive reservoir of environmental monitoring and research data. Research data management is organized in a decentralized manner where the responsibility to curate research data remains with the experts and the original data providers. EnviDat supports data producers and data users in registration, documentation, storage, publication, search and retrieval of a wide range of heterogeneous data sets from the environmental domain. Innovative features include (i) a flexible, three-layer metadata schema, (ii) an additive data discovery model that considers spatial data and (iii) a DataCRediT mechanism designed for specifying data authorship. -
Hot Technologies” Within the O*NET® System
Identification of “Hot Technologies” within the O*NET® System Phil Lewis National Center for O*NET Development Jennifer Norton North Carolina State University Prepared for U.S. Department of Labor Employment and Training Administration Office of Workforce Investment Division of National Programs, Tools, & Technical Assistance Washington, DC April 4, 2016 www.onetcenter.org National Center for O*NET Development, Post Office Box 27625, Raleigh, NC 27611 Table of Contents: Background; Hot Technologies Identification Procedure (Mine data to collect the top technology related terms; Convert the data-mined technology terms into O*NET technologies; Organize the hot technologies within the O*NET Tools & Technology Taxonomy; Link the hot technologies to O*NET-SOC occupations; Determine the display of occupations linked to a hot technology); Summary; Figure 1: O*NET Hot Technology Icon; Appendix A: Hot Technologies Identified During the Initial Implementation. National Center -
Apache Lucene - a Library Retrieving Data for Millions of Users
Apache Lucene - a library retrieving data for millions of users Simon Willnauer Apache Lucene Core Committer & PMC Chair [email protected] / [email protected] Friday, October 14, 2011 About me? • Lucene Core Committer • Project Management Committee Chair (PMC) • Apache Member • BerlinBuzzwords Co-Founder • Addicted to OpenSource Agenda ‣ Apache Lucene a historical introduction ‣ (Small) Features Overview ‣ The Lucene Eco-System ‣ Upcoming features in Lucene 4.0 ‣ Maintaining superior quality in Lucene (backup slides) ‣ Questions Apache Lucene - a brief introduction • A fulltext search library entirely written in Java • An ASF Project since 2001 (happy birthday Lucene) • Founded by Doug Cutting • Grown up - being the de-facto standard in OpenSource search • Starting point for other well-known projects • Apache 2.0 License Where are we now? • Current Version 3.4 (frequent minor releases every 2 - 4 months) • Strong Backwards compatibility guarantees within major releases • Solid Inverted-Index implementation • large committer base from various companies • well established community • Upcoming Major Release is Lucene 4.0 (more about this later) (Small) Features Overview • Fulltext search • Boolean-, Range-, Prefix-, Wildcard-, RegExp-, Fuzzy-, Phrase-, & SpanQueries • Faceting, Result Grouping, Sorting, Customizable Scoring • Large set of Language / Text-Processing -
Towards a Harmonized Dataset Model for Open Data Portals
HDL - Towards a Harmonized Dataset Model for Open Data Portals Ahmad Assaf1;2, Raphaël Troncy1 and Aline Senart2 1 EURECOM, Sophia Antipolis, France, <[email protected]> 2 SAP Labs France, <[email protected]> Abstract. The Open Data movement triggered an unprecedented amount of data published in a wide range of domains. Governments and corporations around the world are encouraged to publish, share, use and integrate Open Data. There are many areas where one can see the added value of Open Data, from transparency and self-empowerment to improving efficiency, effectiveness and decision making. This growing amount of data requires rich metadata in order to reach its full potential. This metadata enables dataset discovery, understanding, integration and maintenance. Data portals, which are considered to be datasets' access points, offer metadata represented in different and heterogeneous models. In this paper, we first conduct a unique and comprehensive survey of seven metadata models: CKAN, DKAN, Public Open Data, Socrata, VoID, DCAT and Schema.org. Next, we propose HDL, a harmonized dataset model based on this survey. We describe use cases that show the benefits of providing rich metadata to enable dataset discovery, search and spam detection. Keywords: Dataset Metadata, Dataset Profile, Dataset Model, Data Quality 1 Introduction Open data is the data that can be easily discovered, reused and redistributed by anyone. It can include anything from statistics, geographical data, meteorological data to digitized books from libraries. Open data should have both legal and technical dimensions. It should be placed in the public domain under liberal terms of use with minimal restrictions and should be available in electronic formats that are non-proprietary and machine readable. -
What Is Open Source?
Putting Open Source to Work in the Enterprise: A Guide to Risks and Opportunities © Copyright 2007 SAP AG. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice. Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors. Microsoft, Windows, Excel, Outlook, and PowerPoint are registered trademarks of Microsoft Corporation. HTML, XML, XHTML and W3C are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology. Java is a registered trademark of Sun Microsystems, Inc. JavaScript is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape. MaxDB is a trademark of MySQL AB, Sweden. SAP, R/3, mySAP, mySAP.com, xApps, xApp, SAP NetWeaver, Duet, PartnerEdge, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. IBM, DB2, DB2 Universal Database, OS/2, Parallel Sysplex, MVS/ESA, AIX, S/390, AS/400, OS/390, OS/400, iSeries, pSeries, xSeries, zSeries, System i, System i5, System p, System p5, System x, System z, System z9, z/OS, AFP, Intelligent Miner, WebSphere, Netfinity, Tivoli, Informix, i5/OS, POWER, POWER5, POWER5+, OpenPower and PowerPC are trademarks or registered -
JATE 2.0: Java Automatic Term Extraction with Apache Solr
JATE 2.0: Java Automatic Term Extraction with Apache Solr Ziqi Zhang, Jie Gao, Fabio Ciravegna Regent Court, 211 Portobello, Sheffield, UK, S1 4DP ziqi.zhang@sheffield.ac.uk, j.gao@sheffield.ac.uk, f.ciravegna@sheffield.ac.uk Abstract Automatic Term Extraction (ATE) or Recognition (ATR) is a fundamental processing step preceding many complex knowledge engineering tasks. However, few methods have been implemented as public tools and in particular, available as open-source freeware. Further, little effort is made to develop an adaptable and scalable framework that enables customization, development, and comparison of algorithms under a uniform environment. This paper introduces JATE 2.0, a complete remake of the free Java Automatic Term Extraction Toolkit (Zhang et al., 2008) delivering new features including: (1) highly modular, adaptable and scalable ATE thanks to integration with Apache Solr, the open source free-text indexing and search platform; (2) an extended collection of state-of-the-art algorithms. We carry out experiments on two well-known benchmarking datasets and compare the algorithms along the dimensions of effectiveness (precision) and efficiency (speed and memory consumption). To the best of our knowledge, this is by far the only free ATE library offering a flexible architecture and the most comprehensive collection of algorithms. Keywords: term extraction, term recognition, NLP, text mining, Solr, search, indexing 1. Introduction Automatic Term Extraction (or Recognition) is an important Natural Language Processing (NLP) task that deals with the extraction of terminologies from domain-specific textual corpora. … by completely re-designing and re-implementing JATE to fulfill three goals: adaptability, scalability, and extended collections of algorithms. The new library, named JATE 2.0, is built on the Apache Solr free-text indexing and -
Recommendations for Open Data Portals: from Setup to Sustainability
This study has been prepared by Capgemini Invent as part of the European Data Portal. The European Data Portal is an initiative of the European Commission, implemented with the support of a consortium led by Capgemini Invent, including Intrasoft International, Fraunhofer Fokus, con.terra, Sogeti, 52North, Time.Lex, the Lisbon Council, and the University of Southampton. The Publications Office of the European Union is responsible for contract management of the European Data Portal. For more information about this paper, please contact: European Commission Directorate General for Communications Networks, Content and Technology Unit G.1 Data Policy and Innovation Daniele Rizzi – Policy Officer Email: [email protected] European Data Portal Gianfranco Cecconi, European Data Portal Lead Email: [email protected] Written by: Jorn Berends Wendy Carrara Wander Engbers Heleen Vollers Last update: 15.07.2020 www: https://europeandataportal.eu/ @: [email protected] DISCLAIMER By the European Commission, Directorate-General of Communications Networks, Content and Technology. The information and views set out in this publication are those of the author(s) and do not necessarily reflect the official opinion of the Commission. The Commission does not guarantee the accuracy of the data included in this study. Neither the Commission nor any person acting on the Commission’s behalf may be held responsible for the use, which may be made of the information contained therein. Luxembourg: Publications Office of the European Union, 2020 © European Union, 2020 OA-03-20-042-EN-N ISBN: 978-92-78-41872-4 doi: 10.2830/876679 The reuse policy of European Commission documents is implemented by the Commission Decision 2011/833/EU of 12 December 2011 on the reuse of Commission documents (OJ L 330, 14.12.2011, p. -
RDM Technical Infrastructure Components and Evaluations
RDM Technical Infrastructure Components and Evaluations John A. Lewis 13/11/2014 Contents: RDM Technical Infrastructure Components: 1. Integrated systems and integrating components; 2. Repository platforms; 3. Digital preservation (repository) systems and services; 4. ‘Archive Data’ storage; 5. ‘Active data’ management and collaboration platforms; 6. Catalogue software / Access platforms; 7. Current Research Information Systems (CRIS); 8. Data management planning (DMP) tools; 9. Metadata Generators; 10. Data capture and workflow management systems; 11. Data transfer protocols; 12. Identifier services and identity components; 13. Other software systems and platforms of interest. Reviews, Evaluations and Comparisons of Infrastructure Components. References. RDM Technical Infrastructure Components Components of the RDM Infrastructures established by higher education institutions are briefly considered below. The component function, the software / platform underlying the component and component interoperability are described, any evaluations identified, and institutions employing the component,