Islandora Sync Bringing Drupal and Fedora Closer


2014 - Islandora Camp: Berkeley, California
Kristian Allen, Programmer - UCLA Digital Library

Discussion Outline
● UCLA Digital Library transition to Islandora
    Overview of shift to Islandora within UCLA DL
● Digital Library community development options
● Islandora Sync details
    Why the need?
    What does it do?
    Live Demo: Create slideshow with stock Drupal modules
● Questions

Follow along at home
● Instructions to enable on your Camp VirtualBox
    Terminal commands to install: https://gist.github.com/kjallen/8451109
● After executing the above, NEW content added to the Basic Image Collection will be syncable

UCLA Digital Library Snapshot
● 250 Digital Collections
● 1.8MM+ Digital Asset Files
● 20MM+ Metadata database records
● 35TB+ Disk storage

Growing Pains
● Current system
    Locally Developed: Java EE / Struts 1.02
    Oracle Database, Search and Persistence frameworks
    JBoss on Windows
● Challenges
    Domain specific knowledge needed for development
    Proprietary software limits ability to leverage the community
    Bolting on partial solutions can become problematic

The Community to the Rescue
● Blacklight / Hydra
    Ruby / Rails
● Islandora
    PHP / Drupal
● Common Platforms
    Apache Solr for search and discovery
    Fedora Commons for repository and storage

The Appeal of Drupal and Islandora
● Instant access to “lego block” code
    Thousands of modules help reduce development time
    Lower barrier to entry to encourage participation
● UCLA Library move to Drupal
    CMS shift to Drupal within UCLA Library increases local expertise
● Trend towards Drupal on campus
    Shift from Plone to Drupal in other departments

Islandora Development
● The Islandora Way != The Drupal Way
    Assets and metadata stored only in Fedora repository
    Limits which Drupal modules can be used out of the box
● Leveraging the Drupal community
    Harnessing the collective power of Drupal ecosystem is limited
● Option: Make Islandora more Drupal-ish?
Yep, we sure can!

Islandora Sync
● Joint project with Discovery Garden
    Mitch Mackenzie, lead DGI developer
● Goal to mask Fedora data as Drupal data
● Available for community use on GitHub

Core Features
● Ability to map Drupal Content Types to Fedora Content Models
● Map Drupal Fields to Fedora Datastreams or metadata elements
● Replicate Fedora RELS-EXT references with Drupal Relation entities

Islandora Sync Workflow
[Diagram labels: Islandora Modules (Basic Image, Large Image, Book, PDF, Audio, Video); Islandora Data Entry Forms; Fedora Repo; Native Drupal Entities; Drupal Modules (Views, Services, Gallery, Gmap)]
1) User updates an Islandora object, saves data to Fedora
2) With Islandora Sync enabled, hooks are fired to populate native Drupal entities with Fedora data

First iteration, be aware...
● Works for UCLA DL workflow
    The UCLA Digital Library is the gatekeeper for our Fedora instance
    Changes via processes outside of Islandora might not be sync’d
● One way sync
    Initial use cases placed most weight on Fedora to Drupal direction
    Hooks in place for Drupal push to Fedora
● Scalability
    Still need to tweak Drupal for performance for large imports

Islandora Data Access Options
● Access via Solr index
    Example: http://cla.library.ucla.edu
● Access via Fedora REST API
    Example: http://digital.library.ucla.edu/aqueduct
    (and now Islandora REST API too! Thanks Nigel)
● Access via Islandora Sync

Demo - Slideshow
● Walk through sync setup and basics
● Demonstrate syncing process
● Using Views slideshow and bxslider module

Where we hope to go
● Foster increased community development
    By lowering barrier, we hope to encourage increased development
● Continued improvement of core code
    Goal to contribute regular development to support roll into Islandora core
● Increased Drupal Features packages
    MODS image mapping example is available now
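
The mapping and one-way sync described under Core Features and Islandora Sync Workflow can be illustrated with a small sketch. It is written in Python purely for illustration; Islandora Sync itself is a Drupal (PHP) module, and nothing here is taken from its code. Every name below (FieldMapping, TypeMapping, sync_object_to_node, the field names and the sample object) is a hypothetical stand-in, and plain dicts take the place of Fedora objects and Drupal nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FieldMapping:
    drupal_field: str   # Drupal field machine name (hypothetical)
    datastream_id: str  # Fedora datastream ID, e.g. "MODS"
    element: str        # metadata element inside that datastream


@dataclass
class TypeMapping:
    content_model: str  # Fedora content model PID
    content_type: str   # Drupal content type machine name (hypothetical)
    fields: List[FieldMapping] = field(default_factory=list)


def sync_object_to_node(fedora_object: Dict, mappings: List[TypeMapping]) -> Optional[Dict]:
    """One-way (Fedora to Drupal) sync of a single object.

    Returns a dict standing in for a Drupal node, or None when no
    mapping matches any of the object's content models.
    """
    for mapping in mappings:
        if mapping.content_model not in fedora_object["models"]:
            continue
        node = {"type": mapping.content_type, "pid": fedora_object["pid"]}
        for fm in mapping.fields:
            datastream = fedora_object["datastreams"].get(fm.datastream_id, {})
            if fm.element in datastream:
                node[fm.drupal_field] = datastream[fm.element]
        return node
    return None


# Illustrative values only; a real hook would receive the saved object.
mappings = [
    TypeMapping(
        content_model="islandora:sp_basic_image",
        content_type="basic_image",
        fields=[FieldMapping("field_title", "MODS", "title")],
    )
]
saved_object = {
    "pid": "demo:1",
    "models": ["islandora:sp_basic_image"],
    "datastreams": {"MODS": {"title": "Sample image"}},
}
node = sync_object_to_node(saved_object, mappings)
print(node)
```

The point of the sketch is the direction of flow: the Fedora-side object is the source of truth, and the Drupal-side entity is a derived copy that stock modules such as Views can then consume.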
References
● UCLA Digital Library: http://digital2.library.ucla.edu
● Islandora: http://www.islandora.ca
● UCLA Library GitHub: https://github.com/UCLALibrary
● Islandora VM (7x.1-2 Ubuntu 32 bit): http://islandora.ca/downloads
● Islandora Sync GitHub repo: https://github.com/UCLALibrary/islandora_sync
● Discovery Garden: http://www.discoverygarden.ca
● Kristian Allen: [email protected]
● Gist install document: https://gist.github.com/kjallen/8451109