Course Title

"Charting the Course ... to Your Success!"

Real World Hadoop in the Enterprise

Course Summary

Description

Apache Hadoop is an open-source framework for building reliable, distributed compute clusters. Credited with a role in IBM Watson's 2011 Jeopardy! win, Hadoop can be used (with other related frameworks) to process large unstructured or semi-structured data sets from multiple sources in order to dissect, classify, learn from, and make suggestions for business analytics, decision support, and other advanced forms of machine intelligence.

This class is targeted at Java developers and assumes working knowledge of Java programming in Eclipse and comfort in a Unix shell environment. We will go well beyond the "Hello World" word-count example into practical, applied uses of Hadoop in large-scale real-world scenarios, including fraud detection, algorithmic trading, and data mining.

Students will develop on a real Hadoop/Drools cluster, in an environment architected for a dynamically changing, business-rule-driven infrastructure with multiple disparate data sources and large-scale datasets.

Topics

Overview
Hadoop Architecture
Retrieving and Localizing Data
Feeding Hadoop in the Enterprise
Machine Learning with Mahout
Applying Business Rules with Drools
Pig and Pig Pipelines
Working with the Hive
Testing, Performance and Troubleshooting
Other Optional Overview Topics

Audience

This class is designed for Java developers.

Prerequisites

This class assumes working knowledge of Java programming in Eclipse and comfort in a Unix shell environment.

Introduction to Java (IJSEP) - Experience developing Java with Eclipse
Introduction to Unix (UNIXI) - Exposure to bash or tcsh shell use
Data Persistence with JPA 2 - Experience using JPA and data access

Duration

Five days

Due to the nature of this material, this document refers to numerous hardware and software products by their trade names.
References to other companies and their products are for informational purposes only, and all trademarks are the properties of their respective companies. It is not the intent of ProTech Professional Technical Services, Inc. to use any of these names generically.

PT0756_REALWORLDHADOOPINTHEENTERPRISE.DOC

Course Outline

I. Overview
   A. Map/Reduce
   B. Hadoop
   C. NoSQL
   D. Mahout
   E. Alternate Frameworks

II. Hadoop Architecture
   A. Hadoop Map/Reduce
   B. HDFS
   C. Cassandra
   D. HBase
   E. Hive
   F. Pig

III. Retrieving and Localizing Data
   A. Using JPA in Map/Reduce: Pros and Cons
   B. HDFS
   C. NoSQL
   D. HBase
   E. Cassandra
   F. Neo4j
   G. Sqoop
   H. Flume
   I. Caching with JBoss Infinispan
   J. Caching with OpenTerracotta
   K. Using Spring Data

IV. Feeding Hadoop in the Enterprise
   A. Apache UIMA
   B. Spring Integration
   C. Apache Camel
   D. Spring Batch

V. Machine Learning with Mahout
   A. Artificial Intelligence Overview
   B. Fuzzy Logic
   C. K-Means
   D. Pattern Mining
   E. Bayesian Classifiers
   F. Analytics
   G. Random Forests
   H. Decision Support with Mahout and Hadoop

VI. Applying Business Rules with Drools
   A. Drools Overview
   B. Integrating Rules-based approach with Hadoop
   C. Decision Making with Drools and Hadoop
   D. Integrating Drools, Mahout, and Hadoop

VII. Pig and Pig Pipelines
   A. Pig Latin
   B. Pig Pipelines
   C. Pig UDFs (User Defined Functions)

VIII. Working with the Hive
   A. Hive and HDFS
   B. Meta-data and indexing
   C. Hive UDFs (User Defined Functions)
   D. Hive and Amazon S3
   E. HQL

IX. Testing, Performance and Troubleshooting
   A. TDD with MRUnit
   B. TDD with other Unit Testing Frameworks
   C. Bottleneck discovery
   D. Monitoring
   E. Join Framework Optimization
   F. Troubleshooting
   G. Hadoop and Virtualization
   H. Hadoop in the Cloud
   I. Hadoop and Amazon EC2

X. Other Optional Overview Topics
   A. Storm Project
   B. Apache Kafka
   C. Cassandra Bolt