HDF Operations: Hortonworks Data Flow


Overview
This course is designed for Data Stewards and Data Flow Managers who want to automate the flow of data between systems. Topics include an introduction to NiFi, installing and configuring NiFi, a detailed explanation of the NiFi user interface, its components and the elements associated with each, how to build a dataflow, NiFi Expression Language, NiFi clustering, data provenance, security around NiFi, monitoring tools, and HDF best practices.

Duration
3 days

Format
50% Lecture/Discussion
50% Hands-on Labs

Target Audience
Data Engineers, Integration Engineers and Architects who are looking to automate data flow between systems.

Prerequisites
Students should be familiar with programming principles and have previous experience in software development. Experience with Linux and a basic understanding of dataflow tools would be helpful. No prior Hadoop experience is required, but it is very helpful.

Course Objectives
• Describe HDF, Apache NiFi and its use cases
• Describe NiFi architecture
• Understand NiFi features and characteristics
• Understand system requirements to run NiFi
• Understand installing and configuring NiFi
• Understand the NiFi user interface in depth
• Understand how to build a dataflow using NiFi
• Understand Processor and its elements
• Understand Connection and its elements
• Understand Processor Group and its elements
• Understand Remote Processor Group and its elements
• Learn how to optimize a dataflow
• Learn how to use NiFi Expression Language
• Learn about Attributes and Templates in NiFi
• Understand concepts of a NiFi cluster
• Explain data provenance in NiFi
• Learn how to secure NiFi
• Learn how to effectively monitor NiFi
• Learn about HDF best practices

Hands-On Labs
• Installing NiFi
• Building a Workflow
• Working with Processor Groups
• Working with Remote Processor Groups
• Using NiFi Expression Language
• Using Templates
• Working with a NiFi Cluster
• Monitoring NiFi
• HDF Integration with HDP
• Securing HDF with 2-Way SSL
• NiFi User Authentication with LDAP
• End of the Course Project

Demos
• The NiFi User Interface
• Anatomy of a Processor
• Anatomy of a Connection
• Working with Attributes
• Data Provenance
• NiFi Notification Services

Certification
Hortonworks offers a comprehensive certification program that identifies you as an expert in Apache Hadoop. Visit hortonworks.com/training/certification for more information.

Hortonworks University
Hortonworks University is your expert source for Apache Hadoop training and certification. Public and private on-site courses are available for developers, administrators, data analysts and other IT professionals involved in implementing big data solutions. Classes combine presentation material with industry-leading hands-on labs that fully prepare students for real-world Hadoop scenarios.

About Hortonworks
Hortonworks develops, distributes and supports the only 100 percent open source distribution of Apache Hadoop explicitly architected, built and tested for enterprise-grade deployments.

US: 1.855.846.7866
International: +1.408.916.4121
www.hortonworks.com
5470 Great America Parkway, Santa Clara, CA 95054 USA