Building a Data Pipeline with Pentaho from Ingest to Analytics
Hitachi NEXT 2018
Building a Data Pipeline With Pentaho – From Ingest to Analytics

Contents
Guided Demonstration: Data Source to Dashboard
Review the InputData Transformation
Review and Run the CT2000 Job
Create an Analysis Using the RenewableEnergy Model
View the CT2000 Dashboard
Resources

HITACHI is a trademark or registered trademark of Hitachi, Ltd.

Guided Demonstration: Data Source to Dashboard

Introduction
In this guided demonstration, you will review a Pentaho Data Integration (PDI) transformation that obtains data about energy generation and usage around the world, prepares the data for analytics by building a data model (cube), and publishes the data to the repository as a data service. You will then review a PDI job that runs the transformation and publishes the cube to the repository so it can be used for analytics. Finally, you will use Analyzer to analyze and visualize the data.

Objectives
After completing this guided demonstration, you will be able to:
• Describe the purpose of a Transformation and the following transformation steps:
  - Microsoft Excel Input
  - Select Values
  - Modified Java Script Value
  - Filter Rows
  - Sort Rows
  - Row Denormaliser
  - Annotate Stream
• Create a Pentaho Data Service from a transformation step
• Describe the purpose of a Job and the following job entries:
  - Start
  - Transformation
  - Build Model
  - Publish Model
• Use Pentaho Analyzer to analyze and visualize data

Note
The transformation and job reviewed in this demonstration use a sampling of PDI steps and job entries. The steps and job entries used in production vary depending on the incoming data and the business objectives.

Review the InputData Transformation

Start Pentaho Data Integration (Spoon) and Connect to the Repository
1. On the desktop, double-click the Data Integration icon.
2. To connect to the repository, at the far right of the toolbar, click Connect, and then click Pentaho Repository.
3. Enter the User Name as admin, and the Password as password, and then click Connect.

Open the InputData Transformation
Transformations are used to describe the data flows for Extract, Transform, and Load (ETL) processes, such as reading from a source, transforming data, and loading it into a target location. Each “step” in a transformation applies specific logic to the data flowing through the transformation. The steps are connected with “hops” that define the pathways the data follows through the transformation. The data flowing through the transformation is referred to as the “stream.”

The InputData transformation receives data from a Microsoft Excel file containing data about energy generation and usage around the world. It then fine-tunes the data, creates a data model (OLAP cube), and publishes the data to the repository as a Pentaho Data Service.

To open the InputData transformation:
1. From the menu, select File, and then click Open.
2. Navigate to the Public>CT2000>files>KTR folder.
3. Double-click InputDataTransformation.

Review the Microsoft Excel Input Step
The Microsoft Excel Input step provides the ability to read data from one or more Excel and Open Office files. In this example, the Excel file contains data about energy generation and usage by country for the years 2000-2015.

To review the Microsoft Excel Input step:
1. Double-click the Input Data xls step, and then review the configuration of the Files tab.
2. Click the Fields tab, and then review the configuration.
3. To preview the data, click Preview Rows, and then click OK.
4. To close the preview, click Close, and then to close the step dialog, click OK.
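
The step, hop, and stream vocabulary introduced for transformations can be pictured outside PDI with a small Python sketch. This is only an analogy, not how PDI is implemented: each step is a generator function, each hop is the hand-off of one step's output to the next, and the stream is the sequence of rows moving through. The function names (input_rows, keep_years, rename_field, run_transformation), the field names, and the sample rows are all invented for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Step = Callable[[Iterable[Row]], Iterator[Row]]


def input_rows() -> Iterator[Row]:
    """Stand-in for an input step: emits rows into the stream."""
    # Invented sample rows; the real step reads them from the Excel file.
    yield {"Country": "Exampleland", "Year": 2000, "Value": 12.5}
    yield {"Country": "Exampleland", "Year": 1999, "Value": 7.0}
    yield {"Country": "Samplestan", "Year": 2015, "Value": 3.25}


def keep_years(first: int, last: int) -> Step:
    """Build a step that passes on only rows whose Year is in range."""
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            if first <= int(row["Year"]) <= last:
                yield row
    return step


def rename_field(old: str, new: str) -> Step:
    """Build a step that renames one field and leaves the rest untouched."""
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            renamed = dict(row)
            renamed[new] = renamed.pop(old)
            yield renamed
    return step


def run_transformation(source: Iterable[Row], steps: List[Step]) -> List[Row]:
    """Connect the steps in order and collect the rows that come out."""
    stream = source
    for step in steps:
        # Feeding one step's output into the next plays the role of a hop.
        stream = step(stream)
    return list(stream)


result = run_transformation(
    input_rows(),
    [keep_years(2000, 2015), rename_field("Value", "Amount")],
)
```

The sketch models only the direction of data flow. It says nothing about how PDI schedules or executes steps internally.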

Review the Select Values Step
The Select Values step is useful for selecting, removing, renaming, changing data types, and configuring the length and precision of the fields in the stream. In this example, the fields are reordered, and the Technology field is replicated four times to create the Tech1, Tech2, Tech3, and Tech4 fields. You will see the purpose of those fields later in this demonstration.

To review the Select Values step:
1. Double-click the Defines fields step, and then review the configuration.
2. To close the step dialog, click OK.

Review the Modified Java Script Value Step
The Modified Java Script Value step provides an expression-based user interface for building JavaScript expressions. This step also allows you to create multiple scripts for each step.

The Technology field from the spreadsheet contains the specific type of energy (for example, Renewable Municipal Waste). Since the specific energy sources can be categorized into higher levels, the expressions in this step assign the energy source to various categories to create a hierarchy that will be used in the OLAP cube. For example, the Technology “Renewable Municipal Waste” gets turned into the following four fields:
Tech1: Total Renewable Energy
Tech2: Bioenergy
Tech3: Solid Biofuels
Tech4: Renewable Municipal Waste

To review the Modified Java Script Value step:
1. Double-click the Builds tech hierarchy step.
2. Click the Item_0 tab, and then review the script.
3. Click the Script 1 tab, and then review the script.
4. To close the step dialog, click OK.

Review the Filter Rows Step
The Filter Rows step filters rows based on conditions and comparisons. The rows are then directed based on whether the filter evaluates to ‘true’ or ‘false.’ In this example, the previous JavaScript step results in some redundant data, so those rows are filtered out of the stream.
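
The replicate, categorize, and filter sequence described above can be sketched in Python. This is an illustration rather than the demonstration's actual logic: the real step uses JavaScript, and its scripts are not reproduced here. Only the “Renewable Municipal Waste” mapping comes from the text; the PARENTS table, the function names, the sample rows, and in particular the is_detail_row redundancy test are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]

# Parent categories for each specific Technology value. Only this one entry
# comes from the demonstration; a real script would cover many more values.
PARENTS: Dict[str, Tuple[str, str, str]] = {
    "Renewable Municipal Waste": (
        "Total Renewable Energy",  # Tech1
        "Bioenergy",               # Tech2
        "Solid Biofuels",          # Tech3
    ),
}


def replicate_technology(row: Row) -> Row:
    """Mimic the Select Values step: copy Technology into Tech1..Tech4."""
    out = dict(row)
    for level in ("Tech1", "Tech2", "Tech3", "Tech4"):
        out[level] = row["Technology"]
    return out


def build_tech_hierarchy(row: Row) -> Row:
    """Overwrite Tech1..Tech3 with parent categories when they are known."""
    out = dict(row)
    parents = PARENTS.get(row["Technology"])
    if parents is not None:
        out["Tech1"], out["Tech2"], out["Tech3"] = parents
    return out


def filter_rows(
    rows: Iterable[Row], condition: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Route each row to a 'true' or 'false' output, like Filter Rows."""
    true_rows: List[Row] = []
    false_rows: List[Row] = []
    for row in rows:
        (true_rows if condition(row) else false_rows).append(row)
    return true_rows, false_rows


def is_detail_row(row: Row) -> bool:
    """Hypothetical redundancy test: keep rows whose Technology was mapped."""
    return row["Technology"] in PARENTS


# Invented sample rows; the second stands in for a redundant aggregate row.
source = [
    {"Country": "Exampleland", "Technology": "Renewable Municipal Waste"},
    {"Country": "Exampleland", "Technology": "Bioenergy"},
]
prepared = [build_tech_hierarchy(replicate_technology(r)) for r in source]
kept, dropped = filter_rows(prepared, is_detail_row)
```

Here kept holds the row whose four Tech fields form the hierarchy path from the text, and dropped holds the row routed to the 'false' output. The condition actually configured in the Filters out redundancy step may differ.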

To review the Filter Rows step:
1. Double-click the Filters out redundancy step, and then review the configuration.
2. To close the step dialog, click OK.

Review the Sort Rows Step
The Sort Rows step sorts rows based on the fields you specify and on whether they should be sorted in ascending or descending order.

To review the Sort Rows step:
1. Double-click the Sort rows step, and then review the configuration.
2. To close the step dialog, click OK.

Review the Row Denormaliser Step
The Row Denormaliser step allows you to denormalize data by looking up key-value pairs. It also allows you to immediately convert data types. In this example, the Indicator field is used to denormalize the rows and create two additional fields: Total Generated GWh and Total Capacity MW.

To review the Row Denormaliser step:
1. Double-click the Denormalises Indicator step, and then review the configuration.
2. To close the step dialog, click OK, and then click Close.

Review the Second Filter Rows Step
The second Filter Rows step removes rows with Total Capacity MW of zero.

To review the Filter Rows step:
1. Double-click the Remove Capacity = 0 step, and then review the configuration.
2. To close the step dialog, click OK.

Review the Annotate Stream Step
The Annotate Stream step helps you refine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on the stream fields that you specify. In this example, the Total Generated GWh and Total Capacity MW fields are defined as measures, and the remaining fields are defined as dimensions within hierarchies for the location and the technologies. The Annotate Stream step modifies the default model produced by the Build Model job entry. You will review the Build Model job entry later in this demonstration.

To review the Annotate Stream step:
1. Double-click the Sets measures and hierarchies step, and then review the configuration.
2. To close the step dialog, click OK.

Review the Output Step
Prototyping a data model can be time-consuming, particularly when it involves setting up databases, creating the data model and setting up a data warehouse, then negotiating access so that analysts can visualize the data and provide feedback. One way to streamline this process is to make the output of a transformation step a Pentaho Data Service. The output of the transformation step is exposed by the data service so that the output data can be queried as if it were stored in a physical table, even though the results of the transformation are not stored in a physical database. Instead, results are published to the Pentaho Server as a virtual table.

The results of this transformation are being used to create a Pentaho Data Service called DataServiceCT2000.

To review the Data Service:
1. Right-click the OUTPUT step, then click Data Services, and then click Edit.
2. To close the Data Service dialog, click OK.

Review and Run the CT2000 Job

Open the CT2000 Job
Jobs coordinate ETL activities, such as defining the flow and dependencies that determine the order in which transformations run, or preparing for execution by checking conditions such as whether a source file is available.

The CT2000 job executes the InputDataTransformation, builds the data model (cube) based on the Annotate Stream step, and then publishes the model to the repository. After the job runs, the data service and model are available for reporting, analysis, and dashboarding.

To open the CT2000 job:
1. From the menu, select File, and then click Open.
2. Double-click CT2000JOB.

Review the Build Model Job Entry
The Build Model job entry creates Data Source Wizard (DSW) data models. In this example, the RenewableEnergy model is created from the DataServiceCT2000 data service based on the annotations defined in the Annotate Stream step.
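
To make the measure and dimension annotations more concrete, the following Python sketch shows the kind of roll-up a cube makes possible: measures are aggregated, and dimension levels decide how the rows are grouped. It is a conceptual illustration only and does not use or reproduce the RenewableEnergy model. The roll_up function, the Country field name, the “Wind Energy” category, the sample values, and the choice of sum as the aggregation are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Field names of the two measures, as named in the demonstration.
MEASURES = ("Total Generated GWh", "Total Capacity MW")

Row = Dict[str, object]


def roll_up(
    rows: List[Row], levels: Sequence[str]
) -> Dict[Tuple[str, ...], Dict[str, float]]:
    """Sum each measure for every combination of the given dimension levels."""
    totals: Dict[Tuple[str, ...], Dict[str, float]] = defaultdict(
        lambda: {measure: 0.0 for measure in MEASURES}
    )
    for row in rows:
        key = tuple(str(row[level]) for level in levels)
        for measure in MEASURES:
            totals[key][measure] += float(row[measure])
    return dict(totals)


# Invented sample rows; the values are not taken from the spreadsheet.
rows: List[Row] = [
    {"Country": "Exampleland", "Tech1": "Total Renewable Energy",
     "Tech2": "Bioenergy", "Total Generated GWh": 10.0, "Total Capacity MW": 2.0},
    {"Country": "Exampleland", "Tech1": "Total Renewable Energy",
     "Tech2": "Wind Energy", "Total Generated GWh": 30.0, "Total Capacity MW": 8.0},
    {"Country": "Samplestan", "Tech1": "Total Renewable Energy",
     "Tech2": "Bioenergy", "Total Generated GWh": 5.0, "Total Capacity MW": 1.0},
]

by_tech1 = roll_up(rows, ["Tech1"])
by_tech2 = roll_up(rows, ["Tech1", "Tech2"])
```

Grouping by Tech1 alone gives one total for all renewable energy, while adding Tech2 splits that total by category. Moving between hierarchy levels in this way is what the Tech1 through Tech4 fields are meant to support.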