
Hitachi NEXT 2018
Building a Data Pipeline With Pentaho – From Ingest to Analytics

Contents

Guided Demonstration: Data Source to Dashboard
Review the InputData Transformation
Review and Run the CT2000 Job
Create an Analysis Using the RenewableEnergy Model
View the CT2000 Dashboard
Resources

Guided Demonstration: Data Source to Dashboard

Introduction

In this guided demonstration, you will review a Pentaho Data Integration (PDI) transformation that obtains data about energy generation and usage around the world, prepares the data for analytics, and publishes it to the repository as a data service. You will then review a PDI job that runs the transformation, builds a data model (cube) from the prepared data, and publishes the model to the repository so it can be used for analytics. Finally, you will use Analyzer to analyze and visualize the data.

Objectives

After completing this guided demonstration, you will be able to:

• Describe the purpose of a Transformation and the following transformation steps:
  - Microsoft Excel Input
  - Select Values
  - Modified Java Script Value
  - Filter Rows
  - Sort Rows
  - Row Denormaliser
  - Annotate Stream
• Create a Pentaho Data Service from a transformation step
• Describe the purpose of a Job and the following job entries:
  - Start
  - Transformation
  - Build Model
  - Publish Model
• Use Pentaho Analyzer to analyze and visualize data

Note: The transformation and job reviewed in this demonstration use a sampling of PDI steps and job entries. The steps and job entries used in production vary depending on the incoming data and the business objectives.


Review the InputData Transformation

Start Pentaho Data Integration (Spoon) and Connect to the Repository

1. On the desktop, double-click the Data Integration icon.
2. To connect to the repository, at the far right of the toolbar, click Connect, and then click Pentaho Repository.
3. Enter the User Name as admin, and the Password as password, and then click Connect.

Open the InputData Transformation

Transformations are used to describe the data flows for Extract, Transform, and Load (ETL) processes, such as reading from a source, transforming data, and loading it into a target location. Each “step” in a transformation applies specific logic to the data flowing through the transformation. The steps are connected with “hops” that define the pathways the data follow through the transformation. The data flowing through the transformation is referred to as the “stream.”

The InputData transformation reads data from a Microsoft Excel file containing data about energy generation and usage around the world. It then fine-tunes the data, annotates the stream so that a data model (OLAP cube) can be built, and publishes the data to the repository as a Pentaho Data Service.
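Before opening the transformation in Spoon, it helps to see its overall shape. The following is a conceptual sketch in JavaScript, for illustration only, not a PDI file format; the step names are the ones reviewed in the following pages, and the strictly linear hop order is an assumption based on the order in which this guide reviews them.

// Conceptual outline of the InputData transformation -- not an actual PDI artifact.
var steps = [
  "Input Data xls",                  // Microsoft Excel Input
  "Defines fields",                  // Select Values
  "Builds tech hierarchy",           // Modified Java Script Value
  "Filters out redundancy",          // Filter Rows
  "Sort rows",                       // Sort Rows
  "Denormalises Indicator",          // Row Denormaliser
  "Remove Capacity = 0",             // second Filter Rows
  "Sets measures and hierarchies",   // Annotate Stream
  "OUTPUT"                           // exposed as the DataServiceCT2000 data service
];

// Hops connect consecutive steps and define the path the data stream follows.
var hops = [];
for (var i = 0; i < steps.length - 1; i++) {
  hops.push([steps[i], steps[i + 1]]);
}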

To open the InputData transformation:
1. From the menu, select File, and then click Open.
2. Navigate to the Public>CT2000>files>KTR folder.
3. Double-click InputDataTransformation.


Review the Microsoft Excel Input Step

The Microsoft Excel Input step provides the ability to read data from one or more Excel and OpenOffice files. In this example, the Excel file contains data about energy generation and usage by country for the years 2000-2015.

To review the Microsoft Excel Input step:
1. Double-click the Input Data xls step, and then review the configuration of the Files tab.

2. Click the Fields tab, and then review the configuration.

3. To preview the data, click Preview Rows, and then click OK.

4. To close the preview, click Close, and then to close the step dialog, click OK.


Review the Select Values Step

The Select Values step is useful for selecting, removing, renaming, and reordering fields in the stream, as well as for changing data types and configuring field length and precision. In this example, the fields are reordered, and the Technology field is replicated four times to create the Tech1, Tech2, Tech3, and Tech4 fields. You will see the purpose of those fields later in this demonstration.

To review the Select Values step:
1. Double-click the Defines fields step, and then review the configuration.

2. To close the step dialog, click OK.

Review the Modified Java Script Value Step

The Modified Java Script Value step provides an expression-based user interface for building JavaScript expressions, and it allows you to define multiple scripts within a single step. The Technology field from the spreadsheet contains the specific type of energy (for example, Renewable Municipal Waste). Because the specific energy sources can be grouped into broader categories, the expressions in this step assign each energy source to a set of categories, creating a hierarchy that will be used in the OLAP cube.

For example, the Technology “Renewable Municipal Waste” gets turned into the following four fields:
- Tech1: Total Renewable Energy
- Tech2: Bioenergy
- Tech3: Solid Biofuels
- Tech4: Renewable Municipal Waste
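For illustration, the hierarchy assignment amounts to JavaScript along the following lines. This is a minimal sketch rather than the exact script used in the demonstration: the Tech1-Tech4 field names come from the Select Values step, and only the Renewable Municipal Waste branch from the example above is shown.

// Sketch only -- not the actual "Builds tech hierarchy" script.
// Technology is an incoming stream field; the variables below would be mapped
// to the Tech1-Tech4 output fields in the step's Fields grid.
var Tech1 = "Total Renewable Energy";   // top level for every renewable source
var Tech2 = Technology;                 // defaults, overridden per category below
var Tech3 = Technology;
var Tech4 = Technology;                 // most specific level: the raw value

if (Technology == "Renewable Municipal Waste") {
  Tech2 = "Bioenergy";
  Tech3 = "Solid Biofuels";
}
// ...similar branches assign the other Technology values to their categories.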


To review the Modified Java Script Value step:

1. Double-click the Builds tech hierarchy step.
2. Click the Item_0 tab, and then review the script.

3. Click the Script 1 tab, and then review the script.

4. To close the step dialog, click OK.


Review the Filter Rows Step

The Filter Rows step filters rows based on conditions and comparisons, routing each row according to whether the condition evaluates to ‘true’ or ‘false.’ In this example, the previous JavaScript step produces some redundant rows, so they are filtered out of the stream.

To review the Filter Rows step:
1. Double-click the Filters out redundancy step, and then review the configuration.

2. To close the step dialog, click OK.

Review the Sort Rows Step

The Sort rows step sorts rows based on the fields you specify and on whether they should be sorted in ascending or descending order.

To review the Sort Rows step:

1. Double-click the Sort rows step, and then review the configuration.

2. To close the step dialog, click OK.


Review the Row Denormaliser Step

The Row Denormaliser step allows you to denormalize data by looking up key-value pairs, and it can also convert data types as it does so. In this example, the Indicator field is used to denormalize the rows, creating two additional fields: Total Generated GWh and Total Capacity MW.
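Conceptually, the step pivots each group of key-value rows into a single row, as in the sketch below. This is an illustration only, not how PDI implements the step; the grouping fields and Indicator values shown are assumptions, and the numbers are placeholders.

// Conceptual sketch of row denormalisation -- not PDI code.
// Input: one row per Indicator (the key field) within a group.
var inputRows = [
  { Country: "Germany", Year: 2015, Indicator: "Electricity generation (GWh)", Value: 111 },
  { Country: "Germany", Year: 2015, Indicator: "Installed capacity (MW)",      Value: 222 }
];

// Output: one row per group, with a new field per Indicator value.
var outputRow = { Country: "Germany", Year: 2015 };
inputRows.forEach(function (row) {
  if (row.Indicator.indexOf("generation") >= 0) {
    outputRow["Total Generated GWh"] = row.Value;   // placeholder figures only
  } else if (row.Indicator.indexOf("capacity") >= 0) {
    outputRow["Total Capacity MW"] = row.Value;
  }
});
// outputRow now carries both measures on a single row.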

To review the Row Denormaliser step:
1. Double-click the Denormalises Indicator step, and then review the configuration.

2. To close the step dialog, click OK, and then click Close.

Review the Second Filter Rows Step

The second Filter Rows step removes rows whose Total Capacity MW value is zero.

To review the Filter Rows step:
1. Double-click the Remove Capacity = 0 step, and then review the configuration.

2. To close the step dialog, click OK.


Review the Annotate Stream Step

The Annotate Stream step helps you refine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on the stream fields you specify. In this example, Total Generated GWh and Total Capacity MW are defined as measures, and the remaining fields are defined as dimensions within hierarchies for the location and the technologies. The Annotate Stream step modifies the default model produced by the Build Model job entry, which you will review later in this demonstration.
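As a rough picture of what the annotations describe, the resulting model can be summarized as follows. This is a conceptual sketch, not a PDI artifact or file format; the exact hierarchy levels are assumptions based on the fields used in this guide and the drill paths in the Analyzer session later on.

// Conceptual outline of the RenewableEnergy model described by the annotations.
var renewableEnergyModel = {
  measures: ["Total Generated GWh", "Total Capacity MW"],
  dimensions: {
    Location:   ["Continent", "Country"],              // geographic drill path
    Technology: ["Tech1", "Tech2", "Tech3", "Tech4"],  // energy-source hierarchy
    Time:       ["Year"]                               // assumed
  }
};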

To review the Annotate Stream step:
1. Double-click the Sets measures and hierarchies step, and then review the configuration.

2. To close the step dialog, click OK.

Review the Output Step

Prototyping a data model can be time-consuming, particularly when it involves setting up databases, creating the data model, building a data warehouse, and then negotiating access so that analysts can visualize the data and provide feedback. One way to streamline this process is to make the output of a transformation step a Pentaho Data Service. The data service exposes the output of the step so it can be queried as if it were stored in a physical table, even though the transformation results are never written to a physical database; instead, they are published to the Pentaho Server as a virtual table. The results of this transformation are used to create a Pentaho Data Service called DataServiceCT2000.
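For example, once the data service is published, a client connected to it can issue ordinary SQL against the virtual table. The query below is an assumed example, held in a JavaScript string for reference: the table name is the data service name, and the column names are assumptions based on fields created earlier in the transformation.

// Assumed example of a query a client could run against the virtual table.
var exampleQuery =
  'SELECT Country, Year, "Total Generated GWh", "Total Capacity MW" ' +
  'FROM DataServiceCT2000 ' +
  "WHERE Continent = 'Europe' AND Year = 2015";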


To review the Data Service:
1. Right-click the OUTPUT step, then click Data Services, and then click Edit.

2. To close the Data Service dialog, click OK.


Review and Run the CT2000 Job

Open the CT2000 Job

Jobs are used to coordinate ETL activities, such as defining the flow and dependencies that determine the order in which transformations run, or preparing for execution by checking conditions such as whether a source file is available.

The CT2000 job executes the InputDataTransformation, builds the data model (cube) based on the Annotate Stream step, and then publishes the model to the repository. After the job runs, the data service and model are available for reporting, analysis, and dashboarding.

To open the CT2000 job:
1. From the menu, select File, and then click Open.
2. Double-click CT2000JOB.

Review the Build Model Job Entry

The Build Model job entry creates Data Source Wizard (DSW) data models. In this example, the RenewableEnergy model is created from the DataServiceCT2000 data service based on the annotations defined in the Annotate Stream step.


To review the Build Model job entry:
1. Double-click the Build Model job entry.

2. To close the job entry dialog, click OK.

Review the Publish Model Job Entry

The Publish Model job entry allows you to publish the data model created with the Build Model job entry so it is available for use on the Pentaho Server.

To review the Publish Model job entry:
1. Double-click the Publish Model job entry.

2. To close the job entry dialog, click OK.


Run the CT2000 Job

To run the CT2000 job:

1. On the sub-toolbar, click the Run button.
2. Verify the Run Options, and then click Run.

Notice the green checkmarks indicating that each job entry successfully completed.


Create an Analysis Using the RenewableEnergy Model

Start the Pentaho User Console

1. On the desktop, double-click the User Console Login icon.
2. In the User Name field, type admin, then in the Password field, type password, and then click Login.

Create an Analysis Using the RenewableEnergy Model

To create a new analysis:
1. From the Home Perspective, click Create New>Analysis Report.
2. In the Select Data Source window, click Renewable Energy:Renewable Energy, and then click OK.
3. Review the RenewableEnergy model/cube.

4. To add Total Generated (GWh) to the Measures, double-click Total Generated (GWh).
5. To add Continent to the Rows, double-click Continent.


6. To add Tech2 to the Columns, select Tech2 and drag it to the Columns drop zone on the Layout panel.
7. To drill down to the Tech3 level for Bioenergy, double-click the Bioenergy column header.
8. To drill down to the Tech4 level for Solid biofuels, double-click the Solid biofuels column header.
9. To keep only the Renewable municipal waste data, right-click the Renewable municipal waste column header, and then click Keep Only Renewable municipal waste.
10. To drill down to the Country level for Europe, double-click the Europe row header.
11. To view the analysis as a chart, on the toolbar, click the Choose chart type icon, and then click Column.

12. To return to the table, on the toolbar, click the Switch to table format icon.
13. To close the analysis, on the Analysis Report tab, click the X, and then click Yes. (It is not necessary to save this analysis.)


View the CT2000 Dashboard

The CT2000 dashboard was created with CTools, using the RenewableEnergy data model and the DataServiceCT2000 data service, and lets users explore the data interactively from various perspectives.

To view the CT2000 dashboard:
1. From the Home Perspective, click Browse Files.
2. In the Folders panel, navigate to the Public>CT2000>dashboards folder.
3. In the Files panel, double-click the CDE sample file.


Resources

Hitachi Vantara Web Site https://www.hitachivantara.com

Innovate with Data and Analytics https://www.hitachivantara.com/en-us/solutions/data-analytics.html

Pentaho Data Integration https://www.hitachivantara.com/en-us/products/big-data-integration-analytics/pentaho-data-integration.html

Pentaho Business Analytics https://www.hitachivantara.com/en-us/products/big-data-integration-analytics/pentaho-business-analytics.html

Training https://www.hitachivantara.com/en-us/services/training-certification/training/pentaho.html

Pentaho Data Integration

DI1000: Pentaho Data Integration Fundamentals

DI1500: Pentaho Data Integration Advanced

Pentaho Business Analytics

BA1000: Business Analytics User Console

BA2000: Business Analytics Report Designer

BA3000: Business Analytics Data Modeling

CTools

CT1000: CTools Fundamentals

CT1500: CTools Advanced
