Index

A

absolute positioning, PRD, 398
access control
  layer refining privileges, 349
  schema design and, 483
accumulating periodic snapshot, 149–150
action definitions, 68
action sequence editor, 80–82
Action Sequence Wizard, 80, 86
action sequences
  adding as templates for Design Studio, 88
  creating with PDS. See PDS (Pentaho Design Studio)
  Customers per Website pie chart, 548–551
  executed by solution engine, 68
  executing in background, 422–423
  functionality of, 78
  inputs for, 83–85
  for location data, 559–561
  outputs for, 85
  process actions and, 85–89
  programming Scheduler with, 412, 417–420
  running jobs within, 335
  solution repository containing, 68
  subscribing to, 423–426
  using transformations in, 334–336
actions, process, 85–89
Active Directory (AD), and EE single sign-on, 77
Active@ ISO Burner, 22
Ad Hoc Report Wizard, 373–375
Add Cube button, 466–467
Add Job process action, Scheduler, 418–419
Add Levels, 474–476
Add Parameter function, PRD, 386–389
Add Sequence step, PDI
  loading date dimension, 268–269
  loading demography dimension, 283, 284
additive measures, 150
Addresses tab page, Mail job, 290
Ad-Hoc Report component, 192
admin user
  creating slave servers, 340
  managing PDI repository accounts, 326–327
  PDI repository, 324
administration-console, 38, 44
administrative tasks
  data sources, 60–61
  managing schedules and subscriptions, 61
  Pentaho Administrative Console. See PAC (Pentaho Administrative Console)
  user management, 58–60
Administrator profile, PDI repository, 327
Advanced category, Connection dialog, 250
Advisor button, PAD, 501
Age and Age Sequence step, demography dimensions, 282


Age Group step, demography dimensions, 282, 284–285
Aggregate Designer, 130, 442
aggregate tables
  creating manually, 500
  drawbacks of, 502
  extending Mondrian with, 497–500
  generating and populating, 445
  Pentaho Analysis Services, 445
aggregation
  alternatives to, 502
  benefits of, 496–497
  data integration process, 229
  data warehouse performance, 130
  design, 163–164
  Mondrian, 496
  PRD reports, 393–395
  restricting results of, 157
  Slice and Dice example of, 17–18
  using subreports for different, 404–406
  WAQR reports, 374
AJAX technology, CDF dashboards, 529–530
algorithms, as data mining tool, 508–509
aliases, 152–153, 384–385
all level, MDX hierarchies, 450–451
all member, MDX hierarchies, 450
All Schedules panel, server admin's workspace, 428
Alves, Pedro, 530
analysis
  examples of, 16–19
  views in Pentaho BI Server, 484–485
  views in user console, 73–74
analytical databases, 142–143
analytics, business, 503
AND operator, multiple restrictions, 157
appliances, data warehouse, 143–144
Application Service Providers (ASPs), 144
architecture
  Community Framework, 40–41
  data warehouse. See data warehousing architecture
  Pentaho Analysis Services, 442–444
  Pentaho BI, 64
  reporting, 371–373
archiving
  data warehouse performance and, 132
  transaction data, 128
ARFF (Attribute Relation File Format), 511, 519
AS. See action sequences
ASPs (Application Service Providers), 144
assignments, user management, 59
association, as data mining tool, 507–508
at utility, job scheduling, 421
Attribute Relation File Format (ARFF), 511, 519
attributes
  dimensions, 476–477
  global data model, 206–207
  hierarchies, 472–473
  level, 474–475
  measures, 469–470
  Mondrian cubes, 467
  not fitting into dimension tables, 180–181
audit columns, 163–164
authentication
  hibernate database storing data on, 45, 47
  JDBC security configuration, 50
  Mail job entry configuration, 290–291
  Pentaho Administrative Console configuration, 57–58
  slave server configuration, 340
  SMTP configuration, 53–54
  Spring Security handling, 60, 69
authorization
  hibernate database storing data on user, 47, 60
  JDBC security configuration, 50
  managing user accounts, 327–328
  Spring Security handling, 60, 69
  user configuration, 58–60
automated knowledge discovery, 503. See also data mining
automatic startup, UNIX/Linux systems, 532–534
availability, remote execution, 338–339
averages, calculating, 217–218
axes
  controlling dimension placement on, 489–490
  dimensions on only one axis, 455
  MDX representing information on multiple axes, 453
Azzurri Clay, 32

B

back office
  data warehouse architecture, 117
  database support for, 95–96
back-end programs, Pentaho BI stack, 66
background directory, content repository, 413
background execution, 422–423, 426–429
backup, PDI repository, 329
banded report tools, 376
bar charts, 400–402
Base concept, metadata layer, 357
batch-wise ETL, 118
BC (business case), 192. See also World Class Movies
BDW (Business Data Warehouse), 111, 115
BETWEEN...AND operators, 151–152
BI (business intelligence). See also Pentaho BI Server
  analytics and, 503
  components, 70–73
  dashboards and, 529
  data mart design and. See data marts, design
  data mining and, 505–506
  definition of, 107
  example business case. See WCM (World Class Movies), example business case
  importance of data, 108–109
  platform, Pentaho BI stack, 64
  purpose of, 105–109
  real-time data warehousing and, 140–142
  reports and. See reports
  using reference and master data, 127–128
BI Developer examples
  button-single-parameter.prpt sample, 13–14
  CDF section, 530
  overview of, 8–9
  Regional Sales - HTML reporting, 11–12
  Regional Sales - Line/Bar Chart, 16–17
  Slice and Dice Analysis, 17–18
BIRT reports, 72, 372
biserver-ce, 38. See also Pentaho home directory
bitmap indexes, and data warehousing, 129–130
blob field, reports with images, 401–403
BOTTOMCOUNT function, MDX queries, 457
bridge tables
  maintenance of, 229
  multi-valued dimensions and, 182–183
  navigating hierarchies using, 184–185
browsers, logging in, 6–7
Building the Data Warehouse (Inmon), 113–114
built-in variables, 314
bum, managing Linux init scripts, 42
Burst Sales Report, 54
bursting
  defined, 430
  implementing in Pentaho, 430
  other implementations, 438
  overview of, 430
  rental reminder e-mails example, 430–438
bus architecture, 179–189
business analysts, 193–195
business case (BC), 192. See also WCM (World Class Movies), example business case
Business Columns, at logical layer, 362–363
Business Data Warehouse (BDW), 111, 115
business intelligence. See BI (business intelligence)
Business Intelligence server. See Pentaho BI Server
business layer, metadata model, 71
business modeling, using star schemas. See star schemas
business rules engines, 141
Business Tables, 362–365
Business Views, 71, 362
button-single-parameter.prpt, 13–14

C

c3p0 connection pool, 49–50
C4.5 decision tree algorithm, 508, 512
C5.0 decision tree algorithm, 508–509
caching, Join Rows (Cartesian product) step, 280
Calculate and Format Dates step, 269–273
Calculate Time step, 277, 281
calculated members, 459–460, 483
calculations, PRD functions for, 395

Calculator step, PDI
  age and income groups for demography dimension, 285
  Calculate Time step, 281
  current and last year indicators, 276
  loading date dimension, 269–273
  loading demography dimension, 282
calendar days, calculating date dimensions, 270
Carte server
  clustering, 341–342
  creating slave servers, 340–341
  as PDI tool, 231
  remote execution and, 337–339, 341
  running, 339–340
carte.bat script, 339–340
carte.sh script, 339–340
Cartesian product, 153
Cascading Style Sheets. See CSS (Cascading Style Sheets)
catalog, connecting DataCleaner with, 202
Catalog of OMG Modeling and Metadata Specifications, 352
CategorySet collector function, 399–400
causality, vs. correlation in data mining, 508
CDC (Change Data Capture), 133–137
  choosing option, 137
  intrusive and non-intrusive processes, 133
  log-based, 136–137
  methods of implementing, 226
  overview of, 133
  snapshot-based, 135–136
  source data-based, 133–134
  trigger-based, 134–135
CDF (Community Dashboard Framework), 542–569
  concepts and architecture, 532–534
  content template, 541–542
  customer and websites dashboard. See customer and websites dashboard
  dashboarding examples, 19–20
  document template, 538–541
  history of, 530–531
  home directory, 534–535
  JavaScript and CSS resources, 536–537
  overview of, 529
  plug-in, 534
  plugin.xml file, 535–536
  skills and technologies for, 531–532
  summary, 569–570
  synergy with community and Pentaho corporation, 529–530
  templates, 538
  .xcdf file, 537–538
central data warehouse, 117, 119–121
CEP (complex event processing), 141
Change Data Capture. See CDC (Change Data Capture)
CHAPTERS, axes, 453
Chart editor, 398–399
charts. See also Customers per Website pie chart
  adding bar charts to PRD reports, 400
  adding images to PRD reports, 401–404
  adding pie charts to PRD reports, 400–402
  adding to PRD reports, 397–400
  bursting. See bursting
  examples, 14–16
  including on dashboards, 548
  JPivot, 494–496
  not available in WAQR, 375
  reacting to mouse clicks on pie, 554–555
Check if Staging Table Exists step, 293–294
child objects, and inheritance, 356
Citrus, 13–14
class path, Weka data mining, 512–515
classification
  algorithms, 508–509
  as data mining tool, 506–507
  with Weka Explorer, 524
client, defined, 65
cloud computing, 144
clustering
  as data mining tool, 507
  database connection options for, 250
  remote execution with Carte using, 337, 341–342
  snowflaking and, 186–187
collapsed dimensions, 498
collector functions, 397, 399–400
colors, for PRD reports, 390–391
column profiling, 197–198, 199
columnar databases, 142–143
column-oriented databases, 502
columns
  date dimension, 213–216
  meaningful names for, 163
  quickly opening properties for, 210
  SCD type 3, 174
  using dictionary for dependency checks on, 205
COLUMNS, axes, 453–454
Columns section, OLAP Navigator, 19
columns stores, analytical databases, 142–143
Combine step, PDI
  Join Rows (Cartesian product) step, 278–281
  loading demography dimension, 283
  loading time dimension, PDI, 277
command line
  creating symbolic links from, 26
  installing SDK from, 27–28
  Linux systems using, 24–25
  running jobs and transformations from, 330–334
  setting up MySQL schemas from, 46
  starting desktop programs from, 76
comment lines, startup scripts, 52
Common Warehouse Metamodel (CWM), 70, 352
Community Dashboard Framework. See CDF (Community Dashboard Framework)
Community Edition of Pentaho
  default document templates with, 539
  Java servlet technology, 74
  overview of, 76–77
comparing data, DataCleaner, 199, 205
Competing on Analytics, 503
Complete pane, Workspace, 427
complex event processing (CEP), 141
compression, analytical databases and, 142
concepts, Pentaho metadata, 356–357
conditional formatting, schema design and, 483
conflicting data, data warehouses, 124
conformed dimensions, 115, 158–160
conformed rollup, 498
connect by prior statement, SQL, 184
connection pools
  adding to Hibernate configuration, 49–50
  database connection options, 250
  managing, 69
connections. See database connections
consistency, and transformations, 247
consolidated fact tables, 189
consultants, hiring external analysts, 193–194
content repository, 412–413, 429
content template, CDF, 533, 541–542
contents, of document templates, 540
conversion functions, PRD, 393
Copy Tables Wizard, 210
correlation, vs. causality in data mining, 508
count_rows transformation, 315
cousins, cube family relationships, 452
Create dim_date step, date dimension, 265–267
Create dim_demography step, demography dimension, 282
Create dim_time step, time dimension, 277
Create Kettle Job option, 210
Create stage_demography step, demography dimension, 282
Create Staging Table step, 293, 295–296
CREATE TABLE statement
  Create dim_date, loading date dimension, 265–267
  creating staging table, 296
  dim_demography dimension table, 283–284
  simplified date dimension table, 263–265
  stage_promotion table, 305
create_quartz_mysql.sql script, 46
create_repository_mysql.sql script, 46
create_sample_datasource.sql script, 46
credentials, Pentaho Administrative Console, 57–58
CRISP-DM (CRoss Industry Standard Process for Data Mining), 505
cron implementations, 415, 421
CROSS JOIN, 153, 155–156
CROSSJOIN function, MDX queries, 457
cross-tables, cube visualization, 447–448
crosstabs, cube visualization, 447–448
cross-validation
  compensating for biases, 509
  stratified, 509–510
  with Weka Explorer, 524
CSS (Cascading Style Sheets)
  building CDF dashboards, 532
  CDF and, 536–537
  styling dashboards, 566–567
Ctrl+Alt+N, 287
CTRL+C, 25
CTRL+R, 25
Ctrl+Spacebar, 312, 317

cubes
  adding dimensions to Mondrian schemas, 470
  adding measures to cube fact tables, 469–470
  analyzing. See JPivot
  associating with shared dimensions, 476–477
  creating, 466–467
  fact tables for, 468
  family relationships, 451–452
  FILTER function, 455–456
  MDX queries operating on, 445–446
  overview of, 446–447
  publishing, 482–483
  testing, 481–482
  visualization of, 447–448
curly braces {}, set syntax, 458–459
current job scope, 313
current year indicator, loading date dimension, 276–277
current_record column, 172–173
current_week column, 167–168, 216–218
customer and websites dashboard, 542–569
  adding TextComponent, 555–557
  boilerplate code for dashboard components, 546–547
  boilerplate code for dashboard parameters, 546
  boilerplate code for dashboard solution and path, 545–546
  custom document template for, 568–569
  Customers per Website pie chart, 548–553
  dynamically changing dashboard title, 553–554
  .html file for, 545
  MapComponent data format, 557–562
  marker options for data, 562–565
  reacting to mouse clicks on pie chart, 554–555
  setting up, 544
  showing customer locations, 557
  styling the dashboard, 565–568
  testing, 547
  .xcdf file for, 544
customer dimensions, Mondrian schemas, 478–480
customer locations. See also MapComponent
  marker options for showing, 563–565
  showing on customer and websites dashboard, 557–562
customers, WCM example
  customer order fulfillment, 94
  developing data model, 101–102
  main process flows, 96
  orders and promotions, 102–105
  websites targeting, 94–95
Customers per Website pie chart
  action sequences, 548–551
  overview of, 548
  XactionComponent, 551–553
Customize Selection screen, WAQR, 374
CWM (Common Warehouse Metamodel), 70, 352

D

DaaS (DataWarehouse as a Service), 144
Dashboard Builder, Enterprise Edition, 77
dashboard content template, CDF, 533, 541–542
dashboard document template. See document templates (outer templates), CDF
dashboards
  CDF. See CDF (Community Dashboard Framework)
  components, 543
  customers and websites. See customer and websites dashboard
  examples of, 19–20
  overview of, 529
  summary, 569–570
Dashboards object, 534
Dashboards.js, 536
data, marker options, 562–565
data acquisition and preparation, Weka data mining, 521–522
data analysis
  data profiling for, 197–198
  data warehousing and, 195–197
  using DataCleaner. See DataCleaner
data changes, promotion dimension, 301–302, 304–306
data cleansing, source data, 228
data governance, 125
data integration
  activities. See ETL (Extraction, Transformation, and Loading)
  defined, 223

data integration (continued)
  engine, 230, 232
  overview, 223–224
  using PDI. See PDI (Pentaho Data Integration)
data lineage, 114
data marts
  bus architecture, 119–121
  data warehouse architecture, 117
  independent, 119
  Inmon vs. Kimball approach, 115–116
  OLAP cubes, 121–122
  overview of, 121
  storage formats and MDX, 122–123
data marts, design, 191–220
  data analysis, 195–198. See also DataCleaner
  data modeling with Power*Architect, 208–209
  developing model, 206–208
  requirements analysis, 191–195
  source to target mapping, 218–219
  WCM example. See WCM (World Class Movies), building data marts
data mining. See also Weka data mining
  algorithms, 508–509
  association in, 507–508
  classification in, 506–507
  clustering in, 507
  defined, 504
  engine, 72–73
  further reading, 527
  numeric prediction (regression), 508
  process, 504–506
  stratified cross-validation, 509–510
  summary, 527
  toolset, 506
  training and testing, 509
data models
  developing global data mart, 206–208
  as forms of metadata, 114
  normalized vs. dimensional, 115–116
  with Power*Architect, 208–209
  reference for understanding, 208
Data Profiler, Talend, 206
data profiling
  alternative solutions, 205–206
  overview of, 197–198
  using DataCleaner. See DataCleaner
  using Power*Architect tool, 208
data quality. See DQ (data quality)
data sets, PRD reports, 381–386
data sources
  managing, 60–61
  working with subreports for different, 404–406
data staging, 226–227
data timeliness, 114
data validation, ETL and, 227–228
data vault (DV), 125–127
The Data Warehouse Lifecycle Toolkit (Kimball), 158, 170, 194
data warehousing, 111–145
  analytical databases, 142–143
  appliances, 143–144
  Changed Data Capture and, 133–137
  changing user requirements, 137–139
  data quality problems, 124–128
  data volume and performance problems, 128–133
  debate over Inmon vs. Kimball, 114–116
  on demand, 144
  example of. See WCM (World Class Movies), example business case
  need for, 112–114
  overview of, 113–114
  real-time, 140–142
  using star schemas. See star schemas
  virtual, 139–140
data warehousing architecture, 116–123
  central data warehouse, 119–121
  data marts, 121–123
  overview of, 116–118
  staging area, 118–119
The Data Warehousing Institute (TDWI), 119
database connection descriptors, 252–253, 359–360
Database Connection dialog, 249–252, 257–258, 360
database connections
  adding to data mart, 201–202
  configuring, 256–257
  creating, 249–252
  establishing to PSW, 462
  generic, 257–258
  "Hello World" example, 253–256
  JDBC and ODBC, 248
  JNDI, 319–322
  managing in Repository Explorer, 326
  managing with variables, 314–318
  to PDI repository, 322–324
  to PDI repository, automatically, 324–325

database connections (continued)
  at physical layer of metadata domain, 359–360
  testing, 252
  using, 252–253
  for Weka data mining, 512–514
database segmentation (clustering), 507
database-based repository, 358–359
databases
  column-oriented databases, 502
  connection pool management, 69
  managing drivers, 44–45
  policies prohibiting extraction of data, 226
  system, 45–52
  tools for, 31–34
  used by WCM example, 95–97
DataCleaner, 198–206
  adding database connections, 201–202
  adding profile tasks, 200–201
  alternative solution, 205–206
  doing column dependency checks, 205
  doing initial profile, 202
  overview of, 198–200
  profiling and exploring results, 204–205
  selecting source tables, 218–219
  validating and comparing data, 205
  working with regular expressions, 202–204
datacleaner-config.xml file, 201
DataWarehouse as a Service (DWaaS), 144
date and time, modeling, 165–168
date dimension
  generating, 213–216
  role-playing, 182
  special date fields and calculations, 216–218
date dimension, loading, 263–277
  Add sequence step, 268–269
  Calculator step, 269–273
  current and last year indicators, 276–277
  Execute SQL script step, 265–267
  Generate Rows step, 267–268
  internationalization and locale support, 277
  ISO week and year attributes, 276
  overview of, 263–265
  Table Output step, 275–276
  using stored procedures, 262–263
  Value Mapper step, 273–275
Date Mask Matcher profile, DataCleaner, 200
date_julian, relative time, 167–168
day of week number, date dimensions, 271–272
Days Sequence step, date dimensions, 268–269
Debian-based Linux, automatic startup in, 42
decision tree algorithms, 508–509
decoding, data integration and, 228–229
default member, MDX hierarchies, 450–451
Define process tab, action sequence editor, 81–83
degenerate dimensions, 181
Delete Job process action, Scheduler, 420
Delete link, Public Schedules pane of Workspace, 428
deleting schedules, 417
delivery layer, metadata model, 355, 365–366
demography dimension, loading, 281–286
  generating age and income groups, 284–285
  multiple ingoing and outgoing streams, 285–286
  overview of, 281–283
  stage_demography and dim_demography tables, 283–284
demography dimension, Orders data mart, 212
Demography Key sequence step, 283, 285–286
de-normalization, 115–116
dependencies, "Hello World" transformation, 247
dependency profiling, 197–198
deployment
  of PDI. See PDI (Pentaho Data Integration), deployment
  of Pentaho metadata, 366–368
descendants, cube family relationships, 452
descriptive text, for schedules, 415
design tools, schema, 444
desktop programs, 65, 74–76, 196
Details Body, PRD reports, 378
development skills, CDF dashboards, 531
Devlin, Barry, 113–114
dicing, defined, 441
dictionary, DataCleaner, 205

Dictionary Matcher profile, DataCleaner, 200
dim_demography table, 282–284
dim_promotion dimension table. See promotion dimension
dimension keys, 162, 169
dimension tables
  aggregating data, 229
  attributes in small vs. large, 206–207
  choosing for Mondrian schemas, 471–474
  fact tables vs., 148–150
  loading demography dimension, 281–286
  loading simple date dimension. See date dimension, loading
  loading simple time dimension, 277–281
  maintenance of, 229
  overview of, 262
  star schemas and, 148
  using stored procedures, 262–263
dimension usage, 476–477
dimensional model, advanced concepts, 179–189
  building hierarchies, 184–186
  consolidating multi-grain tables, 188–189
  junk, heterogeneous and degenerate dimensions, 180–181
  monster dimensions, 179–180
  multi-valued dimensions and bridge tables, 182–183
  outriggers, 188
  overview of, 179
  role-playing dimensions, 181–182
  snowflakes and clustering dimensions, 186–187
dimensional model, capturing history, 170–179
  overview of, 169–170
  SCD type 1: overwrite, 171
  SCD type 2: add row, 171–173
dimensions
  cubes and, 446
  data marts with conformed, 115
  defined, 17
  DVD and customer, 478–480
  on only one axis, 455
  slicing with OLAP Navigator, 490–491
  static, 213–216
directories
  navigation commands for, 24–25
  Repository Explorer managing, 326
  server installation, 38
  for UNIX-based systems, 39
Disconnect repository option, 324
disks, saving schemas on, 464–465
distribution of class values, prediction models, 509
Document structure, styling dashboard, 566
document templates (outer templates), CDF
  contents of, 540
  customizing, 568–569
  default templates shipping with Community Edition, 539
  examples of reusable content, 538–539
  naming conventions, 540–541
  overview of, 533
  placeholders, 540
documentation
  CDF, 531
  Pentaho Data Integration, 261–262
DOM specification, HTML standards, 532
domains, metadata, 359
DQ (data quality)
  categories of, 124–125
  data vault and, 125–127
  using reference and master data, 127–128
Dresner, Howard, 107
drill down, OLAP and, 441
drill member action, 487
drill position, 488
drill replace method, 488
drill through action, 488
drill up, OLAP and, 441
drilling, 486–488
drivers
  for data profiling in DataCleaner, 201–202
  for JNDI connections, 320–321
  managing database, 44–45
Drools, 141

DROP TABLE statement, date dimensions, 265–267
dual-boot systems, 23
Dummy step, 296–297
duplicate data, and data warehouses, 124
DV (data vault), 125–127
DVD dimensions, Mondrian schemas, 478–480
DVDs
  inventory management, 104–105
  rental reminder e-mails example, 430–438
  WCM data model, 99–102
DWaaS (DataWarehouse as a Service), 144

E

ECHO command, 29
Eclipse IDE, 77–80
ECMAScript, 532
Edit link, Public Schedules pane of Workspace, 427–428
edit modes, schema editor, 465–466
elements, PRD report, 380–381
ELT (extract, load, transform), 224
e-mail
  configuring, 52–54
  example of bursting. See rental reminder e-mails example
  Pentaho BI Server services, 70
  testing configuration, 54
Email Message tab, Mail job entry, 290–291
EMAIL process action, 436–437
email_config.xml, 52, 70
employees
  collecting requirements from, 194
  developing data model, 101, 103
  inventory management, 105
Encr.bat script, 334
encr.sh script, 334
end user layer (EUL), 117
end users, defined, 192
Enterprise Edition (EE) of Pentaho, 76–77, 530
Enterprise Resource Planning (ERP), in data analysis, 195
environment variables, installing Java, 28
eobjects.org. See DataCleaner
ERD (Entity Relationship Diagram), 102, 104
ERMaster, 32
error execution path, 288
Esper, 141
E-stats site, U.S. Census, 109
ETL (Extraction, Transformation, and Loading). See also PDI (Pentaho Data Integration)
  for back office, 118
  building "Hello World!". See Spoon, "Hello World!"
  data warehouse architecture, 117–118
  engine, 72
  overview of, 224
  scheduling independent of Pentaho BI Server, 420
  staging area optimizing, 118–119
ETL (Extraction, Transformation, and Loading), activities
  aggregation, 229
  Change Data Capture, 226
  data cleansing, 228
  data staging, 226–227
  data validation, 227–228
  decoding and renaming, 228–229
  dimension and bridge table maintenance, 229
  extraction, 226
  key management, 229
  loading fact tables, 229
  overview of, 225
ETLT (extract, transform, load, and transform), 224
EUL (end user layer), 117
examples, shipped with Pentaho
  analysis, 16–19
  charting, 14–16
  dashboarding, 19–20
  other types of, 20
  overview of, 8–9
  reporting, 11–14
  understanding, 9–11
  using repository browser, 9
Excel exports, JPivot, 494
Execute a Job dialog, Carte, 341
Execute a Transformation dialog, Carte, 341
Execute privileges, 424

Execute SQL Script step, PDI
  creating date dimension table, 265–267, 277
  creating demography dimension table, 282
  creating staging table, 295–296
Execution Results pane, Spoon, 245–246, 255
Experimenter, Weka, 510, 517–518
Explorer, Weka
  creating and saving data mining model, 523–524
  overview of, 510
  working with, 516–517
exporting
  data to spreadsheet or CSV file with WAQR, 375
  jobs and transformations, 326
  metadata to server, 367
  PRD reports, 408
  XMI files, 366
expression string function, TextComponent, 556
expressions
  MQL Query Builder limitations, 385
  in Pentaho reporting formulas, 396
eXtensible attribute-Relation File Format (XRFF), 511–512
Extensible Markup Language. See XML (Extensible Markup Language)
extract, 306–307
extract, load, transform (ELT), 224
extract, transform, load, and transform (ETLT), 224
extract_lookup_type transformation, 287, 292–293
extract_lookup_value transformation, 287, 292–293
extract_promotion transformation, 302–304
Extraction, Transformation, and Loading. See ETL (Extraction, Transformation, and Loading)
extraction process, ETL
  Change Data Capture activities, 226
  data staging activities, 226–227
  defined, 224
  overview of, 226
  supportive activities of, 225

F

F3 keyboard shortcut, 249
fact tables
  conformed dimensions linking to, 115
  consolidated, 189
  creating Orders data mart, 212
  developing global data mart data model, 208
  dimension tables vs., 148–150
  loading, 230
  Mondrian schemas, 468
  types of, 149–150
  using smart date keys to partition, 166
Fake Name Generator, 97, 101–102
family relationships, cubes, 451–452
Feature List, JNDI connections, 320–321
federated architecture, data warehousing, 119–120
fields
  creating SQL queries using JDBC, 383
  developing global data mart data model, 207–208
  formatting when using images in reports, 404
  for Missing Date step, dimension tables, 267–268
  special date, 216–218
  WAQR report, 374
file formats, Weka, 511–512
file-based repository, 358–359
files
FILTER function, MDX queries, 455–456
Filter rows step, 293–295
filters
  OLAP Navigator, 19
  schedules belonging to same group, 414–415
  WAQR reports, 374
Flash Charts, Pentaho, 15–16
flow-oriented report tools, 376
folder contents pane, user console, 8
folders, 68
folds, cross-validation and, 509–510
foreign key constraints, 150–151, 383
Format Row-Banding option, PRD reports, 391
formatting
  dates, 271
  email body in HTML, 292
  metadata layer applying consistency of, 360
  PRD report with images, 404
  PRD reports, 389–390
  report previews, 376
  WAQR reports, 374, 375
Formula step, loading date dimensions, 276–277
formulas, Pentaho reporting, 395–396
Forums link, PRD Welcome screen, 377
forward engineering, 208
Freakonomics, 503
Free Java Download button, 28
Free memory threshold, staging lookup values, 299
FROM statement, SQL, 151–152
front office, data warehouse architecture, 117–118
front-end
  Pentaho BI stack, 66
  user console as, 73
FULL OUTER JOIN, 155
full table scans, 129
functionality, Pentaho BI stack, 65
functions, report, 393–395

G

GA (generally available) release, 4
Gender and Gender Sequence step, demography dimension, 282
Gender label step, demography dimension, 282
General category, Database Connection dialog, 249–251
General tab, action sequence editor, 81, 86–89
Generate 24 hours step, time dimension, 277
Generate 60 Minutes step, time dimension, 277
Generate Rows step
  date dimension, 267–268
  demography dimension, 282–283
Generate Rows with Initial Date step, date dimension, 267–268
generating age and income groups, demography dimension, 284–285
generic database connections, 257–258, 316
geography dimension table, MapComponent, 558–559
Get System Info step, 333
getting started, 3–20
  downloading and installing software, 4–5
  logging in, 6–7
  Mantle (Pentaho user console), 7–8
  overview of, 3
  starting Pentaho BI Server, 5–6
  working with examples. See examples, shipped with Pentaho
global data mart data model, 206–208
global functions, PRD reports, 393–395
global input source, 84
global variables, user-defined, 312
Gmail, configuring, 70
GNOME terminal, 24
grand-parent job scope, 313
granularity
  consolidating multi-grain tables, 188–189
  data warehouse design, 163–164
  dimension tables and fact tables, 149
  global data mart data model, 208
  star schema modeling, 163–164
  time dimensions, 165
graphs
  adding to PRD reports. See charts
  not available in WAQR, 375
Greenplum, 143
grids, field, 244
GROUP BY statement, SQL, 151–153
Group Header/Footer, PRD reports, 378
groups
  creating schedule using, 414–415
  MQL Query Builder limitations, 385
  for PRD reports, 378, 391–393
  for UNIX-based systems, 39
  for WAQR reports, 373–374, 375
guest user, PDI repository, 324, 326–327
GUI tools, MySQL, 31

H

HAVING statement, SQL, 151, 157
headers, PRD reports, 378
"Hello World!". See Spoon, "Hello World!"
helper tables, 184
heterogeneous dimensions, 181
Hibernate database
  configuring JDBC security, 50–51
  defined, 45
  overview of, 47–50
hierarchies
  adding, 471
  adding levels, 474–476
  attributes, 472–473
  cube aggregation and, 448–449
  cube family relationships, 451–452
  levels and members, 449–451
  multiple, 451
  navigating through data using, 184–186
history
  capturing in data warehouse. See dimensional model, capturing history
  creating separate table for, 176–178
  data warehouse advantages, 113
  data warehouse retaining, 128
  storing commands, 25
HOLAP (Hybrid OLAP), 122
home directory, CDF, 534–535
Home Theater Info site, 100–101
home-grown systems, for data analysis, 195
hops
  connecting transformations and jobs, 233–234, 287
  creating in "Hello World", 242–243
  Filter row steps in staging lookup values, 295
  splitting existing, 254
horizontal partitioning, 180
Hour Sequence step, time dimension, 277
HSQLDB database server
  managing system databases by default, 45
  migrating system databases from, 46–52
  start-pentaho script holding, 6
HTML (HyperText Markup Language)
  building CDF dashboards, 531
  DOM specification, 532
  formatting mail body in, 292
  for layout, 566–567
  .html file, 545
hub and spoke architecture, 119–121
hubs, for data vault models, 125
Hybrid OLAP (HOLAP), 122
hybrid strategies, 178–179

I

icon, variable, 311
identification, for user accounts, 327–328
IDs, defining custom, 387
images
  adding to PRD reports, 401–404
  creating bootable CDs from downloaded files, 22
  virtualization software and, 23–24
importing
  parameter values to subreports, 405–406
  XMI files, 366
IN operator, 151
Income and Income Sequence step, demography dimension, 282
Income Group step, demography dimension, 282, 284–285
Income Statement sample report, 12
incomplete data, data warehouses, 124
incorrect data, data warehouses, 124
independent data marts, 119–120
indexes
  analytical databases not needing, 142
  creating in SQL, 212–213
  improving data warehouse performance, 129
InfoBright, 23, 143, 502
inheritance
  Pentaho metadata layer and, 356–357
  PRD reports using style, 389–390
init script (rc file), 40–42
initialization phase, transformations, 266–267
inline styles, HTML e-mail, 436
Inmon, Bill, 113–116
INNER JOIN, 153–154
input formats, Weka data mining, 511–512
input streams, PDI transformation sets, 285–286
inputs, action sequence, 83–85
installation, Pentaho BI Server
  directory, 38
  overview of, 4
  setting up user account, 38–39
  subdirectories, 38
integer division, date dimensions, 272
Interface address parameter, Carte, 340
International Organization for Standardization. See ISO (International Organization for Standardization)

internationalization, 277, 484
Internet Movie Database, 99
interview process, collecting requirements, 194
intrusive CDC, 133–135
invalid data, cleansing, 228
inventory, WCM example, 94–95, 104–105
ISO (International Organization for Standardization)
    images, 22
    standards, 97
    week and year attributes, 276
_iso columns, date dimension, 182–183
issue management, CDF, 531
Item Band, running reminder report, 435

J
J48 tree classifier, 508–509
.jar files, 44–45
.jar files, 201
JasperReports, 72, 372
Java
    installing and configuring, 27–29
    Pentaho programmed in, 4, 66
    servlet technology, 67, 74
java -jar weka.jar command, 514
Java Database Connectivity. See JDBC (Java Database Connectivity)
Java Naming and Directory Interface. See JNDI (Java Naming and Directory Interface)
Java Virtual Machine (JVM), 43, 314, 333–334
JAVA_HOME variable, 28
JavaMail API, 52, 54
JavaScript
    building CDF dashboards, 532
    CDF and, 536–537
    evaluation, 205
JDBC (Java Database Connectivity)
    configuring security, 50–51
    connection parameters, 462
    connection pooling, 69
    creating and editing data sources, 60–61
    creating database connections, 250–251
    creating generic database connections, 257–258
    creating SQL queries, 382–385
    managing database drivers, 44–45
    overview of, 248–249
    Weka reading data from databases using, 512–513
JDBC Explorer, 463
JDBC-ODBC bridge, 249
jdbc.properties file, 320–321
Jetty, 55, 57–58
JFreeChart, 15, 397–404
JFreeReports, 72. See also PRD (Pentaho Report Designer)
JNDI (Java Naming and Directory Interface)
    configuring connections for metadata editor, 360
    creating and editing data sources, 60–61
    creating database connections, 319–322
job entries
    defined, 287
    hops connecting, 287–288
    Mail Success and Mail Failure, 289–292
    START, 288
    Transformation, 288–289
    using variables. See variables
jobs
    adding notes to canvas, 264–265
    creating, 287–288
    creating database connections, 249
    data integration engine handling, 232
    exporting in Repository Explorer, 326
    overview of, 235
    remotely executing with Carte, 341–342
    running from command line, 330–334
    running inside Pentaho BI Server, 334–337
    running with Kitchen, 332
    running within action sequences, 335
    storing in repository, 232
    transformations vs., 232–233, 287
    user-defined variables for, 312
    using database connections, 252–253
    with variable database connection, 317–318
JOIN clauses, 151–154
join paths, 364–365
join profiling, 197–198
Join Rows (Cartesian product) step, 277–281, 283
JPivot
    analysis view, 484–485
    charts, 494–496
    drilling with pivot tables, 486–488

MDX query pane, 493 levels NON EMPTY function and, 458 adding hierarchy levels, 474–476 overview of, 442, 484 attributes, 474–475 PDF and Excel exports, 494 MultiDimensional eXpressions, 449–451 toolbar, 485 lib directory, CDF, 535 JQuery, 532 Library for Support Vector Machine jquery..js, 536 (LibSVM), 512 js directory, CDF, 535 LibSVM (Library for Support Vector jsquery.js, 536 Machine), 512 junk dimension table, 181 line charts, 543 JVM (Java Virtual Machine), 314, 333–334 links, for data vault models, 125 Linux, 24–25, 40–42 list, picking variables from, 312, 317 K List Scheduled Jobs process action, K.E.T.T.L.E. See Pentaho Data Integration Scheduler, 420 Kettle jobs, 73, 210 listeners, TextComponent, 556–557 Kettle.exe script, 236 Live mode, running Ubuntu in, 23 kettle.properties file, 312–313, 324–325 Load dim_demography step, 283 key management, data integration and, 229 Load dim_time step, 277 Kickfire appliance, 144 LOAD FILE command, MySQL, 402 Kimball, Ralph Load stage_demography step, 283 load_dim_promotion back and back office definitions, 117–118 job, 302–303, data warehouse bus architecture, 158 306–307 loading process, ETL. See also ETL Inmon data warehouse model vs., (Extraction, Transformation, and 114–116 Loading) SCD strategies, 170 defined, 224 snowflaking and, 186–187 dimension table maintenance, 229 Kitchen tool loading fact tables, 230 generic command-line parameters, supportive activities of, 225 330–332 local time, UTC (Zulu) time vs., 165–166 as PDI tool, 230–231 locale support, date dimensions, 277 running jobs/transformations from localization, in metadata layer, 349, 357 command line, 330–334 locations, customer. See customer using PDI repository with. 
See repository, locations; MapComponent PDI log-based CDC, 136–137 Klose, Ingo, 530 logging in, getting started, 6–7 KnowledgeFlow, Weka, 510, 518–519 logical consistency, 247 known values, data mining outcomes, 506 logical layer, metadata model Business Models, 362 Business Tables and Business Columns, L 362–363 last day in month, date dimensions, 273 defined, 355 last year indicator, date dimensions, purpose of, 362 276–277 relationships, 364–365 last_week column, 167–168, 216–218 Login button, getting started, 6–7 latency, remote execution reducing, 339 login modules, Jetty, 57–58 layout logos, reports with, 401–404 dashboard, 566 lookup values, staging, 286–300 PRD reports, 389–390, 393 Check if Staging Table Exists step, 294 LEFT OUTER JOIN, 154 Create Staging Table step, 295–296 586 Index ■ L–M

lookup values, staging (continued)
    Dummy step, 296–297
    extract_lookup_type/extract_lookup_value transformations, 292–293
    Filter rows step, 294–295
    Mail Success and Mail Failure job entries, 289–292
    overview of, 286–287
    Sort on Lookup Type step, 299–300
    stage_lookup_data job, 287–288
    stage_lookup_data transformation, 293–294
    START job entry, 288
    Stream Lookup step, 297–299
    Table Output step, 300
    transformation job entries, 288–289
lookup_value table, promotion mappings, 301
looping, bursting example, 432–434
lost dimensions, 498
LucidDB analytical database, 140, 142–143, 502

M
machine learning, 503. See also data mining
Mail Failure job entries, 288–292
Mail Success job entries, 288–292
mainframe systems, data analysis using, 195–196
managers, collecting requirements from, 194
Mantle. See Pentaho user console (Mantle)
MapComponent, 557–562
    adding geography dimension table, 558–559
    code for including on dashboard, 561–562
    data format, 556–557
    location data action sequence, 559–561
    marker options for showing distribution of customer locations, 563–565
    overview of, 557
mapping
    adding schema independence, 348–349
    dim_promotion table, 301
    planning loading of dimension table, 300
    relational model to multi-dimensional model, 444
    source to target, 218–219
maps, dashboard, 543
marker options, for customer locations, 562–565
Market Analysis By Year example, 18–19
market basket analysis, and association, 507
massive parallel processing (MPP) cluster, 142
master data management (MDM), 126–128
Mastering Data Warehouse Design (Imhoff et al.), 208
MAT (moving annual total), 217–218
materialized views, data warehouse performance, 130–131
MDM (master data management), 126–128
MDX (MultiDimensional eXpressions), 445–460
    calculated members, 459–460
    WITH clause for working with sets, 458–459
    CROSSJOIN function, 457
    cube family relationships, 451–452
    cubes, 446–448
    FILTER function, 455–456
    hierarchies, 448–449, 451
    levels and members, 449–451
    NON EMPTY function, 457–458
    ORDER function, 456–457
    overview of, 445–446
    query syntax, 453–455
    storage formats and, 122–123
    TOPCOUNT and BOTTOMCOUNT functions, 457
MDX query pane, JPivot, 493
MDX query tool, PSW, 481–482
measures
    adding to cube fact tables, 469–470
    cubes and, 447
    dimensions, 447
    OLAP Navigator displaying multiple, 493
    ORDER function, 456
    Slice and Dice pivot table example, 17
    transaction and snapshot fact tables, 150
member sets, OLAP Navigator specifying, 492
members, MDX, 449–451
menu bar, user console, 7
message fields, images in reports, 404
message queues, real-time data warehousing, 141

Message Template process action, 438
metadata. See also Pentaho metadata layer
    in data warehouse environment, 113–114
    displaying in DataCleaner, 200
    refreshing after publishing report to server, 407–408
    transformation, 233
    unclear, 124
    using ERP systems for data analysis, 195
Metadata Data Source Editor, 385
Metadata Query Language. See MQL (Metadata Query Language)
metadata repository, 358–359
metadata.xmi file, 367
MetaEditor.bat file, 358
metaeditor.sh file, 358
MetaMatrix solution, 140
metrics. See measures
Microsoft Windows
    automatic startup in, 43
    creating symbolic links in Vista, 26
    how PDI keeps track of repositories, 328–329
    installing Java on, 28–29
    installing MySQL GUI tools on, 31
    installing MySQL server and client on, 30
    installing Pentaho Server, 38
    installing Squirrel on, 33
    job scheduling for, 421
    obfuscated database passwords in, 334
    Pentaho Metadata Editor in, 357–358
    running Carte, 339–340
    starting and stopping PAC in, 56
    starting Pentaho BI Server, 5–6
    starting Pentaho Design Studio, 10
    starting Spoon application, 236
mini-dimensions
    improving monster dimensions with, 174–176
    junk dimension vs., 181
    vertical partitioning vs., 179–180
Minute Sequence step, time dimensions, 277
missing data, data warehouses, 124
Missing Date step, date dimensions, 267–268
mission, managing business from view of, 105–106
models, creating/saving Weka, 523
Modified JavaScript Value step, PDI, 277
Mogwai ERDesigner, 32
MOLAP (Multidimensional OLAP), 122
Mondrian
    Aggregate Designer, 130
    aggregation and, 496
    alternatives to aggregation, 502
    benefits of aggregation, 496–497
    data warehousing with, 123
    downloading, 460
    extending with aggregate tables, 497–500
    Pentaho Aggregate Designer and, 500–502
    as Pentaho's OLAP engine, 72
    types of users working with, 192
    using aggregate tables, 229
Mondrian schemas
    adding measures to cube fact tables, 469–470
    creating and editing basic, 466
    creating with PSW, 444
    cube fact tables, 468
    cubes, 466–467
    cubes, associating with dimensions, 476–477
    dimension tables, 471–474
    dimensions, adding, 470–471
    DVD and customer dimensions, 478–480
    editing tasks, 466
    hierarchies, 471–476
    other design topics, 483–484
    overview of, 444, 460
    Pentaho Schema Workbench and, 460–463
    publishing cubes, 482–483
    testing, 481–482
    using schema editor, 463–466
    XML source for, 480–481
MonetDB, 142, 502
Moneyball, 503
monitoring, defined, 309
monster dimensions
    mini-dimensions handling, 174–176
    partitioning, 179–180
month number, date dimensions, 271–272
mouse clicks, reacting to on pie chart, 554–555
moving annual total (MAT), 217–218
MPP (massive parallel processing) clusters, 142

MQL (Metadata Query Language)
    generating SQL queries with, 351, 355–356
    Pentaho Metadata Layer generating SQL from, 70–71
    storing query specifications as, 352
MQL Query Builder, 385–386
MSI installer, 30
multi-click users (builders), 192–193
multi-database support, 208
MultiDimensional eXpressions. See MDX (MultiDimensional eXpressions)
multi-dimensional model, mapping to relational model, 444
Multidimensional OLAP (MOLAP), 122
multi-disciplinary data warehouse team, 192
multiple hierarchies, MDX, 451
multiple ingoing and outgoing streams, demography dimensions, 285–286
multi-valued dimensions, and bridge tables, 182–183
Murphy, Paul, 113–114
My Schedules pane, Workspace, 427
MySQL
    GUI tools, 31
    installation, 29–31
    Kickfire for, 144
    migrating system databases to. See system databases
    nonsupport for window functions, 132
    Rollup function, 132
    setting up database connections for Weka data mining, 512–515
MySQL Administrator, 31
mysql command-line tool, 46
MySQL Query Browser, 31
MySQL Workbench, 32
mysqlbinlog, 137

N
NAICS (North American Industry Classification System), 128
naming conventions
    Components.js, 537
    data integration process, 228–229
    data warehouses, 162–163
    document templates (outer templates), 540–541
    steps, 240
native mode, Ubuntu in, 23
natural keys, 161, 229
Nautilus, 26
navigation, of data mart data, 184–186
network traffic, reduced by remote execution, 339
New Action Sequence wizard, 80
New option, PRD, 378
New Project Wizard, PDS, 80–82
New Task panel, DataCleaner, 199
No Data, PRD reports, 379
NON EMPTY function, MDX queries, 457–458
non-additive facts, 150
nonintrusive CDC, 133, 136–137
normalization, 115–116, 186–187
North American Industry Classification System (NAICS), 128
notes, transformation or job canvas, 264–265
NULL values, 124, 169
Number Analysis profile, DataCleaner, 200, 202
numeric prediction (regression), data mining, 508

O
OASI (One Attribute Set Interface), 184–186
obfuscated passwords, 314, 334
object attributes, editing with schema editor, 465
Object Management Group (OMG), 352
ODBC (Open Database Connectivity)
    creating database connections, 251
    creating generic database connections, 258
    overview of, 249
offsets, and relative time, 167
OLAP (Online Analytical Processing)
    cubes, 121–122
    data mining compared with, 503
    Navigator, 18–19
    as Pentaho BI Server component, 72
    storage formats and MDX, 122–123
OLAP Navigator
    controlling placement of dimensions on axes, 489–490
    displaying multiple measures, 493

OLAP Navigator (continued)
    overview of, 488–489
    slicing with, 490–491
    specifying member sets, 492
OMG (Object Management Group), 352
on demand data warehousing, 144
One Attribute Set Interface (OASI), 184–186
one-click users (consumers), 192–193
Online Analytical Processing. See OLAP (Online Analytical Processing)
online data, and data analysis, 196
Open Database Connectivity. See ODBC (Open Database Connectivity)
Open Symphony project, 69–70, 411–412
OpenRules, 141
Operational BI, 140–141
Options category, Database Connection dialog, 250
OR operator, 157
ORDER BY statement, SQL, 151, 158
ORDER function, MDX queries, 456–457
order numbers, as degenerate dimensions, 181
ordering data, 158
Orders data mart, creating, 210–212
organizational performance, analytics and, 503
OUTER JOIN, 153–155
outer templates, CDF. See document templates (outer templates), CDF
output
    formats, 11–14
    "Hello World" example, 246
    streams, PDI, 285–286
outriggers, 188
Over clause, window functions, 131–132
override, 356

P
P*A (Power*Architect) tool
    building data marts using, 210–212
    creating database connections to build data marts, 210
    data modeling with, 208–209
    generating databases, 212–213
    overview of, 30–31
PAC (Pentaho Administrative Console)
    basic configuration, 55–56
    creating data sources with, 60–61
    home directory, 38
    home page, 56–57
    overview of, 55
    pluggable authentication, 58
    security and credentials, 57–58
    starting and stopping, 56
    testing dashboard from, 547
    user management, 58–60
PAC (Pentaho Administrative Console), schedules
    creating new schedule, 414–416
    deleting schedules, 417
    overview of, 413
    running schedules, 416
    suspending and resuming schedules, 416–417
Package Manager, 29–30
PAD (Pentaho Aggregate Designer)
    benefits of aggregation, 496–497
    defined, 75
    enhancing performance with, 496
    extending Mondrian with aggregate tables, 497–500
    generating and populating aggregate tables, 445
    overview of, 500–502
    using aggregate tables, 229
page breaks, WAQR reports, 375
Page Header/Footer, PRD reports, 378
PAGES action sequence, 85
PAGES, axes, 453
Pan tool
    as PDI tool, 230–231
    running jobs/transformations from command line, 330–334
    using PDI repository. See repository, PDI
parameters
    custom command-line, 333–334
    dashboard, 546
    job scheduler, 418–419
    report, 13–14, 386–389
    running Carte, 340
    running jobs with Kitchen, 332
    running transformations with Pan, 332
    specifying value of, 330–331
    subreport, 405–407
    when to use instead of variables, 316–317
parent job scope, variables with, 313
parent object, and inheritance, 356
parent-child relationship, cubes, 452

Partition By clause, window functions, 131–132
partitioning
    for data warehouse performance, 129
    database connection options, 250
    horizontal, 180
    using smart date keys, 166
    vertical, 179
PAS (Pentaho Analysis Services). See also OLAP (Online Analytical Processing)
    aggregate tables, 445
    architecture, 442–444
    components, 442–443
    overview of, 441
    schema, 444
    schema design tools, 444
passwords
    connecting to repository, 324
    creating slave servers, 340
    installing MySQL on Linux, 29–30
    installing MySQL on Windows, 30
    not storing in plain-text files, 202
    obfuscated database, 314, 334
    PAC home page, 56
    publisher, 54–55
    publishing report to Pentaho BI Server, 407
    user account, 327–328
path global variable, dashboard path, 545–546
Pattern Finder profile, DataCleaner, 200
Pause button, PAC, 416
PDF files
    generated by Steel Wheels, 12
    implementing bursting in reports, 438
    JPivot exports, 494
PDI (Pentaho Data Integration), 223–259
    adding Weka plugins, 520
    checking consistency and dependencies, 247–248
    concepts, 224–230
    data integration engine, 232
    data integration overview, 223–224
    defined, 76
    designing solutions, 261–262
    Enterprise Edition and, 77
    generating dimension table data. See dimension tables
    getting started with Spoon. See Spoon
    jobs and transformations, 232–235
    loading data from source systems. See lookup values, staging; promotion dimension
    plug-in architecture, 235
    repository, 232
    Reservoir Sampling, 520
    tools and utilities, 230–231
    Weka, getting started with, 520–521
    Weka and, 519–520
    Weka data acquisition and preparation, 521–522
    Weka model, creating and saving, 523
    Weka scoring plugin, 523–524
    working with database connections, 248–258
PDI (Pentaho Data Integration), deployment, 309–343
    configuration, using JNDI connections, 319–322
    configuration, using PDI repository, 322–330
    configuration, using variables, 310–319
    configuration management, 310
    overview of, 309
    remote execution with Carte, 337–342
    running from command line, 330–334
    running inside Pentaho BI Server, 334–337
    using variables. See variables
PDI (Pentaho Data Integration), designing, 261–308
    generating dimension table data. See dimension tables
    loading data from source systems. See lookup values, staging; promotion dimension
    overview of, 261–262
    PDI repository. See repository, PDI
PDM (Pentaho Data Mining). See Weka data mining
PDS (Pentaho Design Studio), 77–89
    Action sequence editor, 80–82
    anatomy of action sequence, 83–89
    defined, 75
    Eclipse, 78–80
    overview of, 77–78
    studying examples using, 10
Peazip, 4–5
Pentaho Administrative Console. See PAC (Pentaho Administrative Console)

Pentaho Aggregate Designer. See PAD (Pentaho Aggregate Designer)
Pentaho Analysis Server. See Mondrian
Pentaho Analysis Services. See OLAP (Online Analytical Processing); PAS (Pentaho Analysis Services)
Pentaho BI Server, 66–74
    analysis view, 484–485
    building dashboards for, 529
    charting examples, 14–16
    configuring e-mail, 70
    example solutions included in, 8–9
    incorporating jobs in action sequences, 336
    incorporating transformations in action sequences, 334–336
    installing, 4, 38–43
    logging in, 6–7
    overview of, 67
    and PDI repository, 336–337
Pentaho BI stack
    front-end/back-end aspect, 66
    functionality, 65
    overview of, 63–65
    Pentaho BI Server. See Pentaho BI Server
    Pentaho EE and Community Edition, 76–77
    server, client and desktop programs, 65
    underlying technology, 66–67
    Weka data mining. See Weka data mining
Pentaho community, maintaining CDF dashboards, 529–530
Pentaho Corporation, CDF and, 530
Pentaho Data Integration. See PDI (Pentaho Data Integration)
Pentaho Data Mining (PDM). See Weka data mining
Pentaho Design Studio. See PDS (Pentaho Design Studio)
Pentaho home directory, 38
Pentaho Metadata Editor. See PME (Pentaho Metadata Editor)
Pentaho metadata layer, 347–369
    advantages of, 348–350
    concepts, 356
    creating metadata queries, 385–386
    creating PRD data sets from, 381–382
    database and query abstraction, 352–355
    defined, 347–348
    delivery layer, 365–366
    deploying and using metadata, 366–368
    inheritance, 356–357
    localization of properties, 357
    logical layer, 362–365
    metadata domains, 359
    metadata repository, 358–359
    overview of, 70–72
    Pentaho Metadata Editor, 357–358
    physical layer, 359–362
    PRD using as data source, 373
    properties, 355–356
    scope and usage of, 350–352
Pentaho Metadata Layer (PML), 70–72
Pentaho Report Designer. See PRD (Pentaho Report Designer)
Pentaho Schema Workbench. See PSW (Pentaho Schema Workbench)
Pentaho user console (Mantle)
    getting started, 7–8
    presentation layer, 73
    refreshing metadata with, 367–368

Pentaho user console (Mantle) (continued)
    testing dashboard from, 547
    welcome screen and login dialog, 6–7
Pentaho User Portal, 376
pentaho-init.sh, 41–42
PentahoSystemVersionCheck schedule, 413
performance
    analytics and, 503
    data volume and, 128–133
    file-based vs. database repositories, 358–359
periodic snapshot fact table, 150
periodical data loading, 118
permissions, managing, 327
perspectives, Eclipse IDE, 80
Petabyte, 128
Physical Columns, 360–362
physical data warehousing, 139
physical layer, metadata model
    connections, 359–360
    defined, 71, 355
    overview of, 359
    Physical Tables and Column Tables, 360–362
Physical Tables, 360–362
pie charts
    adding to PRD reports, 400–402
    Customers per Website, 548–553
    as dashboard component, 543
    reacting to mouse clicks on, 554–555
pieclicked(), 554–555
PieSet collector function, 399–402
pivot tables
    for cube visualization, 447–448
    drilling and, 486–488
    OLAP Navigator and. See OLAP Navigator
    Slice and Dice Analysis example, 17–18
    Steel Wheels Analysis examples, 18
PivotCategorySet collector function, 399
placeholders, document templates, 540
platform, Pentaho BI Server, 67–70
plug-ins
    adding Weka, 520
    dashboard requests handled by, 534–536
    installing Squirrel on Ubuntu with, 33
    PDI, 235–236
    Weka scoring, 523–524
plugin.xml file, CDF, 535–536
PME (Pentaho Metadata Editor)
    defined, 75
    defining metadata with, 350
    editing contents of metadata repository, 358–359
    overview of, 357–358
PML (Pentaho Metadata Layer), 70–72
policies, scheduling tool, 420
Pooling category, Database Connection dialog, 250
port 8080, Tomcat, 39–40
port 8099, PAC, 56
ports, running Carte, 340
PostgreSQL, 132, 143
power users (analysts), 192–193
Power*Architect. See P*A (Power*Architect) tool
PRD (Pentaho Report Designer)
    adding and modifying groups, 391–393
    adding and using parameters, 386–389
    adding charts and graphs, 397–404
    as banded report editor, 376
    creating data sets, 381–386
    creating metadata-based report, 366
    defined, 76
    exporting reports, 408
    layout and formatting, 389–390
    modifying WAQR reports in, 375
    overview of, 376–377
    publishing reports, 406–408
    report elements, 380–381
    report structure, 378–380
    row banding, 390–391
    using formulas, 395–396
    using functions, 393–395
    using subreports, 404–406
    Welcome screen, 377–378
PRD Query Designer, 397–398
predictions
    data mining for, 506
    non-numeric, 506–508
    numeric, 508
prerequisites, 21–36
    basic system setup, 22–25
    database tools, 31–34
    Java installation and configuration, 27–29
    MySQL installation, 29–31
    overview of, 21–22
    using symbolic links, 25–26
Preview menu option, PRD reports, 408
primary key indexes, 129
primary keys, 160–161, 229
private schedules, 412, 418–419
privileges
    creating repository, 323
    refining access, 349
Product Line Analysis example, 18–19
product management, databases for, 95–96
product_type, 217
profiling
    alternative solutions, 205–206
    authorization of user accounts, 327–328
    overview of, 197–198
    using DataCleaner. See DataCleaner
    using Power*Architect tool, 208
Program Files directory, 5
projects, Eclipse IDE, 79–82
promotion dimension
    data changes, 301–302
    determining promotion data changes, 304–306
    extract_promotion job, 303–304
    load_dim_promotion job, 302–303
    mappings, 301
    overview of, 300–301
    picking up file and loading extract, 306–307
    saving extract and passing on file name, 306
    synchronization frequency, 302
promotion table, 301–302
promotions, WCM example, 94–95, 102–105
prompt, running in background, 422–423
Prompt/Secure Filter action, process actions, 85
properties
    accessing report, 435
    Column Tables, 361–362
    database connections, 249–251
    generic database connections, 257–258
    Pentaho metadata layer, 355–356
    Physical Tables, 361
    PRD report, 378
    quickly opening for tables or columns, 210
    slave servers, 340
    SMTP e-mail configuration, 53
    subscribing to public schedules, 423–424
    variables, 311–312
.prpt file format, PRD reports, 376
pruning, analytical databases, 142
PSW (Pentaho Schema Workbench), 460–463
    creating Mondrian schemas, 444
    defined, 75
    downloading, 460
    establishing connection to, 462
    installing, 461
    JDBC Explorer and, 463
    MDX query tool in, 481–482
    overview of, 442
    specifying aggregate tables, 499
    starting, 461
public schedules
    allowing users to subscribe to, 423–424
    creating, 414
    defined, 412
Public Schedules pane, Workspace, 427
Publish to Server dialog, 367
publisher password, 54–55
publisher_config.xml file, 55, 407
publishing
    cubes, Mondrian schemas, 482–483
    defined, 68
    directly to Pentaho BI Server, 372, 406–408
purchase orders, WCM example, 101–105

Q
quarter number, date dimensions, 272, 273
Quartz
    configuring, 47
    defined, 45
    Enterprise Job Scheduler, 411–412
    task scheduling with, 69–70
Query Designer, SQL, 382–384
query governor, 164
query performance, data warehousing, 128–133
    aggregation, 130
    archiving, 132
    bitmap indexes, 129–130
    indexes, 129
    materialized views, 130–131
    partitioning, 129
    window functions, 131–132
query redirection, with materialized views, 130
query syntax, MDX, 453–455

querying star schemas. See star schemas, querying
Quick Start Guide, PRD, 377

R
Raju, Prashant, 51
randomness of data, prediction models, 509
rc file (init script), 40–42
Read-Only profile, PDI repository, 327
real-time data warehousing, 140–142
real-time ETL, 118
record streams, transformation, 233–234
record types, transformation, 233–234
records
    distributing through output streams, 286
    transformation, 233–234
Recurrence options, schedules, 415–416
recursion, 184
Red Hat-based Linux, automatic startup, 42
reference data, master data vs., 128
refreshing metadata
    after publishing report, 407–408
    with user console, 367–368
Regex Matcher profile, DataCleaner, 200, 202–204
regexes (regular expressions), in DataCleaner, 202–204
RegexSwap, 203
region, Orders data mart, 212
Regional Sales - HTML reporting example, 11–12
Regional Sales - Line/Bar Chart example, 16–17
regression (numeric prediction), data mining, 508
relational model, mapping, 444
Relational OLAP (ROLAP), 122, 442
relationships
    Pentaho metadata, 362, 364–365
    relative time, 452
relative time
    handling, 166–168
    relationships, 452
reminder report, running, 434–436
remote execution, 337–339
rental reminder e-mails example, 430–438
    finding customers with DVDs due this week, 431–432
    getting DVDs due to be returned, 434
    looping through customers, 432–433
    overview of, 430–431
    running reminder report, 434–436
    sending report via e-mail, 436–438
replication, data warehousing and, 134
report bursting, 85–89
Report Header/Footer, PRD, 378
Report Wizard, PRD, 378
reporting engines
    defined, 72
    Pentaho Report Designer, 76
    reporting architecture, 372
reports. See also PRD (Pentaho Report Designer)
    alternatives for creating, 64
    attaching output to e-mail message, 436–438
    bursting. See bursting
    collecting requirements with existing, 194
    examples of, 11–14
    multi-click users working with, 192
    Pentaho handling JasperReports or BIRT, 372
    Pentaho metadata layer for. See Pentaho metadata layer
    power user tools for, 192
    practical uses of WAQR, 375–376
    reminder, 434–435
    reporting architecture, 371–373
    WAQR, 72
    Web-based reporting, 373–375
repositories
    content, 412–413
    content, managing, 429
    launching Spoon with database, 236
    metadata, 358
    solution, 68
    storing jobs and transformations in database, 232
repositories.xml file, 328–329
repository, PDI, 322–330
    automatically connecting to default, 324–325
    configuring for Pentaho BI Server, 336–337
    connecting to, 323–324
    creating, 322–323
    keeping track of, 328–329
    managing user accounts, 326–328

repository, PDI (continued)
    opening Repository Explorer, 325–326
    overview of, 322
    upgrading existing, 329–330
repository browser, 8–9
Repository dialog, 322–324
Repository Explorer, PDI, 325–328
request input source, action sequences, 83–84
requirements analysis
    collecting requirements, 193–195
    for data warehouse solution, 191–192
    getting right users involved, 192–193
reserved words, avoiding for databases, 163
Reservoir Sampling, Weka, 520
Resource option, PRD Welcome screen, 377
resources directory, CDF, 535
Resources option, PRD Welcome screen, 377
resources.txt, CDF, 537
restrictions, 156–157
Resume Job process action, Scheduler, 420
Resume Scheduler process action, Scheduler, 420
resuming schedules, 416–417, 420
reverse engineering, 208
RIGHT OUTER JOIN, 155
ROLAP (Relational OLAP), 122, 442
role-playing dimensions, 181–182
roles
    managing server, 59–60
    Mondrian supporting security, 72
    schema design and, 483
Rollup function, MySQL, 132
root job scope, variables, 313–314
row banding, reports, 390–391
rows
    color banding in PRD reports, 390–391
    OLAP Navigator, 19
    SCD type 2, 171–173
    working with field grids, 244
ROWS, axes, 453–454
RTF files, exporting reports, 408
rule algorithm, for classification, 508
Run Now link, Public Schedules pane of Workspace, 427
running functions, PRD reports, 393–395
runtime input source, action sequences, 84

S
SaaS (Software as a Service), 144
Sakila sample database, 97
sampledata database, 45
samples directory, PDI, 261
satellites, data vault models based on, 125
SBI (Serialized Binary Instances), 512
scalability
    of remote execution, 338
    scale-out, 338
    scale-up, 338
    working with chart, 401–402
SCDs (Slowly Changing Dimensions)
    creating Orders data mart, 212
    developing global data mart data model, 207
    overview of, 170
    type 1: overwrite, 171
    type 2: add row, 171–173
    type 3: add column, 174
    type 4: mini-dimensions, 174–176
    type 5: separate history table, 176–178
    type 6: hybrid strategies, 178–179
Schedule privileges, 424
Scheduler
    concepts, 412–413
    creating and maintaining schedules with PAC, 413–417
    programming action sequences, 417–420
Scheduler Status process action, Scheduler, 420
schedules, defined, 412
scheduling, 411–422
    alternatives to using Scheduler, 420–422
    background execution and, 422–423
    content repository, 412–413
    creating/maintaining with PAC, 413–417
    how subscriptions work, 423–426
    managing, 61
    overview of, 411–412
    programming Scheduler with action sequences, 417–420
    public and private schedules, 412
    Scheduler concepts, 412
    user's workspace, 426–429
schema. See also Mondrian schemas
    clustering and, 342
    design tools, 444
    metadata layer limiting impact of changes to, 348–349

schema (continued)
    Pentaho Analysis Services and, 444
    setting up MySQL, 46
    using schema editor, 463–466
schema editor, 463–466
    changing edit modes, 465–466
    creating new schema with, 463–464
    editing object attributes with, 465
    overview of, 463
    saving schema on disk, 464–465
scope
    choosing variable, 313–314
    of metadata layer, 350–351
scoring plugin, Weka, 519, 525–526
SECTIONS, axes, 453
Secure Sockets Layer (SSL), 290–291
security
    automatically connecting to PDI repository and, 325
    JDBC configuration, 50–51
    Mondrian supporting roles, 72
    PAC configuration, 57–58
    slave server configuration, 340
    SMTP configuration, 54
    user configuration, 58–60
    using obfuscated database passwords for, 314
security input source, action sequences, 84–85
Select a repository dialog, Spoon, 322–323
SELECT statement, SQL, 151, 156–157
self-join, 184
semi-additive measures, 150
sequence number, relative time, 167
sequence with offset 0, relative time, 167
Serialized Binary Instances (SBI), 512
server
    administrator's workspace, 428
    Business Intelligence. See Pentaho BI Server
    Carte. See Carte server
    client and desktop programs, Pentaho BI stack, 65
    defined, 65
    HSQLDB database, 6, 45–52
    slave, 337, 340–341
    Tomcat. See Tomcat server
Server Administration Console, 368
Service Manager, Windows, 43
service.bat script, 43
servlet container, 74
servlets, Java, 67, 74
session input source, action sequences, 84
SET command, 29
Set Environment Variables dialog, 312
set up, customer and websites dashboard, 544
Set Variables step
    choosing variable scope, 313–314
    dynamic database connection example, 314–318
    limitations of, 319
    user-defined variables for, 312
    working with, 318–319
sets
    in MDX queries, 458–459
    specifying member, 492
settings.xml file, 336–337
shared dimensions, 471, 476–477
shared objects files, 256–257
shared.xml file, 256–257
shell scripts, starting PSW, 461
Show repository dialog, Spoon, 322–323
siblings, cube family relationships, 452
Simple Mail Transfer Protocol. See SMTP (Simple Mail Transfer Protocol)
single sign-on, Enterprise Edition, 77
Single Version of Truth, 126
sizing
    data warehouse advantages, 112–113
    report objects, 398
slave servers, and Carte, 337, 340–341
Slice and Dice analysis example, 17–18
slices, pie chart, 400–401
slicing/slicers
    defined, 441
    looking at part of data with, 454
    with OLAP Navigator, 490–491
Slowly Changing Dimensions. See SCDs (Slowly Changing Dimensions)
smart date keys, 166
SMTP (Simple Mail Transfer Protocol)
    authenticating request, 53–54
    basic configuration, 52–53
    e-mailing using, 52, 70
    Mail job entry configuration, 289–291
    secure configuration, 54
snapshot-based CDC, 135–136
snowflake technique
    creating Orders data mart, 212
    outriggers, 188

overview of, 186–187 building blocks for selecting data, schema design and, 483 151–153 software, downloading and installing, 4–5 Create dim_date, loading date Software as a Service (SaaS), 144 dimensions, 265–267 solution engine, Pentaho BI Server, 68 creating custom scripts for profiling, 206 solution global variable, dashboard, creating date dimension table, 265–267, 545–546 277 solution repository, Pentaho BI Server, 68 creating demography dimension table, Sort directory property, staging lookup 282 values, 300 creating queries using JDBC, 382–385 Sort on Lookup Type step, 293, 299–300 creating staging table, 295–296 Sort Rows step, 299–300 creating tables and indexes, 212–213 sorting examining metadata layer, 353–354 PRD reports, 385–386 join types, 153–156 WAQR reports, 374 loading aggregate table, 498 source code, jobs and transformation vs., MDX compared with, 453 233 Pentaho Metadata Layer generating, 70 source data-based CDC, 133–134 querying star schemas and, 151–152 Source Forge website, 4 SQL Editor, 255 source systems, 286–307 SQLeonardo, 33–34 data vault reproducing information SQLite database connections, 254, 323 stored in, 126 SQLPower, 205–206, 208 data warehouse architecture, 117 Squirrel, 32, 46 extraction of data, 226 SSL (Secure Sockets Layer), 290–291 loading data from. See lookup values, stack. See Pentaho BI stack staging; promotion dimension staff members. See employees mappings to target data warehouse, stage_demography table, 282, 283–284 218–219 stage_lookup_data transformation, staging area for, 118–119 287–288, 293–294 SourceForge website, 460 stage_promotion table, 304–305 splitting existing hops, 254 staging area Spoon for extraction of data, 226–227 Pentaho Data Integration utility, 230–231 overview of, 118–119 user-defined variables, 312 Staging table Exists step, as Dummy step, using PDI repository with. 
See repository, 296–297 PDI Standard Measures profile, DataCleaner, Spoon, ‘‘Hello World!’’ 200, 202 building transformation, 237–244 standards Execution Results pane, 245–246 data warehousing, 113, 139 launching application, 236–237 ISO, 97 output, 246 star schemas overview of, 237 building hierarchies, 184–186 running transformation, 244–245 consolidating multi-grain tables, 188–189 working with database connections, creating SQL queries using JDBC, 383 253–256 dimension tables and fact tables, 148–150 Spoon.bat script, 236 junk, heterogeneous and degenerate spoon.sh script, 236 dimensions, 180–181 spreadsheets, and data analysis, 196 MDX compared with, 447–448 Spring Security, 69–70 monster dimensions, 179–180 SQL (Structured Query Language) multi-valued dimensions and bridge applying query restrictions, 156–158 tables, 182–183 598 Index ■ S

star schemas (continued) steps, transformation outriggers, 188 building transformation in ‘‘Hello overview of, 147–148, 179 World!’’, 238–244 role-playing dimensions, 181–182 creating, moving and removing, 239 snowflakes and clustering dimensions, hops connecting, 287 186–187 horizontally or vertically aligning, 243 star schemas, capturing history, 170–179 job entries vs., 287 overview of, 169–170 overview of, 233–235 SCD type 1: overwrite, 171 types of, 239–240 SCD type 2: add row, 171–173 using variables. See variables SCD type 3: add column, 174 stop.bat,56 SCD type 4: mini-dimensions, 174–176 stop-pentaho.bat script, 51 SCD type 5: separate history table, stop-pentaho.sh script, 52 176–178 stop.sh,56 SCD type 6: hybrid strategies, 178–179 Store to Staging Table step, 294 star schemas, design principles, 160–169 stored procedures, date dimensions, audit columns, 164 262–263 granularity and aggregation, 163–164 stovepipe solution, 119–120 modeling date and time, 165–168 stratified cross-validation, data mining, naming and type conventions, 509–510 162–163 Stream Lookup step, 293, 297–299 unknown dimension keys, 169 String Analysis profile, DataCleaner, using surrogate keys, 160–162 200, 202 star schemas, querying, 150–158 structure, PRD report, 378–380 applying restrictions, 156–157 structured external data, data analysis, 196 combining multiple restrictions, 157 Structured Query Language. See SQL join types, 153–156 (Structured Query Language) ordering data, 158 style inheritance, PRD reports, 389 overview of, 150–153