
Index

Symbols parameters, 322 : (colon) , 28 object members, 522 / (slash), parameters, 322 parameters, 322 [] (square brackets), arrays, 522 , (comma) _ (underscore), tokens, 536 arrays, 522 row metadata, 28 A . (dot) Abort job utility, 185–186 row metadata, 28 Abort step, 186 UNIX metacharacter, 507 acc, 354 * (asterisk), RSS Output step, 565 Access @ (at symbol), tokens, 536 Connection, 38, 93 ^ (caret sign), UNIX metacharacter, 507 file extraction, 134 {} (curly brackets) Find Duplicate Query Wizard, 117 object literals, 522 accumulating snapshot fact table, 260–261 object members, 522 loader, 121 - (dash) loading, 264–265 parameters, 322 partitioning, 264–265 tokens, 536 action sequence $ (dollar sign), UNIX metacharacter, 507 schedule, 333 $[] (dollar sign/square brackets), variable transformations, 328–330 hexadecimal values, 45 ActiveMQ, 459 = (equals sign), parameters, 322 Add a mini-dimension, SCD, 119 # (hash mark), chat channels, 633 Add constant step, 178 %% (percent sign-double), variable names, 45 JavaScript, 396 | (pipe sign), FIFO, 247–248 add crc_src, 484 ? (question mark), XPath, 543 add crc_vault, 484 “ (quotes-double) Add File to result, RSS Output step, 567 object members, 522 Add new , SCD, 119 parameters, 322 Add sequence step string literals, 522 internal counters, 211–213 ‘ (quotes-single) rows, 405


635179bindex.indd 643 8/17/10 10:07:25 PM 644 Index n A–B

Spoon, 211–217 Java, 570–574 surrogate keys, 211–217 jobs, 573–574 Add to result filename, Get data from XML parameters, 579–580 step, 535 transformations, 572–573 Add validation msg in output, XSD Validator variables, 579–580 step, 529 web services, 516 Add XML step, 518, 540 Arbor Essbase, 269 export_xml_from_db, 538 architecture streams, 538 DWH, 8 XML, 538–541 logging, 364–367 addExport, 416 plugins, 593–599 Additional fields, Split field to rows step, 499 transformations, 452 Additional steps, Enterprise Edition, 636 archive files, 60–61 addJob, 416 arrays, JSON, 522 Addresses tab, Mail, 337 Aschauer, Bernd, 140 addResultFile(), StepInterface, 618 ASCII, 15, 18 addTrans, 415 staging area, 8 ADempiere, 15 at, 327 adjacency list, 242 atlassian.com, 632 ADLER 32, 484 Attached Files tab, Mail, 339–340 Administration Console, 330–333 attribute domain constraints, data quality, aggregate builder, ETL, 123 168 aggregate tables, 266–267 attributes Aggregation Designer, Mondrian, 123, 267 HTML, 520 Agile BI, 12–14, 302 hubs, 468 Enterprise Edition, 636 multi-valued, 498–500 agile development Palo, 285 BI, 12–14 Root XML element, 541 ETL, 301–306 rows, 497 Spoon, 301–302 SQL, 484 Agile Manifesto, 12–13 Attributes , satellites, 470 allocateSocket, 415 Attunity Stream, Oracle, 163 Amazon, EC2, 125, 427–447 audit dimension, 117, 192 Amazon Machine Image (AMI), 439–442 auditing. See also logging master, 442–443 data quality, 191–192 slaves, 443–444 ETL, 22 Ubuntu, 442 authentication, Carte, 67 AMI. See Amazon Machine Image authkeypassphrase, 641 analyseImpact(), StepMetaInterface, authorization, Palo, 285 599 auto_increment, 217 Ant build tool, Apache, 571 Apache B ActiveMQ, 459 babbelirc.com, 633 Ant build tool, 571 backtracking, 32–33 DBCP, 400 Backup System, 124 Log4J, 366 Badard, Thierry, 591 Subversion, 343, 570 balanced hierarchy, 120 VFS, 42, 349, 517, 619 BAPI. See Business Application variables, 641–642 Programming Interface API. 
See application programming interface BasePartitioner, 625 application programming interface (API), BaseStep, 614 569–571 .bat, 58 documentation, 596


batch processing, 449 Pentaho BI, 322, 327–333 batch run, 449 real-time, 450 batch_id (=audit key), 192 Business key, 468 batch-level lineage extraction, 358–359 business keys, 209–210 BCNF. See Boyce-Codd normal form dimension tables, 210, 527 Beginning XML (Hunter), 518 DV, 467 behavioral testing, 306 DWH, 209 BI. See looking up, 210 The Big Debate, 465 natural keys, 210 BigNumber, 28 Sakila, 527 Binary, 28 storing, 210 BIReady, 6 surrogate keys, 210 black box testing, 21 XML, 527 Blocking Step, multi-threading, 410 Business Objects, SQL, 9 Boolean, 28 Business Process Management (BPM), 344 Database Connection, 38 flags, 94 C job entries, 366 caching, data, 453 JSON, 522 Calculate Dimension Attributes, String, 30 transformations, 85–86 valid, 160 calculations boolean isUnconditional(), casing, 170 JobEntryInterface, 622 Validator step, 180–181 bottlenecks, transformations, 379–382 Calculator step, 108–109, 170, 171 Bouman, Roland, 228, 319, 327 Formula step, 201 Boyce-Codd normal form (BCNF), 218, 226 keys, 214 BPM. 
See Business Process Management callback, 550 branches/, 342 canvas bridge tables, 121–122 jobs, 56 browsers, web, 413 Spoon, 318 buffers transformations, 56 logging, 364, 365–366 capture groups, regular expressions, 200, 205 performance, 380 Carlton, 6 RowSet, 579 Carte, 41, 55, 57 transformations, 406–407 authentication, 67 Build Model, Spoon, 302 clustering, 57 built-in variables and properties, 637–642 dynamic clustering, 434 Bulk Loader, MySQL, 248, 249 master, 441 bulk loading slave servers, 411–416, 435 database, 390 TCP/IP, 57, 417 fact tables, 246–251 carte-config-master-8080., 434–435 LucidDB, 249 Cartesian join step, RSS Output step, 564 PostgreSQL, 250 Cartesian product, 87 Table output step, 250 Caserta, Joe, 113 bundling, 442 case-sensitivity, 507 messages, 601 casing, calculations, 170 Architecture, SCD, 118–119 Casters, Matt, 167, 315 Business Application Programming cat, 248 Interface (BAPI), 140 Catalog location, Mondrian Input step, 274 business intelligence (BI), 2 catalogs, Database Connection, 39 Agile BI, 12–14, 302, 636 CDC. See Change Data Capture ETL, 12 Central storage, 41


Change Data Capture (CDC), 16 transformations, 417–425 , 450–451 partitioning, 430 database triggers, 157–158 CMSs. See content management systems dimension table keys, 80–81 Codd, E.F. (Ted), 269 ETL, 115, 154–163 Cognos, Powerplay, 269 logging, 162–163 column profiling, 17, 146 MySQL, 162–163 Combination lookup / update step, 99 real-time , 450–451 import_xml_into_db.ktr, 527 relational , 450 Insert / Update step, 241 RFCs, 146 junk dimensions, 241 Sakila, 108 Spoon, 241 snapshots, 146, 158–162 ComboVar, 609 source data, 155–157 Comma Separated Values (CSV), 47–50, 128 , 227–228 dynamic transformations, 580–583 timestamps, 155–157, 163, 450 sets, 498 triggers, 450 command-line Change number of copies to start, User jobs, 322–326 Defined Java Class step, 404 parameters, 323–324, 325–326 CHANNEL, 641 transformations, 322–326 channel, RSS, 558–559 commands.jar, 596 Channel tab, RSS Output step, 565 commentary, ETL, 298–299 channel-log-table, 346 comments, item, 559 channels commit size, Table output step, 390 log table, 372 common.jar, 596 logging, 366 Community Edition, 635–636 chat channels, 633 Community Wiki, 631 check() compatibility mode, JavaScript, 395 JobEntryInterface, 622 complex join condition, XML Join step, 543 StepMetaInterface, 599 Compliance Reporter, ETL, 125 Check if XML file is well formed, job entries, concatenation 519 denormalization, 100–101 , 309 Validator step, 179 JavaScript, 395 Concurrent Versions System (CVS), 343 partitioning, 426 ConditionEditor, 609 child key, 242 configuration, 63–72 chmod, UNIX, 322 slave servers, 411–412 CI. See Continuous Integration conformation. See data conformation classpath, 70–71 connect by prior, Oracle, 20 cleansing. 
See Connection, Mondrian Input step, 274 clear-box testing, 306 Connection Name, Database Connection, 38, clone(), StepInterface, 616 92–93 Closure Generator step, 242–243 connection pools, 400 closure table, 242 Connection Type, Database Connection, 38, cloud computing, 433–447 92–93 cluster, kettle.pwd, 67 constraints clustering, 18–20. See also dynamic clustering dependency, data validation, 183 Carte, 57 domain attribute constraints category, 179 data pipelining, 425 performance, 392 database, 40 content management systems (CMSs), 344, schema, 417–418 626 sorting, 394 Content tab TCP/IP, 423 Add XML step, 539


Get data from XML step, 532–535 D Regex Evaluation step, 506–507 Damerau-Levenshtein , 171 RSS Input step, 561–562 data Continuous Integration (CI), 13, 450, 630 acquisition challenges, 14–16 testing, 311 caching, 453 control files, 247 delivery, 118 COPY, PostgreSQL, 250 federation, 10–11 Copy rows to result step, 162 governance, 168 CDC, 164 late-arriving, 255–260 job entries, 399 dimensions, 256–260 transformations, 576–577 ETL, 122 Copy Table, 9 facts, 256 Copy Tables, 9 migration, 9 Copy tables wizard, Spoon, 584 paths, 25 Core, Java API, 571 SAP, 140–145 Counter name field, Step sequence step, 212 semi-structured, 501–508 CPU performance, 394–398 sorting, performance, 392–394 CRC. See Cyclic Redundancy Check static, 397–398 Create custom RSS, RSS Output step, 565, 567 streams, 577 Create new rows, SCD, 119 synchronization, 9 Create Parent folder, RSS Output step, 567 traceability, 467 CRM. See Customer Relationship transformations, 576–580 Management unstructured, 501–508 Crockford, Douglas, 520 Data Cleaning and Quality Screen Handler cron, UNIX, 326–327 System, ETL, 116–117 crontab, UNIX, 326–327 Data Cleanse step, 169 CSV. See Comma Separated Values data cleansing CSV File Input step, 384–385 data governance, 168 dynamic templates, 584 data quality, 168 partitioning, 428 data validation, 179–183 rows, 394 ETL, 168–183 StepInterface, 617 reference tables, 172–179 CsvFileReader, 584 regular expressions, 203–205 CsvFileReader.java, 580–583 source data, 173 cubes, 123 data conformation dimensions, 270–271 ETL, 118 OLAP, 270–271 lookup tables, 172–175 XML/A, 278 reference tables, 175–179 current_flag, 265 data conversion CURRENT_TIMESTAMP, 480 JavaScript, 395 Customer Relationship Management (CRM), transformations, 29–30 14–15, 138–146 data extraction, 127–165 deduplication, 192–199 Access files, 134 customizations.jar, 591 CDC, 450–451 CVS. 
See Concurrent Versions System database, 134–136 Cyclic Redundancy Check (CRC) ETL, 114–116 Filter rows step, 485 Excel files, 134 Merge Join step, 485 Freebase, 553–558 NULL, 484 HTML, 520 Update step, 485 HTTP client, 137–138 input steps, 128 lineage, 358–359


metadata, 359 challenges, 16–17 parallelism, 538 data cleansing, 168 real-time, 138 Data Quality Assessment (Maydanchik), 169 SOAP, 138 Data Quality Lifecycle, 191 Spoon, 128 data types, Validator step, 179 streams, 138 data validation text files, 128–132 data cleansing, 179–183 Web, 137 DataCleaner, 148, 153 Web, 137–138 dates, 182 XBase files, 134 deduplication, 183 XML, 525–536 dependency constraints, 183 XML files, 133 error handling, 187–190 data formats metadata, 182 HTTP, 517 NULL, 17, 180–181 non-relational, 498 rules, 180–183 non-tabular, 498 Unknown, 17 web services, 517–523 XML Schema, 530 XML, 518–520 XSD Validator step, 530 Data Grid step, 135, 173 Data Validator step, 179, 182, 187–188 SAP Input step, 142 Data Vault (DV), 9, 168, 465–495 Table input step, 132 business keys, 467 testing, 311 , 486–495 data integration, 8, 569–592 database accounts, 477 challenges, 11–17 dependency, 471 continuous, 450 ETL, 477 ETL, 123 extensibility, 471 import_xml_into_db.ktr, 527 hubs, 467–468 near real-time, 450 links, 468–469 real-time, 449–461 NULL, 471 Spoon, 302 Sakila, 472–486 streaming, 450 satellites, 469–471 streams, 450 tables, 485–486 Data Integration Server, Enterprise Edition, 636 3NF, 469 data lineage, 21, 357–363 timestamps, 480 data extraction, 358–359 traceability, 471 impact analysis, 361–363 , EII, 10–11 Data Manipulation Language (DML), 246 (DWH), 2 data mapping, 297–298 architecture, 8 fields, 25 business keys, 209 Sakila, 524–525 EDW, 465 XML, 524–525 jobs, 31 data mart response time, 4 DV, 486–495 surrogate keys, 210 data pipelining The Data Warehouse ETL Toolkit (Kimball and clustering, 425 Caserta), 113, 191 multi-threading, 407–408 Data Warehouse Lifecycle Toolkit (Kimball), 11, Data Profiler, Talend, 154 113–114 data profiling, 16–17, 127–128, 146–154 Data Warehouse Toolkit (Kimball and Ross), metadata, 17 221, 228 Data Profiling System, 115 Database, Java API, 571 data quality database auditing, 191–192 accounts, 82 categories, 168–169 DV, 
477


bulk loading, 390 Date, 28 CDC, 155 Integer, conversion, 30 clustering, 40 String, conversion, 29 connection pools, 400 Date mask matcher, DataCleaner, 149 connections, multi-threading, 408–409 Date of last insert (without stream field as extraction, 134–136 source), Dimension lookup / update step, metadata, 588–590 238 OLTP, 75 Date of last insert or update (without stream partitioning, 40, 429–430 field as source), Dimension lookup / performance, 388–392 update step, 238 plugins, 627–628 Date of last update (without stream field as repositories, 348–349 source), Dimension lookup / update step, Sakila, 73–110 239 sequence, 211 datetime-stamp, 484 surrogate keys, 217 DBCP, Apache, 400 sharding, 40 dbhost, 354 shared objects, 589 deadlock, 383 sorting, 393 Debian, 59 time-outs, 453 Debug option, 312–315 triggers, CDC, 157–158 debugging Database Connection, 37–41 ETL, 21, 312–315 DataCleaner, 150–151 jobs, 56 Enable Connection Pooling, 400 logging, 364 Sakila, 90–95 real-time transformation streaming, transformations, 37, 90–95 457–478 Database join, 105 rows, 314 Database lookup step, 99, 103, 105, 178 transformations, 56 data caching, 453 decision support systems (DSS), 2 denormalization, 226 deduplication, 104 Enable cache, 253 CRM, 192–199 failure, 224–225 data validation, 183 late-arriving data, 257–258 ETL, 117–118 Load all data from table, 253 exact duplicates, 193–194 source systems, 222 non-exact duplicates, 194–195 Stream lookup step, 253, 255 transformations, 195–199 pipeline, 252 delays, parameters, 457 Database Name, Database Connection, 38 DELETE, 157 Database repository, 41 deleted records, CDC, 155 DatabaseMeta, 589 delimited text files, 128 DataCleaner DeMarco, Tom, 12 data validation, 153 denormalization, 99 Database Connection, 150–151 concatenation, 100–101 dependency, 153–154 Database lookup step, 226 dictionary, 153–154 Palo, 288 eobjects.org, 147–154 star schema, 226 JavaScript, 153 Denormalize Special Features, 104 regular expressions, 151–152 
dependency, 125 Run profiling, 152 constraints, data validation, 183 data-integration, 61 data quality, 168 date DataCleaner, 153–154 data validation, 182 dimension tables, 219 dimensions, 239 DV, 471 profiling, 146


description, metadata, 37 dimension(s)static dimensions, special Description, 332 dimension builder, 120 description, 346 dir, 324 channel, 559 directory, metadata, 36 item, 559 dispose(), StepInterface, 614 descriptive fields, 318 distrib/, 571 design distributed version control systems (DVCS), building blocks, 25–42 344 flexibility, 19 DML. See Data Manipulation Language principles, 23–25 Do Not Proceed, 90 Design ETL, Spoon, 302 Do not raise an error if no files, Get data dev, 354 from XML step, 534 dictionary, DataCleaner, 153–154 docs/api, 571 Dictionary matcher, DataCleaner, 149 Document template XML step dimension(s) export_xml_from_db, 537 cubes, 270–271 XML Join step, 542 hierarchies, 270 document type definition (DTD), 519 junk, 120, 241 document-all, 319 late-arriving data, 256–260 documentation, API, 596 mini-dimensions, 120, 239–240 domain attribute constraints category, 179 Palo, 285 dotall mode, 507 special dimension builder, 120 Double Metaphone algorithm, 171 static, 84–87 DQGuru, 118 user maintained, 120 drill-down, 314 Dimension lookup / update step, 99, 238 driver, jdbc.properties, 65 data caching, 453 DSS. See decision support systems history, 235–236 DTD. See document type definition indexes, 390–391 DTD Validator step Keys tab page, 234 job entries, 519 late-arriving data, 259 XML, 133 SCD, 232–237 DV. See Data Vault SOF, 266 DVCS. See distributed version control surrogate keys, 234–235 systems dimension manager system, ETL, 122 DVD rental business, 73–110 dimension tables DWH. 
See data warehouse business keys, 210, 527 dynamic clustering dependency, 219 Carte, 434 ETL, 207–244 cloud computing, 433–447 fact tables, 251–260 master, 434 keys, 109, 208–217 schema, 434 CDC, 80–81 dynamic ETL, 586–587 loading, 218–228 dynamic jobs, 584–586 natural keys, 99 Dynamic Slave, 445 OLTP, 226 dynamic templates, 583–584 rental star schema, 79–80 dynamic testing, 307 rows, 90 dynamic transformations SCD, 118 CSV, 580–583 snowflakes, 97, 218–225 Spoon, 580–583 star schema, 226–228 DynamicJob.java, 584 static, 84–87 surrogate keys, 209, 251–260 E surrogate primary keys, 80 E4X. See ECMAScript for XML EBS. See Elastic Block Service


E-Business Suite, Oracle, 18 EnterSelectionDialog, 609 EC2. See Elastic Computing Cloud entries, 346 ec2-ami-tools, 440 environment, 354 ec2-bundle-vol, 442 environmentSubstitute(), ec2-describe-instances, 444 StepInterface, 618–619 ECCD. See Extract, Cleanse, Conform, and eobjects.org, DataCleaner, 147–154 Deliver ERP. See Enterprise Resource Planning Eclipse, 302, 597–598, 607 Error code, Data Validator step, 187 ECMAScript, 394–396 Error description, Data Validator step, 187, ECMAScript for XML (E4X), 520 188 ecosystem, 629–634 Error fields, Data Validator step, 188 Edit button, Database lookup step, 222 error handling Edit mapping button, Insert / Update step, data acquisition, 15 231–232 data validation, 187–190 EDW. See enterprise data warehouse dynamic templates, 584 Elastic Block Service (EBS), 438 ETL, 117, 183–190 Elastic Computing Cloud (EC2), Amazon, process errors, 184–187 125, 427–447 StepInterface, 617 elements transformations errors, 186–187 HTML, 520 XSD Validator step, 530 user interface, 609 error messages, logging, 364 Elements name, Add XML step, 540 ErrorDialog, 609 ELT. See extract, load, and transform /etc/init.d/carte, 441 e‑mail ETI, 6 HTML, 339 ETL. See extract, transform, and load notifications, 336–340 ETLT. 
See extract, transform, load, transform Email Message tab, Mail, 339 eval(), JavaScript, 556 embedding, 574–590 Excel libraries, 574 file extraction, 134 Enable cache, Database lookup step, 253 OLAP, 273 Enable Connection Pooling, Database Excel Output step, 297 Connection, 400 excludeFromCopyDistribute Encoding list box, Get data from XML step, Verification(), StepMetaInterface, 533 600 Encr.bat, 58 excludeFromRowLayoutVerification(), encr.sh, 58 StepMetaInterface, 600 Engine, Java API, 571 ExecTrans.java, 576 EnterListDialog, 609 execute(), JobEntryInterface, 622 EnterMappingDialog, 609 Execute a transformation, Spoon, 413 EnterNumberDialog, 609 Execute SQL script job, Spoon Copy tables EnterPasswordDialog, 609 wizard, 584 enterprise data warehouse (EDW), 465 Execute SQL script step Enterprise Edition, 124, 302, 635–636 aggregate tables, 267 Enterprise Information Integration (EII) foreign keys, 252 data virtualization, 10–11 import_xml_into_db.ktr, 526 LucidDB, 10 Execute SQL step, multi-threading, 409–410 Enterprise Repository, Enterprise Edition, ExecuteJob.java, 573–574 636 ExecuteTrans.java, 572 Enterprise Resource Planning (ERP), 127, export 138–146 metadata, StepMetaInterface, 600 metadata, 14, 139 repositories, 350–351 plugins, 140 resource exporter, 444


exportResources() Data Cleaning and Quality Screen Handler JobEntryInterface, 622 System, 116–117 StepMetaInterface, 600 data cleansing, 168–183 export_xml_from_db, 537–538 data conformer, 118 expressions, Java, 70–71 data delivery, 118 extended description, metadata, 37 data integration manager, 123 extensibility, 593–628. See also plugins data migration, 9 DV, 471 data paths, 25 ETL, 19–20 Data Profiling System, 115 eXtensible Markup Language (XML) data synchronization, 9 Add XML step, 538–541 debugging, 21, 312–315 business keys, 527 deduplication, 117–118 data extraction, 525–536 definition, 5 data format, 518–520 design flexibility, 19 data mapping, 524–525 development lifecycle, 295–320 document construction, 538 dimension manager system, 122 document structure, 523–524 dimension tables, 207–244 documents, generating, 537–544 DV, 477 ETL metadata, 24 dynamic, 586–587 examples, 523–544 error handling, 117, 183–190 file extraction, 133 evolution, 5–6 job entries, 519–520 extensibility, 19–20 jobs, metadata, 346–347 extraction, 114–116 JSON, 520 Extraction System, 115–116 metadata, 345–347 fact table, loader, 121 repositories, 344 fact table, provider system, 122 Sakila, 523–544 flow design, 300 slave servers, 413 hierarchy dimension builder, 119–120 surrogate keys, 527 impact analysis, 125 transformations, metadata, 345–346 Job Scheduler, 124 VCS, 352 jobs, 12, 30–36 Version Migration System, 352 late-arriving data handler, 122 web services, 518–520 Lineage and Dependency eXtensible Stylesheet Language (XSL), 133 Analyzer, 125 Extension, RSS Output step, 567 logging, 22 external sort, 393 maintainability, 300–301 Extract, Cleanse, Conform, and Deliver metadata, 21, 344–350 (ECCD), 167 metadata, graphical user interface, 24 extract, load, and transform (ELT), 9 metadata, XML, 24 extract, transform, and load (ETL) Metadata Repository Manager, 125 aggregate builder, 123 MOLAP, 123 agile development, 301–306 monitoring, 333–340 audit dimension assembler, 117 
multi-valued dimension bridge table auditing, 22 builder, 121–122 Backup System, 124 names, 24, 298–299 best practices, 296–300 Parallelizing/Pipelining System, 125 BI, 12 platform independence, 18 building blocks, 7–8 Problem Escalation System, 125 CDC, 115, 154–163 RDBMS, 497 commentary, 298–299 Recovery and Restart System, 124 Compliance Reporter, 125 reuse, 19, 300–301 connectivity, 17–18 Sakila, 73–110, 81–84


scalability, 18–19 fields SCD, 118–119 JavaScript, 395 scheduling, 321–333 maps, 25 scripts, 5, 200–205 rows, 27 Security System, 125 Select values step, 397 solution documentation, 315–320 text files, 384 Sort System, 124–125 Fields grid, Regex Evaluation step, 506 special dimension builder, 120 Fields tab, 237 Spoon, 81–84 Add XML step, 539–541 subsystems, 113–126 Get data from XML step, 535–536 surrogate key creation system, 119 The fields that make up the grouping, testing, 21, 306–312 Denormalize Special Features, 104 tools, 6 FIFO. See first in, first out tools, requirements, 17–22 file, 324 transformations, 12, 25–30 File locking, 42 transformations, challenges, 20 File Output step, export_xml_from_db, 538 transparency, 24 File repository, 41 Version Control System, 124 File tab Version Migration System, 124 Get data from XML step, 531–532 Workflow Monitor, 124 RSS Output step, 566–567 extraction. See data extraction file-based version control systems, 342–344 Extraction System, 115–116 filename, metadata, 36 Extreme Programming (XP), 301 filename, PostgreSQL, 250 Filename defined in a field, RSS Output step, F 567 fact tables Filename field, RSS Output step, 567 accumulating snapshot fact table, 260–261 FileObject, 619 fact table loader, 121 Filter rows step, 198, 199 loading, 264–265 CRC, 485 partitioning, 264–265 UI, 609 bulk loading, 246–251 Filter step, 108 dimension tables, 251–260 Find Duplicate Query Wizard, Access, 117 Insert / Update step, 109 first in, first out (FIFO), 247–248 loader, 121 Fixed File Input step, 385 loading, 245–267 StepInterface, 617 periodic snapshot fact tables, 260–261 fixed width text files, 129 fact table loader, 121 flags, Boolean, 94 loading, 263–264 floating point numbers, 28 provider system, 122 JSON, 522 rental star schema, 79 “follow when result is false” job hop, 32 snapshots, 260–261 “follow when result is true” job hop, 31–32 SOF, 261–263 Force all to lower case, Database Connection, loading, 265–266 38 transaction 
grain fact tables, 121 Force all to upper case, Database facts, late-arriving data, 256 Connection, 39 failure hops, 90 , satellites, 470 Fetch Customer Address, 96–98 foreign keys Field description, transformation log Execute SQL script step, 252 tables, 369 parent key, 242 Field Splitter step, 145 , 251–252 Field to split, Split field to rows step, 499 Sakila, 105 SOF, 265 tables, 77, 208


forks, 591–592 late-arriving data, 259 FormLayout, 607–608 transformation log tables, 367, 370 Formula step, Calculator step, 201 Get update fields, Insert / Update step, forums, 631–632 231–232 Free Software Foundation (FSF), 570 Get variables step, JavaScript, 396 Freebase, 549–558 Get XPath nodes, Get data from XML step, data extraction, 553–558 522 MQL, 551–553 getCopy(), 617 performance, 552 getDialogClassName(), 622 read service, 550–551 getEntryNr, 587 scalability, 552 getExitStatus, 587 web services, 550 getFields() Wikipedia, 549–550 StepInterface, 619 FSF. See Free Software Foundation StepMetaInterface, 599, 604 full table scan, 391 getInputRowMeta(), 615 Browser, SAP, 141 getLogChannelId, 588 functional testing, 21, 306 getNrErrors, 587 fuzzy logic, 195 getNrFilesRetrieved, 588 Fuzzy match step, 170, 171, 195–198 getNrLinesDeleted, 588 getNrLinesInput, 587 G getNrLinesOutput, 587 GA. See General Availability getNrLinesRead, 587 gap penalty, 171 getNrLinesRejected, 588 gender, coding for, 175–176 getNrLinesUpdated, 588 GENERAL, job entry results, 34 getNrLinesWritten, 588 General Availability (GA), 629–630 getOptionalStreams(), 600 Generate Row with URLs, Generate Rows getPartition(), 625 step, 526 getRequiredFields(), 600 Generate Rows step, 314 getResourceDependencies() Add sequence step, 213 JobEntryInterface, 622 export_xml_from_db, 537 StepMetaInterface, 600 Freebase, 554 getResult, 587 import_xml_into_db.ktr, 526 getResultFilesList, 588 RSS Output step, 564 getRow(), 614–615, 618 SAP Input step, 142 getRowFrom(), 616 SOAP, 547 getRows, 588 generateJobMeta(), 586 getRunThread(stepname, copy), 576 GeoKettle, 591 getSlaves, 416 GeoNames, 137, 172 getSQLStatements() GET, SOAP, 548 JobEntryInterface, 622 get(), User Defined Java Class step, 620 StepMetaInterface, 599 Get closer value, 196–197 getStepIOMeta(), 600 Get data from XML step, 361, 518, 530–536 getUniqueStepCountAcrossSlaves, import_xml_into_db.ktr, 527 StepInterface, 617 SOAP, 548 
getUniqueStepNrAcrossSlaves(), 617 Get Fields button, 129 getUsedDatabaseConnections(), 622 Insert / Update step, 231 getUsedLibraries(), 600 Get File names step, 203 getXML(), 583 Get Lookup Fields button, Database lookup JobEntryInterface, 622 step, 223 StepMetaInterface, 599 Get rows from result step, 576–577 GetXMLData - Different Options.ktr, Get System Info step, 191 299 command-line parameters, 325–326 getXulOverlayFile(), 628


GIS. See Graphical Information Systems Host Name, Database Connection, 38 Globally Unique identifiers (GUID), HTML JavaScript, 395 attributes, 520 globalreplace.sh, 347 data extraction, 520 GNOME, launchers, 62–63 elements, 520 GNU Public License (GPL), 343 e‑mail, 339 GoldenGate, Oracle, 163 JavaScript, 520 good-enough solutions, regular expressions, web pages, 520 501–502 web services, 520 Goodman, Nicholas, 19 HTTP. See Hypertext Transfer Protocol Google Wave, 626 Http Authentication, Web services lookup GPL. See GNU Public License step, 545 Graphical Information Systems (GIS), 591 HTTP client step, 516 graphical user interface (GUI), 24 extraction, 137–138 Java API, 571 Freebase, 555 grid-based services, 437 import_xml_into_db.ktr, 526–527 Guess button, Insert / Update step, 232 SOAP, 547–548 GUI. See graphical user interface HTTP GET, 548 GUID. See Globally Unique identifiers Freebase, 550 guid, item, 559 HTTP Post step, 516, 548 .gzip, 517 Hub Surrogate Keys(), 469 hubs H attributes, 468 handleStreamSelection(), DV, 467–468 StepMetaInterface, 600 Sakila, 472–473 hard disks, 386–387 surrogate keys, 469 heaps, maximum size, 71 tables, 467 HellowworldStepDialog.java, 609–613 Hudson, 60 hierarchy Hunter, David, 518 dimension builder, 119–120 Hybrid, SCD, 119 dimensions, 270 Hybrid OLAP (HOLAP), 272 flattener, 20 Hyde, Julian, 458 ragged, 120 Hypertext Transfer Protocol (HTTP), 515–517 recursion, 120, 242–243 data formats, 517 variables, 120 XML/A, 277, 278 Hillyer, Mike, 74 history I data quality, 168 IaaS. See Infrastructure as a Service Dimension lookup / update step, 235–236 ../id, XPath, 535 transformation log tables, 367–368 ../@id, XPath, 535 HOLAP. See Hybrid OLAP IDE. 
See integrated development /home/ubuntu/runCarte.sh, 441 environment hop, 346 identifiers, 317 hops, 7 IDENTITY, 217 failure, 90 identity, 641 jobs, 31–32 id_existing, 480 loops, 27 Ignore comments?, Get data from XML step, rows, 26, 27 533 success, 90 Ignore empty file, Get data from XML step, transformations, 25, 26–27 534 unconditional, 88 i18n, 590 hops, 346 image, 601


Imhoff, Claudia, 296 Mondrian, 274–275 impact analysis, 22 OLAP, 274, 278–279, 281 data lineage, 361–363 process, 278 ETL, 125 Input Table step, 297 StepMetaInterface, 599 Input vault step, 479 import, repositories, 350–351 input_id, 101 Import partitions button, Partitioning input/output, 380 schema, 429 $, 67 import_xml_into_db, Get data from XML inputRowMeta, StepMetaInterface, 604 step, 532 Insert, Dimension lookup / update step, 238 import_xml_into_db.ktr, 525–527 INSERT, 157 Execute SQL script step, 526 Insert, bulk loading, 251 “in” tab, Web services lookup step, 545–546 Insert / Update step, 101–102 Include date in filename, RSS Output step, 567 accumulating snapshot fact tables, 264 Include filename in result and Filename CDC, 155, 163 fieldname, Get data from XML step, 534 Combination lookup / update step, 241 Include rownum in output, RSS Input step, fact table, 109 561–562 keys, 230–231 Include rownum in output?, Split field to SCD, 229–230 rows step, 500 Update fields, 231–232 Include stepnr in filename, RSS Output step, installation, 58–63 567 rental star schema, 81 Include time in filename, RSS Output step, 567 Sakila database, 77 Include url in output, RSS Input step, 561–562 Installer, Enterprise Edition, 636 incoming hops, 26 Integer, 27 Increment by field, Step sequence step, 212 Date, conversion, 30 incrementLinesInput(), 618 integrated development environment (IDE) incrementLinesOutput(), 618 plugins, 596–597 incrementLinesRead(), 618 Spoon, 55–57 incrementLinesRejected(), 618 Integrated Scheduling, Enterprise Edition, incrementLinesSkipped(), 618 636 incrementLinesUpdated(), 618 integration. 
See also data integration incrementLinesWritten(), 618 testing, 307 indexes internal counters, Add sequence step, performance, 390–392 211–213 tables, 392 internal variables, 428–429 IndexOfValue(), 606–607 Internal.Cluster.Master, 639 InfiniDB, 123 Internal.Cluster.Size, 639 info, 345 ${Internal.Job.Filename.Directory}, InfoBright, 123 96–97 Informatica, 9 Internal.Job.Filename.Directory, 638 Infrastructure as a Service (IaaS), 437 Internal.Job.Name, 638 init() Internal.Job.Repository.Directory, StepInterface, 614 638 User Defined Java Class step, 459 Internal.Kettle.Build.Date, 638 InjectDataIntoTransformation.java, Internal.Kettle.Build.Version, 637 578–579 Internal.Kettle.Version, 637 Injector step, 578 Internal.Slave.Server.Name, 639 Inmon, Bill, 465 Internal.Slave.Transformation. Input source step, 479 Number, 639 SQL, 483 Internal.Step.CopyNr, 639 Input step Internal.Step.Name, 639 extraction, 128 $(Internal.Step..ID), 429


Internal.Step.Partition.ID, 638 expressions, 70–71 $(Internal.Step.Partition.Number), installation, 58–59 429 user-defined expressions and classes, 520 Internal.Step.Partition.Number, 639 Java 2 Enterprise Edition (J2EE), 459 Internal.Step.Unique.Count, 639 Java Authentication and Authorization Internal.Step.Unique.Number, 639 Service (JAAS), 414 ${Internal.Transformation.Filename. Java Content Repository (JCR), 350, 626 Directory}, 96–97 Java Development Kit (JDK), 58 $(Internal.Transformation.Filename. jar, 587 Directory), 530 Java Message Service (JMS), 449, 459–461 Internal.Transformation.Filename. Java Naming and Directory Interface (JNDI), Directory, 638 64–65, 93 Internal.Transformation.Name, 638 Java Runtime Environment (JRE), 58 Internal.Transformation.Repository. variables, 642 Directory, 638 Java Virtual Machine (JVM), 19, 58 internationalization, 590 logging, 453 Internet Relay Chat (IRC), 633 rows, 397 inter-table dependencies, 183 variables, 43 interval logging, 453 java.io.tmpdir, 642 intra-table dependencies, 183 JavaScript, 20, 202 intrusive CDC, 16, 155 DataCleaner, 153 IRC. See Internet Relay Chat eval(), 556 Is a file HTML, 520 filename is defined in a field, XSD job entries, 35–36 Validator step, 529–530 logging, 366 let me specify filename, XSD Validator Mondrian, 276–277 step, 529–530 performance, 394–396 Is defined inside XML, XSD Validator step, variables, 43 529–530 XML/A, 281 ISNULL, 484 JavaScript Object Notation (JSON) ISO8601, 170 example, 549–558 isStopped, Result, 588 Modified Java Script Value step, 522 item, RSS, 559–560 plugins, 522 Item tab, RSS Output step, 565–566 syntax, 521–522 transformations, 523 J web services, 520–523 J2EE. See Java 2 Enterprise Edition XML, 520 JAAS. See Java Authentication and java.security.auth.login.config, Authorization Service kettle.properties, 414 Jackrabbit, 350 java.version, 642 jar, 587 JCR. 
See Java Content Repository .jar, 70, 517 JDBC libext/, 587 Database Connection, 93 Jaro and Jaro-Winkler algorithm, 171 drivers, 72 Jaro-Winkler algorithm, 198 MySQL, 127 Java jdbc.properties, 64–65 AMI, 440 kettle.properties, 67 API, 570–574 JD/Edwards, 18, 139 jobs, 573–574 JDK. See Java Development Kit parameters, 579–580 jedox.com, 282 transformations, 572–573 jface.jar, 596 variables, 579–580 Jira, 632 JMS. See Java Message Service

JNDI. See Java Naming and Directory Join condition properties, XML Join step, 542 Interface join profile, 146 JOB, 640 Join Rows step, 214 job, 324 Main step to read from, 398 job(s), 7 jpalo.com, 283 canvas, 56 JRE. See Java Runtime Environment command line, 322–326 JSON. See JavaScript Object Notation Database Connection, 37 JSONP, 550 debugging, 56 jtwitter, 454 DWH, 31 junk dimensions, 241 dynamic, 584–586 special dimension builder, 120 ETL, 12, 30–36 JVM. See Java Virtual Machine hops, 31–32 Java API, 573–574 K Kitchen, 57 Kalido, 6 log tables, 373–374 .kettle, repositories.xml, 68 loops, 399–400 Kettle Logging Level, 330 metadata, 36–37, 574 KettleDatabaseRepository, 627 Pan, 57 kettle-database-types.xml, 595 parallelism, 33–34, 411 KETTLE_EMPTY_STRING_DIFFERS_FROM_ performance, 399–400 NULL, 640 Run button, 83–84 NULL, 28 shared objects, 69 Kettle.exe, 55 slaves, 445 /kettle/getSlaves, 444 Spoon, 82 KETTLE_HOME, 64 variables, 89 $KETTLE_HOME/.kettle/. XML, metadata, 346–347 languageChoice, 601 job entries, 31 $KETTLE_HOME/.kettle/shared.xml, 589 backtracking, 32–33 kettle-job-entries.xml, 595 Boolean, 366 KETTLE_..._LOG_DB, 641 Copy rows to result step, 399 KETTLE_..._LOG_SCHEMA, 641 flow of execution, 90 KETTLE_LOG_SIZE_LIMIT, 640 JavaScript, 35–36 KETTLE_..._LOG_TABLE, 641 logging, 366 KETTLE_MAX_LOG_SIZE_IN_LINES, 365, 454, log table, 373–374 640 Mail, 90, 301, 337–340 KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES, plugins, 570, 621–624 454, 640 results, 34–36 kettle-partition-plugins.xml, 595 serial execution, 90 KETTLE_PASSWORD, 67 START, 88 KETTLE_PLUGIN_CLASSES, 619, 640 transformations, 88 kettle.properties, 58, 66–67, 414 XML, 519–520 logging, 365 Job Scheduler, ETL, 124 variables, 43 JOBENTRY, 641 kettle.pwd, 58, 67 @JobEntry, 622 kettle-repositories.xml, 595 JobEntryDialogInterface, 622–624 KETTLE_REPOSITORY, 67 JobEntryInterface, 622–624 KETTLE_SHARED_OBJECTS, 589, 640 jobentry-log-table, 346 KETTLE_STEP_PERFORMANCE_SNAPSHOT_ job-log-table, 346 LIMIT, 640 
JobMeta, 574, 586 kettle-steps.xml, 595 jobStatus, 416 KETTLE_USER, 67 Join comparison field, XML Join step, 543 KETTLEVFS, 619

The key field, Denormalize Special Features, libraries 104 embedding, 574 keys. See also specific key types plugins, 596 Calculator step, 214 sapjco3.jar, 141 dimension table, CDC, 80–81 StepMetaInterface, 600 dimension tables, 109, 208–217 libswt/, 598 Insert / Update step, 230–231 lightweight principle, 446–447 JSON, 522 Limit, Get data from XML step, 534 SCD, 217 Lindstedt, Dan, 9 source systems, 209 lineage. See data lineage Keys tab page, Dimension lookup / update Lineage and Dependency Analyzer, 125 step, 234 link key/value pairs, 508–513 channel, 559 object members, 522 item, 559 Regex Evaluation step, 510–511 links text files, 509–510 DV, 468–469 Kimball, Ralph, 11, 113, 167, 191, 221, 228, 295, Sakila, 473–474 465 link-to-link, 472, 474 Kitchen, 41, 44, 54, 322–326 Linstedt, Dan, 466 jobs, 57 listdir, 324 level, 336 listjobs, 324 logfile, 334 listrep, 324 logging, 364 listtrans, 325 transformations, 57 .lnk, 62 Kitchen.bat, 57 Load all date from table, 253 kitchen.sh, 57 Load DTS, 468, 469, 470 .ktr, 345 Load End DTS, 470 load_data, 298 L loading. See also bulk loading LAF. See Look and Feel lazy, 605 LAFpackage, 591 loadXML(), 599, 603 Last version (without stream field as source), location outriggers, 97 Dimension lookup / update step, 239 Locking, Enterprise Edition, 636 lastmodifiedtime, 183 logging, 333–336, 363–374 last_update, 80 architecture, 364–367 late-arriving data, 255–260 avoiding, 398 dimensions, 256–260 buffers, 364, 365–366 ETL, 122 CDC, 162–163 facts, 256 channels, 366 launchers, GNOME, 62–63 debugging, 364 Lazy Conversion, 385, 387 error messages, 364 lazy loading, 605 ETL, 22 Lesser GNU Public License (LGPL), 569–570 interval, 453 forks, 591 JavaScript job entries, 366 level, 324 JVM, 453 Kitchen, 336 kettle.properties, 365 Pan, 336 Kitchen, 364 Levenshtein algorithm, 171 levels, 335–336 LGPL. See Lesser GNU Public License memory, 365 libext, 70 Pan, 364 libext/, 591, 598 parameters, 364 .jar, 587 rows, 363

Spoon, 57, 333–334, 364, 365 Management Console, Enterprise Edition, transformations, 453–454 636 variables, 367 Manufacturing Requirements Planning log data change processing, 451 (MRP), 138 log tables, 367–374 mapping. See data mapping channels, 372 Mark Attribute rows with id of header row, job, 373–374 Modified Java Script Value step, 511 job entries, 373–374 master, 417 performance, 371 AMI, 442–443 step log tables, 370–371 Carte, 441 transformation, 367–370 dynamic clustering, 434 Log4J, Apache, 366 transformations, 421–422 logfile, 324, 334 , 435 loginmodulename, 414 Mastering Data Warehouse (Imhoff), 296 Look and Feel (LAF), 590 Max % errors allowed, Data Validator step, lookup cascade, 100 188 Lookup Language, 103 Max nr errors allowed, Data Validator step, lookup mode, 232 188 Lookup Original Language, 103 Max number of articles, RSS Input step, Lookup schema field, Database lookup step, 561–562 222–223 MAX(id) FROM test_sequence, 213–214 lookup tables, 172–175 maximum heap size, 71 lookup values, Validator step, 180 Maximum nr of lines in logging windows, Loop Xpath, Get data from XML step, Spoon, 365 532–533, 535 max_log_lines, 412 loops max_log_timeout_minutes, 412 Freebase, 557 Maydanchik, Arkady, 168–169 hops, 27 MD5, 484 jobs, 399–400 MDA. See Model Driven Architecture ls, 248 MDX. 
See Multi Dimensional eXpressions LucidDB, 10, 123 measures bulk loading, 249 Palo, 289 EII, 10 performance, 380–382 SQL, 10 SOF, 265 wrappers, 10 Mechanical Turk, 437 memory M logging, 365 Mail, 301, 336–340 lookups, 253 Addresses tab, 337 performance, 393 Attached Files tab, 339–340 Sort rows step, 453 Email Message tab, 339 Stream lookup step, 453 job entries, 90, 301, 337–340 streams, 577 Server tab, 337–338 transformations, 452, 453 Mail Failure step, 90, 185–186, 336–337 Merge join step, 479 Mail Success step, 90, 336–337 CRC, 485 Main step to read from, Join Rows step, 398 Merge Rows step, 160–161 maintainability, ETL, 300–301 MERGE/UPDATE, 249 Make transformation database transactional, message bundles, 601 Database Connection, 40 metadata man crontab, 327 data extraction, 359 Manage thread priorities?, Transformation data profiling, 17 Settings, 397 data validation, 182 database, 588–590

description, 37 Split-field step, 276 directory, 36 Strings cut step, 276 ERP, 14, 139 MonetDB, 123 ETL, 21, 344–350 monitoring, ETL, 333–340 graphical user interface, 24 Monitoring tab, Transformations settings, XML, 24 381 export, StepMetaInterface, 600 MQL. See Metaweb extended description, 37 MRP. See Manufacturing Requirements filename, 36 Planning jobs, 36–37, 574 MSAS. See Microsoft SQL Server 2008 names, 36 Analysis Services replacing, 588–590 Multi Dimensional eXpressions (MDX), repositories, 348–350 269–270 rows, 557, 606–607 Query, Mondrian Input step, 274 steps, 28 Multi-dimensional OLAP (MOLAP), 123, spreadsheets, 297 269, 272 StepMetaInterface, 599 multiline mode, 507 transformations, 36–37, 421–425, 572–573 multi-paths, backtracking, 32–33 User Defined Java Class step, 620 multiple updates, CDC, 155 values, 605–606 multi-threading, 403–411 XML, 345–347 Blocking Step, 410 jobs, 346–347 data pipelining, 407–408 transformations, 345–346 database connections, 408–409 Metadata Repository Manager, 125 Execute SQL step, 409–410 Metaphone algorithm, 171 order of execution, 409–410 Metaweb Query Language (MQL), 551–553 row distribution, 404–407 methods row merging, 405–406 partitioning, 425 multi-valued attributes, 498–500 plugins, partitioning, 624–626 multi-valued dimension bridge table builder, micro-batches, 450 ETL, 121–122 Microsoft SQL Server 2008 Analysis Services -Mxx, 60 (MSAS), 271, 277–280 MySQL, 73–74 Milestone, 630 Bulk Loader, 248, 249 Min nr of rows to read before doing % CDC, 162–163 evaluation, Data Validator step, 188 JDBC, 127 mini-dimensions, 239–240 NOW(), 484 special dimension builder, 120 RDBMS, 77, 134 mirc.com, 633 SET, 103, 498 Model, Spoon, 302–303 SUPER, 82 Model Driven Architecture (MDA), 6 mysqlbinlog, 163 Modified Java Script Value step, 314 mysql_native.xul, 628 DynamicJob, 586 Freebase, 556–557 N JSON, 522, 549 name, 346 Mark Attribute rows with id of header row, names 511 ETL, 24, 298–299 MOL A P. 
See Multi-dimensional OLAP job entry results, 34 Mondrian, 242 metadata, 36 Aggregation Designer, 123, 267 parameters, 44 Input step, 274–275 pipes, 248 JavaScript, 276–277 Namespace aware?, Get data from XML step, OLAP, 271, 273–277 533 Split Time step, 276

natural keys OLTP. See OnLine business keys, 210 Omit values from XML result, Add dimension tables, 99 XML step, 539 junk dimensions, 241 Omit XML, Add XML step, 539 near real-time data integration, 450 One Attribute Set interface (OASI), 303 Needleman-Wunsch algorithm, 171 online analytical processing (OLAP), 123 network latency, 369–370 aggregate tables, 266 network speed, 390 cubes, 270–271 New validation button, 179 Input step, 274, 278–279, 281 NIO buffer size, 386 process, 278 non-intrusive CDC, 16, 155 Mondrian, 271, 273–277 non-relational data formats, 498 multidimensional, 269 non-relational tabular formats, 498–501 Palo, 282–291 non-tabular data formats, 498 positioning, 272–273 norep, 323 storage types, 272 normalization, 218 XML/A, 277–282 Normalize Special Features, 104 OnLine Transaction Processing (OLTP), 2, notepads, 346 269–291 notes, 318 database, 75 transformations, 25 dimension tables, 226 NOW(), 484 Open Office Calc, 297 Nr of errors fieldname, Data Validator step, OLAP, 273 188 Open Symphony, 327 NULL OpenERP, 15 Add XML step, 539, 541 operating systems, scheduling, 322 CRC, 484 (ODS), 4, 10 data profiling, 146 Oracle data validation, 17, 180–181 Attunity Stream, 163 Database lookup step, 225 connect by prior, 20 DV, 471 E-Business Suite, 18 KETTLE_EMPTY_STRING_DIFFERS_FROM_ GoldenGate, 163 NULL, 28 RDBMS, 134 source data, 179 Spatial, 591 String, 28 SQL*Loader, 247, 249 Number, 27 Warehouse Builder, 6 Palo, 285 ETLT, 9 String, 29–30 Oracle Call Interface (OCI), 247 Number analysis, DataCleaner, 149 order, 346 numbers, JSON, 522 ORDER BY, 225, 389, 479 org.pentaho.di.core.database O .DatabaseInterface, 627 OASI. See One Attribute Set interface org.pentaho.di.trans.step OBF, 414 .StepInterface, 614 obfuscated passwords, 67–68, 414 original transformations, 421 object literals, JSON, 522 os.arch, 642 object members, JSON, 522 os.name, 642 object_timeout_minutes, 412 os.version, 642 OCI. 
See Oracle Call Interface OUPUT_DIR, 319 ODBC, 93 Out of Memory, 71 ODS. See operational data store outgoing hops, 26 OEM version, PDI, 590–591 output directory, 319 OLAP. See online analytical processing Output Fields, XSD Validator step, 529

Output one row, concatenate errors with methods, 425 separator, Data Validator step, 187 plugins, 624–626 Output String Field, XSD Validator step, 529 plugins, 426 Output Value, Add XML step, 539 methods, 624–626 Overwrite, SCD, 119 round robin, 425 schema, 425–427 P tables, performance, 392 PAD. See Pentaho Aggregation Designer Partitioning schema, Import partitions pagila, 74 button, 429 Pair letters similarity algorithm, 171 pass, 323 Palo, 123, 273, 274, 282–291 password, jdbc.properties, 65 Palo Cell Output step, 289–291 passwords, 58 Palo Cells Input step, 285–289 obfuscated, 67–68, 414 Palo Dimension Input step, 285–289 UI, 609 Palo Dimension Output step, 289–291 Pattern finder, DataCleaner, 149 Pan, 41, 44, 54, 322–326 patterns, 501 jobs, 57 pauseTrans, 415 level, 336 PDI. See Pentaho Data Integration logfile, 334 Pearson, William, 270 logging, 364 peer/expert reviews, 297 transformations, 57 ##pentaho, 633–634 Pan.bat, 57 Pentaho Aggregation Designer (PAD), 123 pan.sh, 57 Pentaho BI, Quartz scheduler, 322, 327–333 parallelism, 18–19 Pentaho Data Integration (PDI), 60, 328–330 data extraction, 538 AMI, 440 jobs, 33–34, 411 DataCleaner, 148 performance, 385–386 enterprise repository, 350 sorting, 393–394 Java API, 571 text files, 385–386, 387 OEM version, 590–591 transformations, 27, 404 slave servers, 413 Parallelizing/Pipelining System, ETL, 125 Pentaho Report Designer (PRD), 574–575 parameters, 318 Pentaho repository, 41 command-line, 323–324, 325–326 Pentaho Solutions (Bouman and van Dongen), delays, 457 228, 327 Java API, 579–580 PentahoSystemVersionCheck, 330–331 logging, 364 Peoplesoft, 15 named, 44 perf-log-table, 345 queries, 135–136 performance SQL, 99 buffers, 380 transformation log tables, 369 constraints, 392 transformations, 579–580 CPU, 394–398 Validator step, 179 data sorting, 392–394 Version Migration System, 353–355 database, 388–392 Parameters tab, 44 Freebase, 552 parent key, foreign keys, 242 hard disks, 386–387 Partitioner, 625–626 
indexes, 390–392 partitioning, 18–19, 425–430 JavaScript, 394–396 accumulating snapshot fact tables, 264–265 jobs, 399–400 checksums, 426 log table, 371 clustered transformations, 430 measures, 380–382 CSV File Input step, 428 memory, 393 database, 40, 429–430 parallelism, 385–386 relational databases, 390

rows, 382–383 Primary key, 468, 469, 470 SQL, 388 primary keys table partitioning, 392 satellites, 470 text files, 384–387 source system, 210 transformations, 377–398 surrogate primary keys triggers, 392 dimension tables, 80 tuning, 377–401 tables, 77 periodic snapshot fact tables, 260–261 UPDATE, 470 fact table loader, 121 Prism, 6 loading, 263–264 privacy, 308 perspective, Spoon, 302 private schedule, 331 pipes, named, 248 Problem Escalation System, 125 PIT. See Point-In-Time process, 295 pivot fields, Palo, 288 error handling, 184–187 platform independence, ETL, 18 process, OLAP Input step, 278 Plug-in Registry, 595 processRow(), 614, 615 @Step, 601 processRows(), 456, 460 plugins, 20 profiling architecture, 593–599 column profiling, 17, 146 database, 627–628 data profiling, 16–17, 127–128, 146–154 ERP, 140 metadata, 17 IDE, 596–597 dependency profiling, 146 JavaScript, 395 join profile, 146 job entries, 570, 621–624 properties @JobEntry, 622 built-in, 637–642 JSON, 522 JSON, 522 LGPL, 570 Proxy Host, Web services lookup step, 545 libraries, 596 Proxy Port, Web services lookup step, 545 methods, partitioning, 624–626 Prune Path to handle large files, Get data partitioning, 426 from XML step, 534 repositories, 626–627 pubDate, item, 559 steps, 570, 619 public schedules, 331 transformation step, 599–619 Punch through, Dimension lookup / update types of, 594–595 step, 238 plugins, 440 putError(), 617 plugins/, 595 putRow(), 362, 557, 579, 616, 618 plugins/steps, 619 putRowTo(), 616 Point-In-Time (PIT), 472 pwd/, 434 Port Number, Database Connection, 38 POST Q Freebase, 550 Quartz scheduler, Pentaho BI, 322, 327–333 SOAP, 548 queries PostGIS, 591 aggregate tables, 266 PostgresSQL, 74 parameters, 135–136 bulk loading, 250 SQL, SELECT, 553 Power*Architect, 79 Query, MDX, Mondrian Input step, 274 Powerplay, Cognos, 269 Quote all in database, Database Connection, 38 PRD. 
See Pentaho Report Designer prd, 354 R preparation of statements, 388 ragged hierarchy, 120 prepareExecution, 415 RC. See Release Candidate Preview option, 312–315 -RCxx, 60 PreviewRowsDialog, 609

RDBMS. See Relational Database performance, 390 Management System transformations, 497 RDS. See Relational Database Service Relational OLAP (ROLAP), 242, 272, 274 Read articles from, RSS Input step, 561–562 Release Candidate (RC), 60, 630 read service, Freebase, 550–551 remote execution, slave servers, 413 Read source as Url, Get data from XML Remote Function Calls (RFCs), 140, 146 step, 532 Remote Steps, 422 readRep(), StepMetaInterface, 599, 603 Rename fields step, XML/A, 280 Really Simple Syndication (RSS), 18, 558–567 rental star schema channel, 558–559 dimension tables, 79–80 item, 559–560 fact table, 79 transformations, 563 installation, 81 web services, 517 Sakila, 78–81 real-time business intelligence, 450 rep, 323 real-time data integration, 449–461 repeating groups, 500–501 CDC, 450–451 Replace in string step, 170, 203 source system, 451 Report all errors, not only the first, streaming, 452–461 Validator step, 187 real-time extraction, 138 , 436 CDC, 155, 163 repositories, 41–42 real-time transformation streaming, database, 348–349 debugging, 457–478 export, 350–351 Record source, 468, 469 files, 349 satellites, 470 import, 350–351 Recovery and Restart System, ETL, 124 managing, 350–352 Recurrence, 332 metadata, 348–350 recursion, hierarchies, 120, 242–243 plugins, 626–627 reference tables upgrade, 351–352 data cleansing, 172–179 Version Migration System, 352–353 data conformation, 175–179 XML, 344 Referencing, 42 RepositoriesMeta.readData(), 573 referential integrity, 42 repositories.xml, 68, 573 data quality, 168 Repository, 626–627 foreign keys, 251–252 Repository.loadTransformation(), 572 RefinedSoundEx algorithm, 171 RepositoryMeta, 573 Regex Evaluation step, 204, 504–508 resetStepIOMeta(), StepMetaInterface, key/value pairs, 510–511 600 Regex matcher, DataCleaner, 149 resource exporter, 444 registerSlave, 416 response time, DWH, 4 regression tests, 307 Result, 576–577, 587–588 regular expressions, 503–508 Result Fieldname, XSD Validator step, 529 
capture groups, 200, 205 Result stream properties, XML Join step, 543 data cleansing, 203–205 results tab, Web services lookup step, 546 DataCleaner, 151–152 Return/remove digits, data cleansing, 170 good-enough solutions, 501–502 reuse Validator step, 180 ETL, 19, 300–301 Relational Database Management System shared objects, 589 (RDBMS), 134 Revision management, 42 ETL, 497 RFC_READ_TABLE, 143–144 MySQL, 77 RFCs. See Remote Function Calls Relational Database Service (RDS), 438 ROLAP. See Relational OLAP relational databases, 39, 127 root, 82 CDC, 450

Root XML element, Add XML step, 539, S 540–541 SaaS. See Software as a Service Ross, Margy, 221, 228 Sakila round robin, 386 business keys, 527 partitioning, 425 CDC, 108 sorting, 394 data mapping, 524–525 roundtrips, 388 database, 73–110 row(s) installation, 77 Add sequence step, 405 subject areas, 75–76 attributes, 497 Database Connection, 90–95 CSV File Input step, 394 DV, 472–486 debugging, 314 ETL, 73–110, 81–84 dimension tables, 90 foreign keys, 105 fields, 27 hubs, 472–473 hops, 26, 27 links, 473–474 JavaScript, 395 rental star schema, 78–81 job entry results, 34 satellites, 474 JVM, 397 snowflakes, 219–221 logging, 363 Spoon, 81–84 metadata, 557, 606–607 surrogate keys, 527 steps, 28 XML, 523–544 multi-threading, 404–407 SalesForce.com input step, 140 performance, 382–383 SalesForce.com output steps, 140 Sort rows step, 419 SAP static data, 397–398 data, 140–145 Table input step, 424 Function Browser, 141 Text File Output step, 405 SAP Input step, 140 UI, 609 Data Grid step, 142 User Defined Java Class step, 404 Generate Rows step, 142 Row denormaliser step, 511–512 sapjco3.jar, 141 Palo, 288 SAP Java Connector library (sapjco3.jar), Row normaliser step, 500–501 SAP Input step, 141 RowDataUtil, 616 sapjco3.jar. See SAP Java Connector RowListener, 576 library RowMetaInterface, 604, 606–607 SAP/R3, 14, 18, 141 Rownum fieldname, Split field to rows step, Sarbanes-Oxley Act, 308 500 satellites Rownum in output and Rownum fieldname, DV, 469–471 Get data from XML step, 535 primary keys, 470 RowProducer, 577, 579 Sakila, 474 RowSet, 579, 617 WHERE, 484 RSS. See Really Simple Syndication saveRep() rss, 558–559 JobEntryInterface, 622 RSS Input step, 561–562 StepMetaInterface, 599, 603 RSS Output step, 562–567 scalability R_STEP, 348 ETL, 18–19 R_TRANSFORMATION, 348 Freebase, 552 Run button, 83–84 SCD. See Slowly Changing Dimension Run profiling, 152 Schedule Creator, 331–332 running, 439 Scheduling, Spoon, 302 runtime.jar, 596 scheduling action sequence, 333

ETL, 321–333 jobs, 69 operating systems, 322 Spoon, 69 schema. See also XML Schema transformations, 69 clustering, 417–418 shared.xml, 68–69 Database Connection, 39 shortcuts, Spoon, 62 DataCleaner, 148 shrunken or rolled dimensions, special dynamic clustering, 434 dimension builder, 120 partitioning, 425–427 Simple Object Access Protocol (SOAP) Schema name field, Add sequence step, 217 accessing services directly, 546–549 SCM. See software configuration examples, 544–549 management extraction, 138 screens, 191 OLAP, 274 Script Values step, 394–395 WDSL, 517 scripts, 20. See also JavaScript web services, 517 ETL, 5, 200–205 Web services lookup step, 544–546 startup, 70 XML/A, 277 Scrum, 13, 301 slave(s) searchInfoAndTargetSteps(), AMI, 443–444 StepMetaInterface, 600 jobs, 445 Secure Sockets Layer (SSL), 337 transformations, 421–422 Security, Enterprise Edition, 636 Slave Browser tab, Spoon, 457 Security repository, 42 slave servers Security System, 125 Carte, 411–416, 435 sed, 347 configuration, 411–412 SELECT, 553 PDI, 413 Select values step, 94, 100, 397 remote execution, 413 semi-additive, 260 services, 414–416 SOF, 265 Sort rows step, 419 semi-structured data, 501–508 Spoon, 413 Separate history table, SCD, 119 Table input step, 424 sequence_value, 213 XML, 413 serial execution, job entries, 90 slices, 271 Serialize to file step, CDC, 164 Slowly Changing Dimension (SCD), 20, Server tab, Mail, 337–338 228–239 services. 
See also web services Bus Architecture, 118–119 grid-based, 437 Dimension lookup / update step, 232–237 slave servers, 414–416 dimension tables, 118 SET, 499 Dimensional Data Warehouse, 118–119 MySQL, 103, 498 ETL, 118–119 Set Environment Variables step, 354–355 hybrid, 238–239 Set Variables step, 216 Insert / Update step, 229–230 setDefault(), 599, 604 keys, 217 SETI@Home, 433 type 1, 229–232 setOuputDone(), 615 type 2, 232–237 sets, CSV, 498 type 3, 237–238 Settings tab page, Regex Evaluation step, Small and Medium Business (SMB), 139 504–506 small periodic batches, 450 .sh, 58 smart keys, 80, 108 SHA-1, 484 SMB. See Small and Medium Business shadow copies, 31 SMTP, 337 sharding, database, 40 snapshots shared objects, 68–69 CDC, 146, 158–162 database, 589 fact tables, 121, 260–261, 263–264

Sniff test during execution, Spoon, 457–478 Split-field step, Mondrian, 276 sniffing, 314–315 Spoon, 41, 54 sniffStep, 416 Add sequence step, 211–217 snippets, User Defined Java Class step, 620 agile development, 301–302 snowflakes canvas, 318 dimension tables, 97, 218–225 Combination lookup / update step, 241 Sakila, 219–221 Copy tables wizard, 584 SOAP. See Simple Object Access Protocol dynamic transformations, 580–583 soapUI.org, 547 ETL, 81–84 SOF. See state-oriented fact tables Execute a transformation, 413 Software as a Service (SaaS), 437 extraction, 128 software configuration management (SCM), IDE, 55–57 626 jobs, 82 sorting logging, 57, 333–334, 364, 365 clustering, 394 perspective, 302 data, performance, 392–394 Sakila, 81–84 database, 393 shared objects, 69 parallelism, 393–394 shortcuts, 62 round robin, 394 Slave Browser tab, 457 Sort rows step, 479 slave servers, 413 memory, 453 Sniff test during execution, 457–478 rows, 419 transformations, 57, 82 slave servers, 419 variables, 44 Sort size (rows in memory), 393 Spoon.bat, 55, 62 Sort size (rows in memory), 393 .spoonrc, 64 Sort System, 124–125 spoon.sh, 55 Sorted Merge step, 419 spreadsheets Soundex algorithm, 171 data acquisition, 15 source code metadata, 297 Java API, 570 testing, 311 plugins, 594 SQL source data attributes, 484 CDC, 155–157 Business Objects, 9 data cleansing, 173 dynamic jobs, 584 NULL, 179 ELT, 9 PRD, 574 Informatica, 9 RSS Ouput step, 564 Input source step, 483 tabular format, 497 LucidDB, 10 source system ORDER BY, 225, 479 Database lookup step, 222 parameters, 99 keys, 209 performance, 388 primary keys, 210 query, SELECT, 553 real-time data integration, 451 StepMetaInterface, 599 Source XML field, XML Join step, 542 streams, 99 sourceforge.net, 59–60, 570 WHERE, 553 source_system, 178 SQL Server Spatial, Oracle, 591 RDBMS, 134 special dimension builder XML/A, 278 dimensions, 120 SQL statements to execute after connecting, ETL, 120 Database Connection, 39 special_features, 
103–104 SQLEditor, 609 Split field to rows step, 104, 499–500 SQL*Loader, Oracle, 247, 249 Split Time step, Mondrian, 276 SQLPower, 118, 154

SQLStream, 458 shared objects, 589 src/, 597 transformations, 26 SSL. See Secure Sockets Layer VPLs, 47–49 -stable, 60 stopJob, 416 staging area, 8 stopTrans, 415 ODS, 10 stream(s), 83 standard input (STDIN), 247–248, 250 Add XML step, 538 Standard measures, DataCleaner, 149 data, 577 standardization, 297 data integration, 450 star schema, 78–81. See also rental star editor, 347 schema extraction, 138 CDC, 227–228 memory, 577 denormalization, 226 SQL, 99 dimension tables, 226–228 StepMetaInterface, 600 tables, 495 Table output step, 538 START, job entries, 88 transformations, 452–461, 577 Start at value field, Step sequence step, 212, Web services lookup step, 517 216 XML Join step, 541 STARTDATE, 369 Stream Datefield, Dimension lookup / STARTDATE-ENDDATE, 369–370 update step, 235–236 startExec, 415 Stream lookup step, 173, 178, 253–255, 383 startJob, 416 import_xml_into_db.ktr, 527 startTrans, 415 memory, 453 startup scripts, 70 StrictHostKeyChecking, 641 state-dependent objects, data quality, 168 String, 27 state-oriented fact tables (SOF), 261–263 Boolean, 30 loading, 265–266 Date, 29 static data, rows, 397–398 NULL, 28 static dimensions Number, 29–30 special dimension builder, 120 Palo, 285 tables, 84–87 string(s), 384 static testing, 307 JSON, 522 static values, JavaScript, 396 UI, 609 status, 415 String analysis, DataCleaner, 149 STDIN. 
See standard input String getDialogClassName(), STEP, 640 StepMetaInterface, 600 step, 346 string literals, JSON, 522 @Step, Plug-in Registry, 601 Strings cut step, Mondrian, 276 _step_, 557 structural testing, 21 Step name, transformation log tables, 369 Stylus Studio, 523 Step name field, Step sequence step, 212 subscription, 635 StepDataInterface getStepData(), 600 subsystems, ETL, 113–126 StepDialogInterface, 607–613 subtansformation interface, 101 step_error_handling, 346 Subversion, Apache, 343, 570 StepInterface, 614–619 success hops, 90 StepInterface getStep(), 600 SugarCRM, 15 step-log-table, 346 SUPER, 82 stepMetaInterface, 599–607 supportsErrorHandling(), 600 StepMetaInterface check, 599 surrogate key(s), 118 steps, 7. See also specific steps Add sequence step, 211–217 outgoing hops, 26 business keys, 210 plugins, 570, 619 creation system, 119 row metadata, 28 database sequence, 217

Dimension lookup / update step, 234–235 CDC, 164 dimension tables, 209, 251–260 commit size, 390 DWH, 210 data lineage, 358 generating, 210–217 dynamic templates, 584 hubs, 469 export_xml_from_db, 538 import_xml_into_db.ktr, 527 import_xml_into_db.ktr, 527 pipeline, 121, 252–255 streams, 538 Sakila, 527 Use batch updates for inserts, 389 SOF, 266 TableInput, 595 XML, 527 table_params, 355 surrogate primary keys TableView, 609 dimension tables, 80 tabular format tables, 77 non-relational, 498–501 UPDATE, 470 source data, 497 Switch/Case step, 189–190 tags/, 342, 352 SWT, Eclipse, 607 Talend, 6 swt.jar, 596 Data Profiler, 154 synchronization, data, 9 .tar, 517 Synchronize after merge step, 160–161 Target fields sysdate, 354 Denormalize Special Features, 105 Insert / Update step, 230 T Target XML field, XML Join step, 542 tab-delimited files, 128 .tar.gz, 60 table(s). See also specific table types Task Scheduler, 327 DataCleaner, 148 TCP/IP DV, 485–486 Carte, 57, 417 foreign keys, 77, 208 clustering, 423 hubs, 467 templates, dynamic, 583–584 indexes, 392 testing link-to-link, 472, 474 automation, 311 logging, 367–374 CI, 311 channels log tables, 372 Data Grid step, 311 job entries log table, 373–374 dynamic, 307 job log table, 373–374 ETL, 21, 306–312 performance log tables, 371 integration, 307 step log tables, 370–371 spreadsheets, 311 transformation log tables, 367–370 static, 307 partitioning, performance, 392 transformations, 311 star schema, 495 upgrade, 312 static dimensions, 84–87 test_sequence.ktr, 212–213, 215 surrogate primary keys, 77 text file(s) Table daterange end, Dimension lookup / extraction, 128–132 update step, 236 Web, 137 Table input step, 103, 596 fields, 384 aggregate tables, 266 key/value pairs, 509–510 CDC, 160 parallelism, 385–386, 387 Data Grid step, 132 performance, 384–387 rows, 424 reading, 384–387 slave servers, 424 writing, 387 Stream lookup step, 254 Text file input step, 203, 384 Table output step, 216, 397 Text file output step bulk 
loading, 250 CDC, 164 rows, 405

TextVar, 609 hops, 25, 26–27 third normal form (3NF), 218 Java DV, 469 API, 572–573 threads, 397. See also multi-threading expressions, 70–71 RowProducer, 579 job entries, 88 3NF. See third normal form JSON, 523 time analysis, DataCleaner, 149 Kitchen, 57 time dimensions, 239 logging, 453–454 time-outs, databases, 453 master, 421–422 TIMESTAMP, 80 memory, 452, 453 timestamps metadata, 36–37, 421–425, 572–573 CDC, 155–157, 163, 450 notes, 25 DV, 480 Pan, 57 title, 559 parallelism, 27, 404 TLS. See Transport Layer Security parameters, 579–580 /tmp/carte.log, 441 performance, 377–398 tokens, Get data from XML step, 536 phases, 452 tools, 41 relational databases, 497 ETL, 6 RSS, 563 requirements, 17–22 Run button, 83–84 top-down level-wise loading, 219 shared objects, 69 Tortoise SVN, 570 slave, 421–422 TPC-H, 253 Spoon, 57, 82 traceability steps, 26 of data, 467 streams, 452–461, 577 DV, 471 testing, 311 TRANS, 640 variables, 89, 579–580 Trans, 577 VPLs, 46 trans, 325 XML, metadata, 345–346 transaction grain fact tables, 121 Transformation File, 330 , 345 Transformation Inputs, 330 transformation(s), 7 transformation log tables, 367–370 action sequence, 328–330 Get System Info step, 367 architecture, 452 history, 367–368 bottlenecks, 379–382 parameters, 369 buffers, 406–407 Transformation Settings, Manage thread Calculate Dimension Attributes, 85–86 priorities?, 397 canvas, 56 Transformation Step, 330 clustering, 417–425 transformation step plugins, 599–619 partitioning, 430 Transformations Settings, 368 command line, 322–326 Monitoring tab, 381 data, 576–580 transitive closure table, 242 data conversion, 29–30 trans-log-table, 345 Database Connection, 37, 90–95 TransMeta, 572, 577 debugging, 56 TransMeta.getSQLStatements(), 566 deduplication, 195–199 transparency, ETL, 24 dynamic TRANS_PERFORMANCE, 640 CSV, 580–583 Transport Layer Security (TLS), 337 Spoon, 580–583 transStatus, 415 error handling, 186–187 triggers ETL, 12, 25–30 CDC, 163, 450 challenges, 20 database, 
157–158 Get data from XML step, 532 performance, 392

Truncate, bulk loading, 251
trunk/, 342
Trunk version, 630
trunks, 283
tst, 354
Tungsten Replicator, 163
Twitter, 454–457
type, 65
TYPE_BIGNUMBER, 606
TYPE_BINARY, 606
TYPE_BOOLEAN, 606
TYPE_DATE, 606
TYPE_INTEGER, 606
TYPE_NUMBER, 606
TYPE_STRING, 606

U
UA. See User Acceptance test
Ubuntu, 439
  AMI, 442
UI. See user interface
ui/laf.properties, 591
uname, 354
unbalanced hierarchy, 120
unconditional hops, 88
unconditional job hop, 31
UniCode, 15, 507
Uniform Resource Locators (URLs), 516
  Web services lookup step, 545
Unique rows step, 193–194
unit tests, 307
UNIX, 12, 507
  chmod, 322
  cron, 326–327
  crontab, 326–327
  Kitchen, 57
  Pan, 57
  running programs, 62
Unknown, 17
unstructured data, 501–508
Unzip, AMI, 440
UPDATE, 157, 230
  surrogate primary keys, 470
Update fields, Insert / Update step, 231–232
update mode, 232
Update step
  CRC, 485
  Dimension lookup / update step, 238
upgrade
  repositories, 351–352
  testing, 312
url, jdbc.properties, 65
URLs. See Uniform Resource Locators
Use batch updates for inserts, Table output step, 389
Use Kettle Repository, 330
Use tokens, Get data from XML step, 533, 535
user, 323
  jdbc.properties, 65
User Acceptance test (UA), 307
User Console, 333
User Defined Java Class step, 620–624
  Change number of copies to start, 404
  DyanicJob, 586
  get(), 620
  init(), 459
  JavaScript, 395
  metadata, 620
  rows, 404
  snippets, 620
  variables, 43
User Defined Java Expressions step, 70–71, 202–205
  data cleansing, 202–203
user interface, 24. See also graphical user interface
  elements, 609
  StepMetaInterface, 600
user maintained dimensions, 120
User Name and Password, Database Connection, 38
user-defined expressions and classes, Java, 520
user.dir, 642
user.home, 642
user.name, 642
UTF-8, 129
  Add XML step, 539
  RSS Output step, 565

V
Vaillencourt, Luc, 591
valid, 160
Validate msg field, XSD Validator step, 529
Validate XML?, Get data from XML step, 533
validation. See data validation
Validator step, 179–180
valid_from, 265
valid_to, 265
value(s)
  JSON, 522
  metadata, 605–606
  static, 396
Value distribution, DataCleaner, 149


Value mapper step, 94, 99, 170
Value when XML is invalid, XSD Validator step, 529
Value when XML is valid, XSD Validator step, 529
ValueMetaInterface, 605–606
van der Lek, Harm, 303
van Dongen, Jos, 228, 327
VARCHAR, 29
variables, 43
  Apache VFS, 641–642
  built-in, 637–642
  hierarchy, 120
  internal, 428–429
  Java API, 579–580
  JavaScript, 396
  jobs, 89
  JRE, 642
  kettle.properties, 66
  logging, 367
  Spoon, 44
  StepInterface, 618–619
  transformations, 89, 579–580
  using, 44–45
VariableSpace, 618–619
VCS. See Version Control System
version, 324
Version Control System (VCS), 341–344
  ETL, 124
  XML, 352
Version field, Dimension lookup / update step, 235
Version Migration System, 352–355
  ETL, 124
  parameters, 353–355
  repositories, 352–353
  XML, 352
Versioning, Enterprise Edition, 636
VFS. See Virtual File System
Virtual File System (VFS), 41, 42
  Apache, 42, 349, 517, 619
  variables, 641–642
virtual machines (VM), 438
visual programming languages (VPLs), 45–51
  steps, 47–49
  transformations, 46
Visualize, Spoon, 302–303
VM. See virtual machines
VPLs. See visual programming languages

W
Warehouse Builder, 6, 9
warnings, 405
waterfall model, 12
Wavemaker, 120
WDSL, SOAP, 517
web
  browsers, slave servers, 413
  extraction, 137–138
  pages
    HTML, 520
    web services, 515–517
  text files extraction, 137
web services, 515–568
  Apache VFS, 517
  API, 516
  data formats, 517–523
  Freebase, 550
  HTML, 520
  JSON, 520–523
  RSS, 517
  SOAP, 517
  web pages, 515–517
  XML, 518–520
Web Services Description Language (WSDL), 544
Web services lookup step, 517
  SOAP, 544–546
  streams, 517
Web services tab, Web services lookup step, 545
wget, 60
WHERE, 230
  satellites, 484
  SQL, 553
white box testing, 306
whitespace, 507
widgets, 607–608
wiki, 631
Wikipedia, 549–550
Windows, 61–62
Wintner, Robert, 140
WMS. See Workflow Management Systems
Workflow Management Systems (WMS), 344
Workflow Monitor, 124
wrappers, LucidDB, 10
write back, 271
WSDL. See Web Services Description Language


X
XBase, 134
XChat, 633
xchataqua.soureforge.net, 633
XML. See eXtensible Markup Language
XML Join step, 519, 541–544
  streams, 541
XML output step, 518
  CDC, 164
XML Schema, 518, 528
  data validation, 530
  XSD Validator step, 519
XML Schema Definition, XSD Validator step, 529–530
XML source, XSD Validator step, 529
XML source from field, Get data from XML step, 532
XML source is a filename?, Get data from XML step, 532
XML source is defined in field, Get data from XML step, 532
XML source is defined in field?, Get data from XML step, 548
XML/A
  JavaScript, 281
  MSAS, 279–280
  OLAP, 277–282
  Rename fields step, 280
xml=Y, 414
-Xmx, memory, 253
XP. See Extreme Programming
XPath, 518, 532
  Get data from XML step, 535
XSD Filename, XSD Validator step, 529–530
XSD Source, XSD Validator step, 529–530
XSD Validator step, 519, 528–530
  data validation, 530
  error handling, 530
  job entries, 519
  XML, 133
xsi:schemaLocation, 529
XSL. See eXtensible Stylesheet Language
XSL Transformation job entry, 519
XSL Transformation step, 518–519
XSL Transformations (XSLT), 133
XSLT. See XSL Transformations
xstream.codehaus.org, 603
XUL, 628

Y
Yourdon, Ed, 12
YouTube, 315

Z
., 60, 517
