APPENDIX

Normalization Rules

Normalization is the process of removing data redundancy by applying normalization rules. There are five degrees of normal forms, from the first normal form through the fifth normal form, as described in this appendix.

First Normal Form

The following are the characteristics of first normal form (1NF):

• There must not be any repeating columns or groups of columns. An example of a repeating column is a customer table with Phone Number 1 and Phone Number 2 columns. Using “table (column, column)” notation, an example of a repeating group of columns is Order Table (Order ID, Order Date, Product ID, Price, Quantity, Product ID, Price, Quantity). Product ID, Price, and Quantity are the repeating group of columns.

• Each table must have a primary key (PK) that uniquely identifies each row. The PK can be composite, that is, it can consist of several columns, for example, Order Table (Order ID, Order Date, Customer ID, Product ID, Product Name, Price, Quantity). In this notation, the underlined columns are the PK; in this case, Order ID and Product ID form a composite PK.
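As a minimal sketch of removing a repeating group, the following Python snippet splits one wide order row, which carries two sets of Product ID, Price, and Quantity columns, into one row per product. The numeric column suffixes, the sample values, and the function name `to_first_normal_form` are assumptions made for illustration only.

```python
def to_first_normal_form(wide_row, group_columns, group_count):
    """Split one wide row with a repeating column group into one row per group."""
    # Names of the repeated columns, for example product_id_1 and product_id_2.
    repeated = {f"{col}_{i}" for col in group_columns for i in range(1, group_count + 1)}
    # Columns outside the repeating group are copied to every output row.
    shared = {k: v for k, v in wide_row.items() if k not in repeated}
    rows = []
    for i in range(1, group_count + 1):
        group = {col: wide_row.get(f"{col}_{i}") for col in group_columns}
        # Skip unused slots, for example an order that has only one product.
        if all(value is None for value in group.values()):
            continue
        rows.append({**shared, **group})
    return rows


# Hypothetical order row that violates 1NF.
wide_order = {
    "order_id": 1001, "order_date": "2007-03-01",
    "product_id_1": "P10", "price_1": 9.99, "quantity_1": 2,
    "product_id_2": "P20", "price_2": 4.50, "quantity_2": 1,
}

order_rows = to_first_normal_form(wide_order, ["product_id", "price", "quantity"], 2)
for row in order_rows:
    print(row)
```

Each resulting row can be identified by the combination of `order_id` and `product_id`, which corresponds to the composite PK described above.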

Second Normal Form

The following are the characteristics of second normal form (2NF):

• It must be in 1NF.

• When each value in column 1 is associated with exactly one value in column 2, we say that column 2 is dependent on column 1, for example, Customer (Customer ID, Customer Name). Customer Name is dependent on Customer ID, noted as Customer ID ➤ Customer Name.

• In 2NF, all non-PK columns must be dependent on the entire PK, not just on part of it, for example, Order Table (Order ID, Order Date, Product ID, Price, Quantity). Order ID and Product ID form a composite PK. Order Date is dependent on Order ID but not on Product ID. This violates 2NF.

• To make it 2NF, we need to break it into two tables: Order Header (Order ID, Order Date) and Order Item (Order ID, Product ID, Price, Quantity). Now all non-PK columns are dependent on the entire PK. In the Order Header table, Order Date is dependent on Order ID. In the Order Item table, Price and Quantity are dependent on Order ID and Product ID. Order ID in the Order Item table is a foreign key.
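The partial dependency and the split into two tables can be sketched in Python as follows. This is an illustration under stated assumptions: the sample rows and the helper `depends_on` are hypothetical, and a check against sample data can only refute a dependency, not prove that it holds for all possible data.

```python
def depends_on(rows, determinant, dependent):
    """Return True if every determinant value maps to one dependent value in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        value = row[dependent]
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical Order Table rows; the composite PK is (order_id, product_id).
order_table = [
    {"order_id": 1, "order_date": "2007-03-01", "product_id": "P10", "price": 9.99, "quantity": 2},
    {"order_id": 1, "order_date": "2007-03-01", "product_id": "P20", "price": 4.50, "quantity": 1},
    {"order_id": 2, "order_date": "2007-03-02", "product_id": "P10", "price": 8.99, "quantity": 5},
]

# Order Date is determined by Order ID alone, which is only part of the PK.
partial_dependency = depends_on(order_table, ["order_id"], "order_date")

# Quantity is not determined by either half of the PK on its own.
quantity_needs_full_pk = (
    not depends_on(order_table, ["order_id"], "quantity")
    and not depends_on(order_table, ["product_id"], "quantity")
)

# Decompose into Order Header and Order Item.
order_header = sorted({(r["order_id"], r["order_date"]) for r in order_table})
order_item = [(r["order_id"], r["product_id"], r["price"], r["quantity"]) for r in order_table]

print(partial_dependency, quantity_needs_full_pk)
print(order_header)
print(order_item)
```

After the split, each order date is stored once per order in `order_header` instead of once per order line.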

Third Normal Form

The following are the characteristics of third normal form (3NF):

• It must be in 2NF.

• If column 1 is dependent on column 2 and column 2 is dependent on column 3, we say that column 1 is transitively dependent on column 3. In 3NF, no column is transitively dependent on the PK, for example, Product (Product ID, Product Name, Category ID, Category Name). Category Name is dependent on Category ID, and Category ID is dependent on Product ID. Category Name is therefore transitively dependent on the PK (Product ID). This violates 3NF.

• To make it 3NF, we need to break it into two tables: Product (Product ID, Product Name, Category ID) and Category (Category ID, Category Name). Now no column is transitively dependent on the PK. Category ID in the Product table is a foreign key.
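The Product and Category split can be sketched with Python's standard `sqlite3` module and an in-memory database. The table and column names follow the example above; the sample rows and the renamed category are assumptions made for illustration, and SQLite is used here only as a convenient stand-in for whatever database engine is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT NOT NULL
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category_id  INTEGER NOT NULL REFERENCES category (category_id)
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO category VALUES (?, ?)", [(1, "Music"), (2, "Film")])
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(10, "Album A", 1), (11, "Album B", 1), (20, "Movie C", 2)],
)

# The category name is stored once, so renaming it touches a single row.
conn.execute("UPDATE category SET category_name = 'Audio' WHERE category_id = 1")

# A join brings Category Name back alongside each product.
rows = conn.execute("""
    SELECT p.product_id, p.product_name, c.category_name
    FROM product AS p
    JOIN category AS c ON c.category_id = p.category_id
    ORDER BY p.product_id
""").fetchall()
for row in rows:
    print(row)
```

In the unnormalized Product table, the same rename would have to be repeated on every product row in that category.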

Boyce-Codd Normal Form

Boyce-Codd Normal Form (BCNF) is between 3NF and 4NF. The following are the characteristics of BCNF:

• It must be in 3NF.

• In Customer ID ➤ Customer Name, we say that Customer ID is a determinant. In BCNF, every determinant must be a candidate PK. A candidate PK means capable of being a PK; that is, it uniquely identifies each row.

• BCNF is applicable to situations where you have two or more candidate composite PKs, such as with a cable TV service engineer visiting customers: Visit (Date, Route ID, Shift ID, Customer ID, Engineer ID, Vehicle ID). A visit to a customer can be identified using Date, Route ID, and Customer ID as the composite PK. Alternatively, the PK can be Shift ID and Customer ID. Shift ID is the determinant of Date and Route ID. Because Shift ID is a determinant but is not a candidate PK on its own, the Visit table in this form violates BCNF.
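The rule that every determinant must be a candidate PK can be checked mechanically once the dependencies are written down. The sketch below is a minimal illustration, not a complete design tool. The dependency list is an assumption: only Shift ID ➤ Date, Route ID is stated above, and the second dependency is inferred from the statement that Date, Route ID, and Customer ID can identify a visit. The function names `closure` and `bcnf_violations` are hypothetical.

```python
def closure(attributes, dependencies):
    """Return every column that the given attributes determine."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for determinant, dependents in dependencies:
            # Apply a dependency when its whole determinant is already covered.
            if set(determinant) <= result and not set(dependents) <= result:
                result |= set(dependents)
                changed = True
    return result


def bcnf_violations(columns, dependencies):
    """List determinants whose closure does not cover every column of the table."""
    return [det for det, _ in dependencies if closure(det, dependencies) != set(columns)]


visit_columns = ["date", "route_id", "shift_id", "customer_id", "engineer_id", "vehicle_id"]

# Assumed dependencies for the Visit table (see the lead-in).
visit_dependencies = [
    (("shift_id",), ("date", "route_id")),
    (("date", "route_id", "customer_id"), ("shift_id", "engineer_id", "vehicle_id")),
]

violations = bcnf_violations(visit_columns, visit_dependencies)
print(violations)
```

Under these assumptions the only violating determinant is Shift ID. One possible way to remove the violation is to move Date and Route ID into a separate table keyed by Shift ID and keep Shift ID in the Visit table as a foreign key.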

Higher Normal Forms

The following are the characteristics of other normal forms:

• A table is in fourth normal form (4NF) when it is in BCNF and there are no multivalued dependencies.

• A table is in fifth normal form (5NF) when it is in 4NF and there are no cyclic dependencies.

It is good practice to apply 4NF or 5NF where applicable.
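A multivalued dependency arises when one column determines two independent sets of values, so the table has to store every combination of them. The following sketch uses a hypothetical table of engineers, skills, and languages, which does not appear in this appendix, to show how such a dependency can be detected in sample data and removed by splitting the table. As with any check on sample rows, a positive result only suggests the dependency.

```python
from itertools import product


def multivalued_dependency_holds(rows, x, y, z):
    """True if, for each x value, the (y, z) pairs form a full cross product."""
    groups = {}
    for row in rows:
        groups.setdefault(row[x], set()).add((row[y], row[z]))
    for pairs in groups.values():
        y_values = {pair[0] for pair in pairs}
        z_values = {pair[1] for pair in pairs}
        if pairs != set(product(y_values, z_values)):
            return False
    return True


# Hypothetical table: skills and languages are independent facts about an engineer.
engineer_table = [
    {"engineer": "Ann", "skill": "Cable", "language": "English"},
    {"engineer": "Ann", "skill": "Cable", "language": "French"},
    {"engineer": "Ann", "skill": "Satellite", "language": "English"},
    {"engineer": "Ann", "skill": "Satellite", "language": "French"},
]

mvd = multivalued_dependency_holds(engineer_table, "engineer", "skill", "language")

# Splitting into two tables stores each independent fact once.
engineer_skill = sorted({(r["engineer"], r["skill"]) for r in engineer_table})
engineer_language = sorted({(r["engineer"], r["language"]) for r in engineer_table})

print(mvd)
print(engineer_skill)
print(engineer_language)
```

The four combined rows shrink to two skill rows and two language rows, and adding a third language for the same engineer then needs one new row instead of one per skill.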

■Note A sixth normal form (6NF) has been suggested, but it’s not widely accepted or implemented yet.

■Numbers and Symbols overview, 302 @ for naming report parameters, 343 purposes of, 323 1NF (first normal form), 506 audits 2NF (second normal form), 505 DQ auditing, 296–298 3NF (third normal form), 506 ETL, defined, 31 4NF (fourth normal form), 507 reports, 332 5NF (fifth normal form), 507 authentication of users, 498 authorization of user access, 498 ■A Auto Build, 385 accounts, security audits of, 499 Auto Layout, 249 action column, 322 autofix action (DQ rules), 296 actions, data quality, 293–296 automating ETL monitoring, 492–493 administration functions ■ data quality monitoring, 495–498 B database management, 499–501 backing up ETL monitoring, 492–495 , 500 schema changes, 501–502 MDBs, 405–408 security management, 498–499 band attribute (Amadeus), 64 updating applications, 503 batch files ADOMD.NET, 412 creating, 138, 157 aggregates. See also summary tables ETL, 269 defined, 415 updating, 15–16 alerts (BI), 437–438 BCNF (Boyce-Codd Normal Form), 506 aligning partition indexes, 166 BI () allow action (DQ rules), 295 alerts, 437–438 Amadeus Entertainment case study. See case analytics applications, 413–416 study (Amadeus Entertainment) application categories, 411 AMO (Analysis Management Objects), 417 Business Intelligence Development Studio Analysis Services (OLAP) Report Wizard, 339 authentication and, 397 applications, 432–437 cubes in, 397 applications. See data mining failover clusters and, 115 applications (BI) partitioned cubes, 119 examples of, 12–13 tools vs. reports, 333 portal applications, 438–439 analytics applications (BI), 413–416 reports, 34, 412–413 applications, updating by DWA, 503 search product vendors, 474 architectures systems, applications for, 17–18 data flow. See data flow architecture binary files, importing, 190 determining, 52 bitmapping, index, 169 system. 
See system architecture design block lists (black lists), 451 association scores, 471 boolean data type (data mining), 419, 420 attributes, customer, 444 bounce rate (e-mail), defined, 447 audio processing, text analytics and, 473 bridge tables, defined, 109 audit bulk copy utility (bcp) SQL command, components of, 323 188–189 event tables, 323 bulk insert SQL command, 187, 189 maintaining, 327 business areas, identifying (Amadeus), 61–62 business case document, 51–52 509 510 ■INDEX

business Intelligence (BI). See BI (Business class attribute (Amadeus), 64 Intelligence) classification algorithm, 422 Business Objects Crystal Report XI, 356 cleaning (CDI), defined, 468 Business Objects XI Release 2 Voyager, 380 cleansing, data, 277–290 business operations, evaluating (Amadeus), click-through rate (email), 98, 447 62–63 clustered configuration, defined, 43 business performance management, 13 clustering algorithm, 422 business requirements Clustering model, 431 CRM data marts (Amadeus), 96 Cognos subscription sales (Amadeus), BI 8 Analysis, 380 90 PowerCube, 377, 379 verifying with functional testing, 480 Powerplay, 356 collation, database, 124 ■C columns calendar date attributes column (date continuous (data mining), 419 dimension), 77–78 cyclical (data mining), 420 campaigns description (data definition table), 305 creating CRM, 447–448 discrete (data mining), 419 defined, 447 discretized (data mining), 419 delivery/response data (CRM), 454–460 ordered (data mining), 420 response selection queries, 449 repeating, 505 results , 99, 450 risk_level column, 322 segmentation (CRM), 18, 98, 447–450 status, 320, 322 candidate PK, 506 storing historical data as, 81 case sensitivity in database configuration, types in DW tables, 306 124 communication case study (Amadeus Entertainment) Communication Subscriptions Fact Table data feasibility study, 67–70 (example), 452 risks, 67 communication_subscription transaction defining functional requirements, 63–65 table (NDS database), 140–143 defining nonfunctional requirements, master table (NDS physical database), 143 65–67 permission, defined, 96 evaluating business operations, 62–63 preferences, defined, 96 extracting Jade data with SSIS, 191–200 subscription, defined, 96 functional testing of data warehouse, 480 comparing data (ETL monitoring), 494–495 identifying business areas, 61–62 complaint rate (email), 98 iterative methodology example, 56–58 conformed dimensions overview of, 44–46 creating (views), 
158 product sales. See product sales data mart defined, 7 (Amadeus) consolidation of data, 5–6 product sales reports, 349, 353, 355, 359, construction iteration, 56 369 content types (data mining), 419–420 query for product sales report, 331 continuous columns (data mining), 419 security testing, 485 control system, ETL, 31 server licenses and, 119 converting data for consolidation, 6 case table, defined (data mining), 418 cookies vs. self-authentication, 464 CDI (Customer Data Integration) covering index, 170 customer data store schema, 469 CRM (customer relationship management) fundamentals, 23–24, 467–468 basics, 14 implementation of, 469 campaign analysis (Amadeus), 64 CET (current extraction time), 182 campaign delivery/response data, change requests, procedures for, 501 454–460 character-based data types, 277 campaign segmentation, 447–450 charting. See also analytics applications (BI), customer analysis, 460–463 440 customer loyalty schemes, 465–466 churn analysis, 465 customer support, 463–464 ■INDEX 511

data marts (Amadeus), 96–101 history, storing, 10–11 fundamentals, 441 integration, defined, 36 permission management, 450–454 leakage, ETL testing and, 187, 479 personalization, 464–465 lineage metadata. See data mapping single customer , 442–447 metadata systems, applications for, 18–19 matching, 6, 277–290 cross-referencing vs. metadata (example), 475 data validation and, 291–292 querying basics, 11 data with external sources, 290–291 reconciliation of (ETL monitoring), cross tab reports, 13 493–495 cubes (multidimensional data stores) retrieval of, 4–5 in Analysis Services, 397 risks, examples of (Amadeus), 67–69 building/deploying, 388–394 scrubbing, 277 Cube Wizard, 385 storage, estimating, 69 defined, 3 transformation, defined, 36 engines, 379 update frequency, 6 reports from, 362–366 data definition metadata scheduling processing with SSIS, 399–404 overview, 301 current extraction time (CET), 318 report columns, 306 customer relationship management (CRM). table, 303 See CRM (customer relationship table DDL, 305 management) customers connecting to source data, 179–180 analysis (CRM), 18, 460–463 ETL. See ETL (Extract, Transform, and attributes, 444 Load) behavior selection queries, 449 extracting e-mails, 191 customer table (NDS physical database), extracting file systems, 187–190 147–151 extracting message queues, 191 Customer Data Integration (CDI). See CDI extracting relational databases. See (Customer Data Integration) extracting relational databases data store schema (CDI), 469 extracting web services, 190 dimension, creating, 133 from flat files, 208–213 dimension, designing, 84–86 memorizing last extraction timestamp, defined, 18 200–207 loyalty schemes (CRM), 18, 465–466 potential problems in, 178 permissions (CRM). 
See permissions, with SSIS, 191–200 management (CRM) from structured files, 177 profitability analysis, 13 from unstructured files, 178 services/support (CRM), 18, 463-464 data feasibility studies cyclical columns (data mining), 420 Amadeus example of, 67–70 populating source system metadata, 317 ■D purpose of, 67 daily batches, 269 data firewall dashboards creating, 215, 218–219 applications (BI), 432–437 defined, 32 data quality, 275 data flow data formatting, 249 architecture vs. data flow architecture, 29 table (ETL process metadata), 318–320 availability, 5 data flow architecture cleansing, 69, 277–290 vs. data architecture, 29 comparing (ETL monitoring), 494 data stores. See data stores consolidation of, 5–6 defined, 29 conversion of, 6 federated data warehouse (FDW), 39–42 defining, 6 fundamentals, 29–33 dictionary, defined, 308 NDS+DDS example, 35–37 hierarchy in dimension tables, 101–102 512 ■INDEX

ODS+DDS example, 38–39 DQ rules table, 321–322 single DDS example, 33–35 DW user table, 321–322 data mapping metadata overview, 302 data flow table, 307 data quality rules overview, 302 data quality metadata and, 320 source column and, 306 defined, 32 data mart fundamentals, 291-293 fact tables and, 74 violations, 496-497 view, 158–159 data stores data mining data lineage between, 307 applications for, 19–20 defined, 30 fundamentals, 14, 19 delivering data with ETL testing, 478 data mining applications (BI) overview, 31–32 column data types, 419–420 types of, 30 creating/processing models, 417–422 data structure metadata demographic analysis example, 424–431 maintaining, 326 implementation steps, 417 overview, 302 processing mining structure, 423–424 populating from SQL Server, 311–313 uses for, 416 purposes of, 308–309 data modeling tables, 309–311 CRM data marts (Amadeus), 96–101 tables with source system metadata, data hierarchy (dimension tables), 314–317 101–102 data types date dimension, 77–80 conversion output for (SSIS), 250 defined, 29 in data mining, 419 designing DDS (Amadeus), 71–76 data warehouses (DW) designing NDS (Amadeus), 106–111 advantages for SCV, 445–447 dimension tables, 76–77 alerts, 437 product sales data mart. See product sales building in multiple iterations, 54 data mart (Amadeus) The Data Warehouse Toolkit (Wiley), 82 SCD, 80–82 defined, 1, 16–17 source system mapping, 102–106 deploying, 53 subscription sales data mart (Amadeus), designing, 52 89–94 development methodology. See system supplier performance data mart development methodology (Amadeus), 94–95 development of, 52 data quality (DQ) DW keys, 109 actions, 293–296 vs. 
front-office transactional system, 5 auditing, 296-298 internal validation, 291 components in DW architecture, 274 major components of, 478 cross-referencing with external sources, MDM relationship to, 23 290–291 migrating to production, 491 data cleansing and matching, 277–290 non-business analytical uses for, 14 Data Quality Business Rules document, operation of, 53 292 populating. See populating data database, defined, 32 warehouses importance of, 273 real-time, 27 logging, 296–298 system components, 4 monitoring by DWA, 495–498 updating data in, 15–16 process, 274–277 uses for, 17 processes, defined, 32 databases reports, 32, 332 collation of, 124 reports and notifications, 298–300 configuring, 123–128 data quality metadata design, data stores and, 7 components of, 320 extracting relational. See extracting DQ notification table, 321–322 relational databases ■INDEX 513

management by DWA, 499–501 delivery MPP systems, 175 campaign delivery/response data, multidimensional. See MDB 454–460 (multidimensional database) channel, defined (CRM), 447 naming, 124 rate (e-mail), defined, 447 restoring backup of, 500 demographic data selection queries servers, sizing, 116–118 (campaigns), 449 SQL Server. See physical (DDS dimension tables), transaction log files, 189 251 DataMirror software, 190 denormalized databases, defined, 30 date dimension dependency network diagrams, 425 fundamentals, 77–80 deploying source system mapping, 104 data warehouses, 53 dates reports, 366–369 data type (data mining), 419–420 description column (data definition table), date/time data types, 278 305 dimension table, creating, 128–132 descriptive analysis excluding in MDM systems, 21 in data mining, 417 format columns (date dimension), 77 defined, 14 DBA (Database Administrator), liaising with, examples of, 460–463 489 determinants, 506 DDL () diagram pane (Query Builder), 337 of data definition table, 303 dicing, defined (analytics), 413 of data mapping table, 307 dimension tables (DDS) for subscription implementation fundamentals, 76–77 (example), 453 loading data into, 250–266 DDS (dimensional data store) dimensional attributes, defined, 76 database, creating new, 501 dimensional data marts, defined, 33 defined, 2, 30 dimensional data store (DDS). See DDS designing (Amadeus), 71–76 (dimensional data store) dimension tables, populating, 215, dimensional databases, defined, 30 250–266 dimensional hierarchy, defined, 101 drill-across dimensional reports, 333 dimensional reports, 332 fact tables, populating, 215, 266–269 dimensions, defined, 3, 377 fundamentals, 7 discrete columns (data mining), 419 vs. 
NDS, 9 discretized columns (data mining), 419 NDS+DDS example, 35–37 disk, defined, 121 ODS+DDS example, 38–39 distributing (CDI), defined, 468 single DDS example, 33–35 Division parameter example, 349–351 single dimension reports, 333 DMX (), 432 sizing, 124, 126–128 DMX SQL Server data mining language, 417 DDS database structure documentation, creating, 489 batch file, creating, 138 documents customer dimension, creating, 133 transforming with text analytics, 471–473 date dimension table, creating, 128–132 unstructured into structured, 471 product dimension, creating, 132 double data type (data mining), 419, 420 Product Sales fact table, 135 DQ (data quality). See data quality (DQ) store dimension, creating, 135 drilling decision trees across, 394 algorithm, 422 up, 414–415 model, 431 DW (data warehouse). See data warehouses decode table (example), 180 (DW) defragmenting database indexes, 500 DWA (data warehouse administrator) degenerate dimensions, defined, 73 functions of, 56, 488–489. See also deletion trigger, 184 administration functions metadata scripts and, 326 dynamic file names, 188 514 ■INDEX

■E exception-based reporting, 492 e-commerce industry exception scenarios (performance testing), customer analysis and, 460–461 484 customer support in, 464 execution, report, 374–375 e-mails external data, NDS populating and, 219, email_address_junction table (NDS 222–223 physical database), 155–156 external notification (ETL monitoring), email_address_table (NDS physical 493–494 database), 153 external sources, cross-referencing data with, email_address_type table (NDS physical 290–291 database), 156–157 Extract, Transform, and Load (ETL). See ETL extracting, 191 (Extract, Transform, and Load) store application, 473 extracting relational databases EII (enterprise information integration), 40 fixed range method, 186 elaboration iteration, defined, 56 incremental extract method, 181–184 ELT (Extract, Load, and Transform) related tables, 186 defined, 5 testing data leaks, 187 ETL and, 117 whole table every time method, 180–181 fundamentals, 175 ■ end-to-end testing F defined, 477 fact constellation schema, 7 fundamentals, 487 fact tables enterprise data warehouse, illustrated, 10 campaign results, 99 Enterprise Edition, SQL Server, 118–119 loading data into (DDS), 250, 266–269 enterprise information integration (EII). See populating DDS, 215, 266–269 EII (enterprise information product sales (Amadeus), 71, 75, 102 integration) subscription sales (Amadeus), 90 entertainment industry, customer support supplier performance (Amadeus), 90 in, 464 failover clusters ETL (Extract, Transform, and Load) defined, 114 batches, 269 number of nodes for, 119 CPU power of server, 116 FDW (federated data warehouse), 39–42 defined, 2, 4 feasibility studies, 51–52 ELT and, 175 federated data warehouse (FDW). 
See FDW extraction from source system, 176–177 (federated data warehouse) fundamentals, 32, 173–174 fibre networks, 115 log, 483 fifth normal form (5NF), 507 monitoring by DWA, 492–495 file names, dynamic, 188 near real-time ETL, 270 file systems, extracting, 187–190 performance testing and, 482 filegroups, 131–132 pulling data from source system, 270 filtering reports, 351 testing, defined, 477–479 financial industry, customer support in, 463 ETL process metadata firewalls components of, 318 creating data, 215, 218–219 overview, 302 ODS, 276 purposes of, 320 first normal form (1NF), 505 tables, 318–320 first subscription date, 274 updating, 327 fiscal attribute columns (date dimension), events 77, 78 defined, 62 fix action (DQ rules), 295 Event Collection (Notification Services), fixed position files, 177 438 fixed range extraction method, 185–186 event tables (audit metadata), 323 flat files, extracting, 187, 208–213 exact matching, 278 forecasting (data mining), 416 Excel, Microsoft, creating reports with, 359–362 ■INDEX 515 foreign keys inception iteration, 56 naming, 146, 157 incoming data validation, 291 necessity of, 137 incremental extraction method, 181–184 fourth normal form (4NF), 507 incremental loading (DDS dimension tables), fragmentation of database indexes, 500 251 frequent-flier programs, 465 incremental methodology. 
See iterative full-text indexing, 126 methodology functional requirements indexes defined, 61 covering index, 170 establishing (Amadeus), 63–65 creating in partitioned tables, 170 functional testing index intersection, 169 defined, 477 Index Wizard, 168 fundamentals, 480 indexer in search applications, 474 fuzzy logic matching, 278 maintaining database, 500 Fuzzy Lookup transformation (example), indexing 279–290 full-text, 126 implementing, 166–170 ■G online index operation, 119 galaxy schemas, 7 parallel index operations, 119 general performance requirements, 483 stage tables, 217 general permissions (CRM), 450 indicator columns (date dimension), 77, 79 Generic Query Designer, 352 inferred dimension members, 260 geographic dispersion of rule violations, 496 infrastructure setup overview, 53 global enterprise currency, 75 Inmon, Bill, 16 Google search products, 474 insert SQL statements, 323 grain, table, 72 insurance industry, customer analysis and, granularity, FDW data and, 39 460 grid pane (Query Builder), 337 integration testing. 
See end-to-end testing grouping reports, 351–355 internal data store, defined, 30 groups, security, 498–499 internal notification (ETL monitoring), 493 internal validation, data warehouse, 291–292 ■H intersection, index, 169 hard RI, defined, 76 invoices, text analytics and, 473 hardware platform (physical database iterative methodology, 54–59 design), 113–119 help desk support, 488 ■J hierarchy Jade system, 45 data, 101–102 junction tables dimensional, 101 defined, 109 MDM, 23 NDS populating and, 219, 225–228 historical data, storing, 10–11 Jupiter ERP system, 44 HOLAP (Hybrid Online Analytical Processing), 381 ■K horizontal/vertical partitioning, 162 key columns (data mining), 419 hot spare disks, 123 key management hubs, MDM, 23 DDS dimension tables and, 251 Hungarian naming conventions, 343 in NDS, 151 hybrid data store, defined, 30 NDS populating and, 219, 223–225 hypercubes, 378 key sequence columns (data mining), 420 Hyperion Essbase, 377, 379 key time columns (data mining), 420 keys, DW, 109 ■I Kimball , 41 IIS logs, 325 Kimball, Ralph, 16, 82 image processing, text analytics and, 473 knowledge discovery, 416. See also data impact analysis, defined, 302 mining inactive accounts, security audits of, 499 516 ■INDEX

■L online analytical processing (OLAP), language attribute table (NDS physical 380–381 database), 145–146 querying, 394–396 last cancellation date, 274 vs. relational databases, 378 last month view, 159 scheduling cube processing with SSIS, last successful extraction time (LSET), 318 399–404 late-arriving dimension rows, 260 security of, 397–399 late-arriving facts, 269 MDBMS (multidimensional database latest summary table, 161 management systems), 379, 415 layouts, report, 340–342 MDDS (multidimensional data store), 377 leakage, defined, 174 MDM () leavers examples of, 20–21 defined, 498 fundamentals, 21–23 updating, 499 OLTP systems and, 22 levels of objects, defined, 63 relationship to data warehouses, 23 licensing models, SQL Server, 119 MDX (Multidimensional Expressions) lift charts, defined (data mining), 430 fundamentals, 435 list selection process (campaigns), 448 MDX Query Designer, 365 loading/query of partitioned tables, 163 membership subscriptions, 452 log files memory maintenance (database database transaction, 189 management), 500 size of, 125 message queue (MQ). See MQ (message logging queue) data quality, 296–298 messaging, defined, 16 ETL log, 483 metadata log reader, database, 176–177 change request (example), 326 SSIS logging, 484 vs. data (example), 475 web logs, 189 database, configuring, 126, 128 logical unit number design, 123 defined, 31 logins for customer ID, 464 maintaining, 325–327 long data type (data mining), 419, 420 overview, 301–303 Lookup transformations, upsert using, reasons for using, 303 236–242 storage, 22 loyalty schemes, customer (CRM), 465–466 types of, 301–302 LSET (last successful extraction time), 182 unstructured data and, 473 methodology, system development. See ■M system development methodology massively parallel processing (MPP) Microsoft database system. 
See MPP (massively Analysis Services, 377, 379 parallel processing) database system clustering algorithm, 428 master data Office SharePoint Server, 438–439 fundamentals, 21 MicroStrategy OLAP Services, 381 management (MDM). See MDM (master migrating data warehouse to production, data management) 487–489, 491 store, defined, 30 mini-batches, defined, 15, 269 storing history of, 11 Mining Structure designer, 421–422 master tables, defined, 36, 106 MOLAP (multidimensional online analytical matching, data, 6, 277–290 processing) matching rules (metadata storage), 22 applications, 415 matrix form (reports), 13, 338, 342 defined, 14, 381 MDB (multidimensional database) monitoring backing up and restoring, 405–408 data quality, 495–498 building/deploying cube, 388–394 ETL processes, 492–495 creating (Amadeus), 381–387 Morris, Henry, 413 defined, 3, 31 fundamentals, 377–379 ■INDEX 517 movers networks, testing security access, 485 defined, 498 NK (natural key). See natural keys updating, 499 NLB (network load balanced) servers, 114 MPP (massively parallel processing) nodes database system columns, defined, 425 defined, 43 defined, 43 fundamentals, 175 nonfunctional requirements MQ (message queue) defined, 61 basics, 16 establishing, 65–67 extracting, 191 normal scenarios (performance testing), 484 failure, simulating, 479 normalization multidimensional data stores (cubes). See defined, 8 cubes (multidimensional data stores) NDS population and, 219, 242–248 multidimensional database (MDB). See MDB normalized databases, defined, 30 (multidimensional database) normalized data store (NDS). See NDS multidimensional online analytical (normalized data store) processing (MOLAP). 
See MOLAP rules, 109, 505–507 (multidimensional online analytical notification processing) column, 320 multiple iterations, building in, 54 data quality, 275, 298–300 to monitor ETL processes, 493 ■N Notification Delivery (Notification naming Services), 438 database, 124 Notification Services, SQL Server, 438 dynamic file names, 188 numerical data types, 278 foreign keys, 146 primary keys, 146 ■O report parameters, 343 OCR (Optical Character Recognition), 471 tables, 137 ODBC (Open Database Connectivity), 412 natural keys ODS ( ) defined, 37, 223 CRM systems and, 18 example, 84 defined, 30 NDS (normalized data store) firewall, 276 customer table (Amadeus), 110 ODS+DDS architecture, configuring, 126 defined, 30 ODS+DDS architecture (example), 38–39 designing (Amadeus), 106–111 reports, 332 fundamentals, 8–10 OLAP (Online Analytical Processing) NDS+DDS example, 35–37 applications. See analytics applications populating, 215, 219–228 (BI) populating with SSIS, 228–235 basics, 14 population, normalization and, 242–248 fundamentals, 380–381 sizing, 124 server cluster hardware, 116 store table example (populating), 242 servers, 379 NDS physical database, creating tools, 333, 356, 380 batch file, 157 OLTP (Online Transaction Processing) communication master table, 143 vs. data warehouse reports, 333 communication_subscription transaction defined, 2 table, 140–143 Online Analytical Processing (OLAP). See customer table, 147–151 OLAP (Online Analytical Processing) email_address_junction table, 155–156 online index operation, 119 email_address_table, 153 Online Transaction Processing (OLTP). See email_address_type table, 156–157 OLTP (Online Transaction language attribute table, 145–146 Processing) order_header table, 151–153 open rate (e-mail), defined, 98, 447 overview, 139 operation, data warehouse (overview), 53 near real-time ETL, 270 operation team, user support and, 488 518 ■INDEX

operational data store (ODS). See ODS NDS, creating physically. See NDS physical (operational data store ) database, creating operational system alerts, 437 partitioning tables. See partitioned tables opting out (permissions), 454 (databases) order column, defined, 318 sizing database server, 116–118 order header table SQL Server, editions of, 118–119 example, 182 SQL Server, licensing of, 119 NDS physical database, 151–153 storage requirements, calculating, ordered columns (data mining), 420 120–123 summary tables, 161 ■P views. See views (database object) package, ETL, defined, 31 PIM (product information management), 22 package table (ETL process metadata), PM (project manager), function of (example), 318–320 56 parallel database system. See MPP (massively populating data warehouses parallel processing) database system data firewall, creating, 215, 218–219 parallel index operations, 119 DDS dimension tables, 215, 250–266 parallel query, defined, 10 DDS fact tables, 266–269 parameters, report ETL batches, 269 Division parameter example, 349–351 NDS, 215, 219–228 naming, 343 NDS with SSIS, 228–235 overview, 342 near real-time ETL, 270 Quarter parameter example, 346–348 normalization, 242–248 Year parameter example, 345–346 overview, 215 partition indexes, aligning, 166 pushing data approach, 270–271 partitioned cubes, 119 SSIS practical tips, 249–250 partitioned tables (databases) stage loading, 215, 216–217 administering, 166 upsert using Lookup transformation, 236 creating indexes in, 170 upsert using SQL statements, 235–236 loading/query of partitioned tables, 163 portals maintenance of, 500 applications (BI), 438–439 Subscription Sales fact table example, 162, creating data warehouse, 489 163–166 post office organizations, 290 vertical/horizontal partitioning, 162 Prediction Query Builder, 417 partitioning, table and index, 118 predictive analysis patches, security, 498 basics, 13 per-processor licenses (SQL Server), 119 customer analysis (example), 461 
performance in data mining, 416 requirements, 483 defined, 14 testing, defined, 477 PredictProbability function, 432 testing, fundamentals, 482–484 primary keys, naming, 146 periodic snapshots processes defined, 11 data quality, 274–277 fact table, 90, 269 ETL, 31 periodic updating of data, 6 mining structure (data mining), 423–424 permissions ProClarity Analytics 6, 380 management (CRM), 18, 450–454 product data, MDM systems and, 21–22 selection queries, 449 product dimension personalization (CRM), 18, 464–465 creating, 83–84, 132 physical database design source system mapping, 105 configuring databases, 123–128 product information management (PIM). See DDS database structure, creating. See DDS PIM (product information database structure management) hardware platform, 113–119 product sales data mart (Amadeus) indexing, 166–170 analysis of product sales, 63 customer dimension, 84–86 ■INDEX 519

    date dimension, 77–80
    fact tables, 71, 75
    product dimension, 83–84
    sales taxes, 73
    source system logic, 73
    source system mapping, 103
    store dimension, 86–87
production environment, migrating DW to, 487–489
profitability band attribute (Amadeus), 64
project management, 53
pull approach (updating), 16, 22
purchase orders, 62
purchase pattern table (data mining), 418
pushing data approach
    for populating DW, 270–271
    updating with, 16, 22
■Q
QA (Quality Assurance) in DW, 46
Quarter parameter example, 346–348
querying
    data, 11
    MDBs, 394–396
    Query Builder, 244, 337
    Query Execution Plan, 168
    recursive queries, defined, 308
■R
RAID (Redundant Array of Inexpensive Disks)
    definition and configurations, 121
    RAID 5 volumes, 122
ranking algorithms, 474
RCD (rapidly changing dimension), 82
real-time data integration, 271
real-time data warehouse
    fundamentals, 27
    updates from key tables, 15
recipient_type table, 322
reconciliation, to monitor ETL processes, 493–495
recoverability
    defined, 174
    ETL testing and, 479
recovery model, 125
recursive queries, 308
, 75
refresh frequency, 313
reject action (DQ rules), 294–295
relational databases
    analytics and, 415
    defined, 30
    extracting. See extracting relational databases
relational online analytical processing (ROLAP). See ROLAP (relational online analytical processing)
reliability DQ key, 277
repeating columns, 505
reports
    BI, 412–413
    creating with Excel, 359–362
    creating with report wizard, 334–340
    data quality, 275, 298–300
    deploying, 366–369
    dimensional, 332
    execution, managing, 374–375
    filtering, 351
    formatting cells, 341
    fundamentals, 13
    grouping, 351–355
    layout of, 340–342
    from multidimensional data stores, 362–366
    OLAP tools vs. data warehouse, 333
    OLTP vs. data warehouse, 333
    overview, 329–332
    parameters. See parameters, report
    report columns (data definition table), 306
    Report Manager, 366–367
    report server scale-out deployment, 118
    Reporting Services SharePoint web parts, 439
    search, 475
    security, managing, 370–372
    simplicity vs. complexity of, 356–357
    sorting, 351, 354
    spreadsheets, 357–362
    subscriptions, managing, 372–374
    types of, 332–333
requests, change, 501
requirements, determining user, 52
response data. See campaigns, delivery/response data (CRM)
restoring MDBs, 405–408
retrieval of data, 4–5
retriever (search applications), 474
retrieving (CDI), defined, 468
revenue analysis, 465
revoking permissions, 454
risk_level column, 322
ROLAP (relational online analytical processing)
    applications, 415
    defined, 14, 380
roles
    defined, 63
    security, 498
Ross, Margy, 82
rows, storing historical data as, 80

rules
    data quality, 291–293
    DQ, adjusting, 496–497
    normalization, 109, 505–507
    rule-based logic, 278
    rule category table, 322
    rule risk level table, 322
    rule (SQL Server keyword), 322
    rule type table, 322
RUP (Rational Unified Process) methodology, 56
■S
sales taxes, 73
SAN (storage area network), 115
scale-out deployment, 115
scanned documents, text analytics and, 473
SCD (slowly changing dimension)
    DDS dimension tables and, 251–265
    defined, 11
    fundamentals, 80–82
    Slowly Changing Dimension Wizard (SSIS), 228–230
schemas, database
    design for campaign delivery/response data, 457–459
    managing changes to, 501–502
    snowflake, 7, 89
    updating, 501
scoring routines (search applications), 474
scripts, metadata, 326
scrubbing, data, 277
SCV (single customer view), 442–447
SDLC (system development life cycle). See system development methodology
searching
    fundamentals, 25–26
    search facilities, 474
    search interface, 475
second normal form (2NF), 505
security
    groups, defined, 498
    management by DWA, 498–499
    of MDBs, 397–399
    report, managing, 370–372
    testing, defined, 477
    testing, fundamentals, 485–486
segmentation
    algorithm, 422
    campaign (CRM), 447–450
selection queries (campaigns), 448
self-authentication vs. cookies, 464
semiadditive aggregate functions, 119
semistructured files, defined, 178
Send Mail tasks (SSIS), 493
sequential methodology. See waterfall methodology
server+CAL licenses (SQL Server), 119
servers, sizing database, 116–118
service-oriented architecture (SOA). See SOA (service-oriented architecture)
share nothing architecture, 175
SharePoint Server, Microsoft Office, 438–439
Simon, Alan, 17
single customer view, 18, 442–447
single DDS architecture example, 33–35
single login requirement, 70
sizing database servers, 116–118
SK (surrogate key), defined, 223
slicing, defined (analytics), 413
slowly changing dimension (SCD). See SCD (slowly changing dimension)
smalldatetime, 131
SMP (symmetric multiprocessing) database system, 43
snapshots
    defined, 11
    report output, 374
snowflake schemas
    basics, 7
    benefits of, 89
SOA (service-oriented architecture), 26–27
soft deletes (records), 184
sorting reports, 351, 354
source data
    connecting to, 179–180
    profiles, 317–318
source system metadata
    overview, 302
    populating, 317
    purposes of, 313
    source data profiles, 317–318
    table components of, 314–317
source systems
    analysis. See data feasibility studies
    functional testing and, 481–482
    logic, replicating, 72
    mapping, 102–106
    moving data out of, 176
    pushing data from, 270–271
    querying, 12
spam verdict (e-mail), defined, 98
specific performance requirements, 483
specific permissions (CRM), 450
specific store view, 160
spiral methodology. See iterative methodology
spreadsheets (reports), 357–362
SQL (Structured Query Language)
    Native Client driver, 412
    queries, exploring data with, 357

    query formatting, 352
    statements, upsert using, 235–236
SQL Server
    Analysis Services 2005, 356
    Configuration Manager, 367
    databases, design of. See physical database design
    Enterprise Edition, 118–119
    licensing, 119
    Management Studio, 232, 363
    Notification Services, 438
    object catalog views, 311–313, 326
    Profiler, performance testing and, 484
    Reporting Services. See SSRS (SQL Server Reporting Services)
    system views, naming, 322
SSAS (SQL Server Analysis Services)
    data mining in, 20
    KPIs and, 434–437
    as OLAP tool, 380
SSIS (SQL Server Integration Services)
    data extraction with, 191–200
    failover clusters and, 115
    logging, 484
    packages, simulating incremental load with, 70
    populating dimension table with, 251–265
    populating NDS with, 228–235
    practical tips, 249–250
    scheduling cube processing with, 399–404
    Send Mail tasks in, 493
SSRS (SQL Server Reporting Services)
    building reports with, 329–330
    charts and tables with, 412
    DQ reports and, 299
    NLB servers and, 114
    report security and, 370–371
    scheduling package (example), 403–404
stage data store
    defined, 30
    fundamentals, 33
stage loading (populating DW), 215–217
star schemas, 7, 89
statistical analysis, 13
status columns, 320, 322
status of objects, defined, 62
status table (ETL process metadata), 318–320
steps, ETL, defined, 31
storage of data
    calculating database requirements, 120, 123
    customer data, 468
    estimating, 69
    unstructured data, 470
store dimension
    creating, 135
    designing (Amadeus), 86–87
structured data, defined, 470
structured files, extracting, 177
subscribers
    subscriber class attribute (Amadeus), 64
    subscriber profitability, analyzing (Amadeus), 64
subscriptions
    Communication Subscriptions Fact Table (example), 452
    managing report, 372–374
    membership, 452
    permissions (CRM), 451
    sales, analyzing (Amadeus), 63
    sales data mart (Amadeus), 89–94
    sales fact table (partitioning), 163–166
    Subscription Management (Notification Services), 438
    Subscription Processing (Notification Services), 438
    Subscription Sales fact table (partitioning), 162
summary tables
    application performance and, 484
    fundamentals, 161
supplier performance
    analyzing (Amadeus), 64
    data mart (Amadeus), 94–95
SupplyNet system, 44
support, types of user, 53
surrogate keys, defined, 37
survivorship rules (metadata storage), 22
symmetric multiprocessing (SMP) database system. See SMP (symmetric multiprocessing) database system
sys.dm_db_index_physical_stats dynamic management function, 500
system architecture design, 42–44
system development methodology
    defined, 49
    iterative methodology, 54–59
    waterfall methodology, 49–53
systematic comparisons (ETL monitoring), 495
system_user SQL variable, 325
■T
table grain, defined, 72
table partitioning
    defined, 10
    maintenance of, 500
tables
    column types in DW, 306
    data definition metadata, 303

    data mapping, 307
    data quality audit. See audits, DQ auditing
    data quality metadata, 321–322
    data structure metadata, 309–311, 314–317
    DDL of data definition, 305
    DDL of data mapping, 306
    ETL process metadata, 318–320
    loading DDS fact, 215, 250, 266–269
    log, data quality. See logging
    naming, 137
    normalization rules and, 505–507
    populating DDS dimension, 215, 250–266
    source system metadata, 314–317
    structure of stage, 216
    updating related, 186
    usage log (usage metadata), 325
    whole table every time extraction method, 180–181
tabular data, defined, 178
tabular report (example), 330
telecommunications industry
    customer analysis and, 460
    customer support in, 463
testing
    data leaks, 187
    database restore, 500
    end-to-end testing, 487
    ETL testing, 478–479
    functional testing, 480
    performance testing, 482–484
    security testing, 485–486
    types of, 477–478
    user acceptance testing (UAT), 477, 486–487
    waterfall methodology and, 52
text analytics
    for recruitment industry, 471
    transforming documents with, 471–473
text data type (data mining), 419, 420
third normal form (3NF), 506
time
    consolidating data with different ranges, 5
    excluding in MDM systems, 21
timestamps
    memorizing last extraction, 200–207
    reliable, 182
transactions
    database transaction log files, 189
    Transact SQL script, 326
    transactional systems, 5, 12
    transaction fact table, defined, 90
    transaction tables, defined, 36, 106
transition iteration, 56
trap hit rate (email), 98
travel industry
    customer analysis and, 461
    customer support in, 463
treatment, defined (campaigns), 448
Trend expression (MDX), 435
triggers
    database, 176
    detecting updates and inserts with, 184
    update, 184
■U
Unicode, 131
unknown records, defined, 233
unsegmented campaigns, 448
unstructured data
    defined, 470
    fundamentals, 24–25
    metadata and, 473
    search facilities and, 475
    storing, 470
    text analytics and, 471–473
unstructured files, extracting, 178
update triggers, 184
updating
    applications, 503
    batch data, 15–16
    customer data store, 468
    data warehouse schemas, 501–502
    ETL process metadata, 327
    periodic data, 6
upsert
    using Lookup transformation, 236–242
    using SQL statements, 235–236
usage metadata
    maintaining, 327
    overview, 302
    purposes of, 324–325
    usage log table, 325
usage reports, 332
user acceptance testing (UAT)
    defined, 477
    fundamentals, 486–487
user-facing data store, defined, 30
users
    authentication of, 498
    authorizing access of, 498
    interface, search facility and, 475
    training, 489
utilities industry
    customer analysis and, 460
    customer support in, 463
■V
validations, types of, 291
VAT (value-added tax), 73
vertical/horizontal partitioning, 162
views (database object)
    conform dimensions, creating, 158
    data mart view, 158–159
    defined, 157
    increasing availability with, 160–161
    last month view, 159
    purposes of, 157
    specific store view, 160
    virtual layers, creating, 158
virtual layers, creating (views), 158
volume, disk, 121
■W
waste management, customer analysis and, 461
waterfall methodology, 49–53
web analytics, 15
web logs, 189
web parts (SharePoint), defined, 438
web services, extracting, 190
WebTower9 system, 44
whole table every time extraction method, 180–181
Windows 2003 R2 Datacenter Edition, 118
Windows 2003 R2 Enterprise Edition (EE), 118
■X
XML files as source data, 190
XMLA (XML for Analysis)
    accessing MDBs with, 412
    connecting to MDBMS with, 379
    processing mining models with, 417
    scripts, backing up MDBs with, 406–408
■Y
Year parameter example, 345–346