Managing Quality, Transformations, and Loading Using an XML-Driven Engine

Bill SAVAGE, Ying QIN, Michal ZMUDA, David HALL, Laxminarayana GANAPATHI, Sheping LI
RTI International, 3040 Cornwallis Road, Research Triangle Park, NC 27709
{bsavage, yingqin, mzmuda, dhall, lganapathi, shepli}@rti.org
and
Atif HASAN, Thomas BAKER
Alpha-Gamma Technologies, Inc., 4700 Falls of Neuse Road, Suite 350, Raleigh, NC 27609
{tbaker, ahasan}@alpha-gamma.com

ABSTRACT
Computer systems commonly require data to be loaded into a database from external sources. Incoming data must be examined to ensure they meet acceptable quality levels, are complete, and conform to required formats. Data must often be integrated from multiple sources, increasing the complexity of the process. Data validation and loading processes have traditionally been referred to as Extract-Transform-Load (ETL), alternately known as data integration. In this paper we present a new data integration system developed to assemble, validate, and load data, with support for complex and diverse validation requirements. This reusable, customizable system offers improvements over many systems in several areas. The system is based on Extensible Markup Language (XML) constructs that define: data structures in use by multiple submitting sources (data dictionaries); complex validation rules; mappings to target objects; and target object structures. Java program components implement a self-contained data integration engine that can be deployed as a standalone system or embedded within a web site.

Keywords: Data Dictionary, Data Integration, Data Validation, ETL, Java, XML.

1. INTRODUCTION
Many computer systems depend on a process to load data into a database from external systems or sources. Incoming data must be assembled, checked for completeness and conformance to required formatting, and finally checked to ensure they meet defined quality specifications. If the data must be integrated from multiple heterogeneous sources, processing complexity is significantly increased. The data validation and loading process is often referred to as Extract-Transform-Load (ETL) or data integration. We have created a new data integration system that supports complex source assemblies and validation requirements, using standard, open source technologies.

In 2005, RTI International (RTI) began developing the National Cancer Institute's Informatics Support Center to support the Breast and Colon Cancer Family Registries program. A core component of the Support Center is the data integration process that loads monthly data submissions from multiple cancer research centers in Australia, Canada, and the United States. Submissions contain study data defined by multiple data dictionaries that contain listings of all data elements contained within each data file, including name, description, length, data type, and often complex validation rules. The data are transmitted in multiple file formats such as fixed fields, comma-separated values (CSV), and tab delimited. Every submitted data file must be fully documented, validated, tracked, and then loaded into an Oracle 10g database to support further analysis and research.

We discovered that the submitted data records, the validation requirements, and even the file structures are subject to frequent and sometimes unpredictable change. The dynamic nature of the situation informed the RTI team's decision to implement a data integration solution based on easily configurable data dictionary specifications, including validation rules, that drive the data integration process.

The team determined that machine-readable specifications would not only support the complex needs of the process, but would also enable generation of user documentation or other forms for use by external systems. This dynamic, configurable approach is most responsive to the needs of the project, including the evolution of study data and validation requirements across and within the centers that submit the data.

To meet the data integration requirements, the team had to consider either developing a solution based on an existing data integration package or developing a custom solution using open source technologies where possible.

2. EVALUATION OF TECHNOLOGIES
The development team identified and investigated existing software that it could potentially use, considering both open source and commercial packages. The review process was informal and involved several team members. The primary evaluation criteria included requirements for:

· a highly configurable system;
· the ability to plug into a Java-based web portal;
· response time suitable for interactive sessions;
· operation based on a data dictionary specified in a standard format/language;
· reading fixed length and delimited file formats; and
· supporting both standard and highly customized types of data validation.

Leading commercial systems such as IBM's DataStage [1] and Informatica's PowerCenter [2] were considered. Open source systems such as Enhydra Octopus [3], clover.ETL [4], and Xineo XIL [5] were also reviewed. We determined that currently available open source systems often lacked the ability to handle fixed-length file formats and sufficiently customized data validation. Commercial systems were generally unable to plug into Java in the required manner. For some, the client software was only available for the Windows operating system, and the software was, in general, prohibitively expensive.

Based on these and other considerations, we decided to develop a new data integration package with XML utilized throughout for data dictionary specifications, validation rules, and source-to-target mappings. XML was selected based on its wide acceptance throughout the computing industry and its flexibility in supporting many different types of specifications.

The selection of XML as the runtime specification medium recognizes its suitability and wide acceptance for use in such situations. As noted in [6], XML supports document structures that can be nested to any level of complexity. This makes it especially suitable for representing complex validation rules, which are sometimes best represented as hierarchies of assertions.

The system design placed heavy emphasis on the use of metadata to specify many aspects of process behavior. This mirrors many successful systems that specify metadata, especially using XML, to document the structure and meaning of data and their inter-relationships, as discussed in [7], [8], [9]. The system implementation relies on XML-based metadata constructs for expressing and driving integration processes in heterogeneous database environments, as discussed in [10], [11], [12].

Java was chosen as the implementation language because of its use within similar RTI projects, its ability to handle complex data loading and validation, and its suitability for direct connectivity within the project web portal. Because Java is platform independent, the package can be deployed to different operating environments if needed.

3. SYSTEM DESIGN & DEVELOPMENT
Throughout development, the team followed industry-standard procedures to manage development in an environment with changing requirements. Data model development utilized Computer Associates' ERwin® Data Modeler product. The open source Concurrent Versions System (CVS) was chosen as the source control system, and the open source Bugzilla system was used to manage and track issues. Software testing used Selenium, JUnit, and DbUnit.

Frameworks for the data integration engine include J2SE 1.5, Spring, and Hibernate. Spring is an increasingly important Java integration technology, often the choice for stand-alone and web applications. It has many architectural benefits, including effectively organizing middle-tier objects, reducing the proliferation of singletons, eliminating the need to use a variety of custom properties file formats, and using a consistent XML-based configuration approach. Hibernate is a high performance object-relational persistence and query service.
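
Spring's XML bean definitions are one example of this configuration style. A minimal sketch of how the engine's main components might be wired together declaratively is shown below; all bean identifiers and class names here are illustrative rather than taken from the actual system:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative Spring wiring for a data integration engine -->
  <beans xmlns="http://www.springframework.org/schema/beans"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.springframework.org/schema/beans
             http://www.springframework.org/schema/beans/spring-beans-2.0.xsd">

    <!-- Parses .csv, .txt, or .dat source files into an in-memory structure -->
    <bean id="sourceFileParser" class="example.engine.DelimitedFileParser"/>

    <!-- Applies the validation rules defined in the XML data dictionary -->
    <bean id="validationService" class="example.engine.ValidationService"/>

    <!-- Loads validated records into the target database tables -->
    <bean id="databaseModule" class="example.engine.DatabaseModule">
      <property name="parser" ref="sourceFileParser"/>
      <property name="validator" ref="validationService"/>
    </bean>
  </beans>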

Data Integration Engine Processing
Our design decouples the system's base functionality code from the specific rules concerning data loading and validation constraints. We implemented this design using XML files of our design. These files govern engine operation and provide flexibility to handle changing requirements. The formats of data sources and targets, data file parsing information, source-to-target mappings, and the data validation and error reporting information are all defined in XML.

The engine first reads all available XML files into memory as a singleton object named XMLTemplate. As source files are read into memory, they are handled by the appropriate parser according to the extension of the input file name (.csv, .txt, or .dat). The parsed data are stored in memory as a two-dimensional array. Methods of the XMLTemplate object are then called to perform data conversion and data validation. Data are then stored in an object and passed to the database module.

The database module acquires both the data and the loading instructions stored in the XMLTemplate object, and then uses this information to update existing records, insert new records, or mark records as no longer active in the database.

When records are inserted or updated, an MD5 checksum is generated from the new record values and stored with the record. The system uses the checksum values to determine whether a submitted record represents a change from the existing record; this also supports generation of file submission statistics. Database triggers enable change tracking by inserting the prior record version into audit tables when data records are updated or deleted.

Figure 1 shows the high-level architecture of the system.
Figure 1 – System Architecture

Metadata Usage
The data integration engine is driven by metadata in XML form, but the master source of metadata is stored within the system database. Data dictionary metadata within the database is used to generate the instance of the XML template information for engine use at runtime. The entire XML specification is loaded into memory, thereby maximizing performance.

Tools were created to convert data dictionaries in both directions between XML and the database format. These tools can be used to generate the runtime XML from the database source, or to synchronize the database source with the XML if needed.

Web portal components have also been developed using XSL Transformations to generate human-readable data dictionaries identical to the data specifications that drive the data integration processes.
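
As an illustration of this technique, a small XSLT stylesheet can render each dictionary entry as a row of an HTML table. The sketch below assumes an illustrative dictionary vocabulary (dataSource and dataElement elements with name, type, position, and description attributes) rather than the system's exact schema:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative stylesheet: renders data dictionary entries as an HTML table -->
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/dataSource">
      <html>
        <body>
          <h2>Data Dictionary: <xsl:value-of select="@name"/></h2>
          <table border="1">
            <tr><th>Field</th><th>Type</th><th>Position</th><th>Description</th></tr>
            <!-- One row per data element defined in the dictionary -->
            <xsl:for-each select="dataElement">
              <tr>
                <td><xsl:value-of select="@name"/></td>
                <td><xsl:value-of select="@type"/></td>
                <td><xsl:value-of select="@position"/></td>
                <td><xsl:value-of select="@description"/></td>
              </tr>
            </xsl:for-each>
          </table>
        </body>
      </html>
    </xsl:template>
  </xsl:stylesheet>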

Data Model
The data model subject area shown in Figure 2 presents a subset of the database structure. It holds the metadata that describes source and target data structures, validation, and source-to-target mappings.

Figure 2 – Entity-Relationship Diagram: Metadata

The data model includes a variety of information required to support the engine. It is used to generate the physical database objects, including subject areas for:
· metadata that defines the master copy of the data dictionaries;
· loading study data into staging and "clean" areas;
· file submission tracking and error handling; and
· full audit tracking of changed records over time.

For this project, source data structures are flat files. The metadata is a representation of the data dictionaries for these files, including validation definitions. A data structure in this model can also be a table within a database, which is the case for the targets. Therefore, the DD_File entity in Figure 2 is used to store both the metadata for the project source data file structures and the metadata describing the target tables that receive the data through the data loading process.

XML Specifications
A very basic sample of the XML is shown in Figure 3. This sample illustrates the structure of a simple data source (FILE1) with a single data element. Each data element includes several attributes, such as its position within the file, data type, description, and validation rules (omitted from this example for clarity).

Figure 3 - Basic XML Template
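
To suggest the general shape of such a template, a data source definition of this kind might be written as follows; the element names, attribute names, and the field shown are illustrative rather than the system's exact vocabulary:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative data dictionary template for one source file -->
  <dataSource name="FILE1" format="fixed">
    <!-- A single data element with its position, length, type, and description;
         validation rules would be nested here but are omitted -->
    <dataElement name="NUM_EVENTS"
                 position="1"
                 length="2"
                 type="numeric"
                 description="Number of events"/>
  </dataSource>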

Validator Objects
Data validation rules are specified using the XML-based language developed for this system. This language supports validation rules of effectively unlimited depth and complexity. Several types of validation are supported:
1) Conversion checking ensures that data values in each field can be converted to the data type defined in the data dictionary.
2) …
3) Allowable values ensures that data fields contain only values allowed by the data dictionary.
4) Inter-field validation ensures that records follow rules specifying proper relationships across fields within a single record in the same file.
5) Inter-file validation ensures that records follow rules specifying proper relationships across different data sources.

Figure 4 shows the definition of an allowable values validator.

Figure 4 - Allowable Values Validator
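
An allowable values validator of this kind might combine a permitted numeric range with an additional discrete code, for example a count that must be 0-10 or the code 99. In the sketch below the element and attribute names are illustrative rather than the system's exact vocabulary:

  <!-- Illustrative allowable values validator: field must be 0-10 (inclusive) or 99 -->
  <allowableValues field="NUM_EVENTS"
                   msgSeverity="E"
                   msgDesc="must be 0-10 or 99">
    <!-- A continuous range of permitted values -->
    <range from="0" to="10" fromInclusive="true" toInclusive="true"/>
    <!-- A discrete code permitted outside the range -->
    <value code="99"/>
  </allowableValues>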

Figure 5 shows the definition of an inter-file validator. This validator retrieves the records in another table that reference the primary key field of the record currently undergoing validation, and compares the number of retrieved records with the NUM_CHILDREN value in the checked individual record for consistency. The test fails if the number of retrieved records differs from the expected number specified by the NUM_CHILDREN value.

Figure 5 - Inter-File Validator
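
A validator of this kind can be expressed in the XML by naming the Java class that implements the check together with the message attributes used to report the result. In the sketch below, the wrapper element and the parameter names are illustrative assumptions:

  <!-- Illustrative inter-file validator: compares a child-record count across files -->
  <validator class="validation.InterFileValidator"
             method="pass"
             msgVol="1"
             msgCode="100"
             msgType="F"
             msgDesc="Child count matches record count.">
    <!-- Hypothetical parameters: count the child records that reference this record's
         primary key and compare the count with the NUM_CHILDREN field value -->
    <param name="childTable" value="EVENT"/>
    <param name="foreignKey" value="PARENT_ID"/>
    <param name="expectedCountField" value="NUM_CHILDREN"/>
  </validator>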

4. SYSTEM BENEFITS

Plug-in Capability
The data integration engine can be directly plugged into a Java web application using J2EE. This allows end-users to upload data files through the web application, with the engine running transparently on the back end for data file parsing, validation, and loading.

The data integration engine could also be packaged for deployment as part of a stand-alone non-web application; this effort is underway as the basis for creating an open source project. Compiled Java classes, configuration files, XML files, and third-party libraries could be packaged into a jar file for distribution. To run the executable jar file, users would adjust the XML file according to their input data files, validation requirements, and target database structures. Invoking the main method class would initiate the data integration process.

Simple Design
The system data dictionary uses an open XML format that allows straightforward specification of data loading and validation requirements. In the XML file, we define data fields for an input source file and the database target for each field, including database table and column names. In addition, the XML includes metadata information for each field, such as field name, description, length, and data type. When a new table is created or a modification is needed for an existing table, we can simply modify the centralized XML file to specify the new requirements.

Robust Validation
Data validation requirements can be very complex, but they generally are easily defined using the XML language developed for validation across data fields and data sources. The engine interprets the validation in XML format and applies the validation using an efficient algorithm. There are no design limits to the complexity and depth of validation that can be written using this approach. Data validation can vary from simple (such as data type validation) to complex (such as validation across multiple tables using different field values).

Flexible Reuse
Flexible design and ease of configuration have allowed this software to be used in another project within RTI that required a similar data integration solution. To use the engine in a new project, developers can configure it by changing the database connection string and the source data directory in a configuration file, replicating the data dictionary in an XML file, and then adding data validation specifications to the XML file as needed. Reusability of the engine makes the development life cycle much more efficient for projects with similar needs.

5. CONCLUSIONS
The data integration system developed for the Informatics Support Center has been very successful in meeting the initial requirements. Furthermore, the architecture and implementation have provided the desired flexibility to support additional or changed requirements as the project develops.

Using XML as the specification language has proved to be a fundamental enabling characteristic of the system. As new modules have been developed, file structures changed or added, or additional data validation requirements identified, the resulting modifications to the schema, XML, and data integration engine code have not incurred significant costs or required design changes.

Due to the flexible and robust nature of the system, it is now suitable for use by other projects that require similar data integration solutions. To date, the system has been successfully deployed within one separate project at RTI, and other projects are evaluating its use.

The system is currently being further developed for possible release as an open source project. Plans include enhancing the system packaging and creating additional system documentation for a public release. Additional proposed development steps include:
· developing a user interface to maintain data dictionary specifications within the system database (instead of manually editing the XML files);
· completing the creation of testing components and an automated test environment (to support continuous integration);
· adding a caching mechanism to support very large data sets (data from a source file are currently held in memory); and
· repackaging the system for release as an open source project.

We encourage other researchers investigating data integration solutions to consider this new system.

6. REFERENCES
[1] IBM WebSphere DataStage. http://www-306.ibm.com/software/data/integration/datastage/
[2] Informatica PowerCenter. http://www.informatica.com/products/powercenter/
[3] Enhydra Octopus. http://www.enhydra.org/tech/octopus/
[4] clover.ETL. http://www.cloveretl.org/
[5] Xineo XIL. http://www.xineo.net/software.jspx
[6] J. Bosak, "XML, Java, and the Future of the Web", World Wide Web Journal, 2(4), 1997.
[7] M. Jarke, M. Jeusfeld, C. Quix, and P. Vassiliadis, "Architecture and quality in data warehouses: An extended repository approach", Information Systems, July 1999.
[8] I. Varlamis and M. Vazirgiannis, "Bridging XML-schema and relational databases: a system for generating and manipulating relational databases using valid XML documents", Proc. of ACM Symposium on Document Engineering, Atlanta, USA, Nov 2001.
[9] L. Seligman and A. Rosenthal, "The Impact of XML on Databases and Data Sharing", IEEE Computer, June 2001.
[10] H. Fan and A. Poulovassilis, "Using AutoMed metadata in data warehousing environments", Proc. Int. Workshop on Data Warehousing and OLAP (DOLAP'03), New Orleans, November 2003.
[11] R. Fagin, P. Kolaitis, L. Popa, and W.-C. Tan, "Composing schema mappings: Second-order dependencies to the rescue", Proc. of PODS, 2004.
[12] R. Goertzen and J. Stausberg, "A grammar of integrity constraints in medical documentation systems", Computer Methods & Programs in Biomedicine, Vol. 86, Issue 1, Apr 2007, pp. 93-102.