Managing Data Quality, Transformations, and Loading Using an XML-Driven Engine
Bill SAVAGE, Ying QIN, Michal ZMUDA, David HALL, Laxminarayana GANAPATHI, Sheping LI
RTI International, 3040 Cornwallis Road, Research Triangle Park, NC 27709
{bsavage, yingqin, mzmuda, dhall, lganapathi, shepli}@rti.org

and

Atif HASAN, Thomas BAKER
Alpha-Gamma Technologies, Inc., 4700 Falls of Neuse Road, Suite 350, Raleigh, NC 27609
{tbaker, ahasan}@alpha-gamma.com

ABSTRACT

Computer systems commonly require data to be loaded into a database from external sources. Incoming data must be examined to ensure they meet acceptable quality levels, are complete, and conform to required formats. Data must often be integrated from multiple sources, increasing the complexity of the process. Data validation and loading processes have traditionally been referred to as Extract-Transform-Load (ETL), alternatively known as data integration. In this paper we present a new data integration system developed to assemble, validate, and load data, with support for complex and diverse validation requirements. This reusable, customizable system offers improvements over many systems in several areas. The system is based on Extensible Markup Language (XML) constructs that define: data structures in use by multiple submitting sources (data dictionaries); complex validation rules; mappings to target objects; and target object structures. Java program components implement a self-contained data integration engine that can be a standalone system or embedded within a web site.

Keywords: Data Dictionary, Data Integration, Data Validation, ETL, Java, XML.

1. INTRODUCTION

Many computer systems depend on a process to load data into a database from external systems or sources. Incoming data must be assembled, checked for completeness and conformance to required formatting, and finally checked to ensure they meet defined quality specifications. If the data must be integrated from multiple heterogeneous sources, processing complexity is significantly increased. The data validation and loading process is often referred to as Extract-Transform-Load (ETL) or data integration. We have created a new data integration system that supports complex source assemblies and validation requirements, using standard, open source technologies.

In 2005, RTI International (RTI) began developing the National Cancer Institute's Informatics Support Center to support the Breast and Colon Cancer Family Registries program. A core component of the Support Center is the data integration process that loads monthly data submissions from multiple cancer research centers in Australia, Canada, and the United States. Submissions contain study data defined by multiple data dictionaries that list all data elements contained within each data file, including name, description, length, data type, and often complex validation rules. The data are transmitted in multiple file formats such as fixed fields, comma-separated values (CSV), and tab delimited. Every submitted data file must be fully documented, validated, tracked, and then loaded into an Oracle 10g database to support further analysis and research.

We discovered that the submitted data records, the validation requirements, and even the file structures are subject to frequent and sometimes unpredictable change. The dynamic nature of the situation informed the RTI team's decision to implement a data integration solution based on easily configurable data dictionary specifications, including validation rules, that drive the data integration process.

The team determined that machine-readable specifications would not only support the complex needs of the process, but would also enable generation of user documentation or other forms for use by external systems. This dynamic, configurable approach is most responsive to all needs of the project, including the evolution of study data and validation requirements across and within the centers that submit the data.

To meet the data integration requirements, the team had to consider either developing a solution based on an existing data integration package or developing a custom solution using open source technologies where possible.
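To illustrate what such a machine-readable data dictionary entry can drive, the following minimal Java sketch models a single data element and the simplest checks it could support. The class and field names are illustrative only; the actual dictionary schema used by the Support Center is not reproduced here.

    // Illustrative sketch only: every class and field name below is an
    // assumption, not the schema used by the Support Center.
    public class DictionaryField {

        private final String name;         // e.g. "DATE_OF_BIRTH"
        private final String description;  // human-readable description
        private final int length;          // maximum field length
        private final String dataType;     // e.g. "DATE", "NUMBER", "TEXT"
        private final String validValues;  // optional regular expression

        public DictionaryField(String name, String description, int length,
                               String dataType, String validValues) {
            this.name = name;
            this.description = description;
            this.length = length;
            this.dataType = dataType;
            this.validValues = validValues;
        }

        // Applies the simplest checks a dictionary entry could drive:
        // maximum length and an optional pattern of valid values.
        public boolean isValid(String rawValue) {
            if (rawValue == null || rawValue.length() > length) {
                return false;
            }
            return validValues == null || rawValue.matches(validValues);
        }

        public String getName() { return name; }
        public String getDescription() { return description; }
    }

An entry of this kind can serve double duty: the same attributes that drive validation can be rendered into user documentation or exported for use by external systems.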
2. EVALUATION OF TECHNOLOGIES

The development team identified and investigated existing software that it could potentially use, considering both open source and commercial packages. The review process was informal and involved several team members. The primary evaluation criteria included requirements for:

· a highly configurable system;
· the ability to plug into a Java-based web portal;
· response time suitable for interactive sessions;
· operation based on a data dictionary specified in a standard format/language;
· reading fixed length and delimited file formats; and
· supporting both standard and highly customized types of data validation.

Leading commercial systems such as IBM's DataStage [1] and Informatica's PowerCenter [2] were considered. Open source systems such as Enhydra Octopus [3], clover.ETL [4], and Xineo XIL [5] were also reviewed. We determined that currently available open source systems often lacked the ability to handle fixed-length file formats and sufficiently customized data validation. Commercial systems were generally unable to plug into Java in the required manner. For some, the client software was only available for the Windows operating system, and the software was, in general, prohibitively expensive.

Based on these and other considerations, we decided to develop a new data integration package with XML utilized throughout for data dictionary specifications, validation rules, and source-to-target mappings. XML was selected based on its wide acceptance throughout the computing industry and its flexibility in supporting many different types of specifications.

The selection of XML as the runtime specification medium recognizes its suitability and wide acceptance for use in such situations. As noted in [6], XML supports document structures that can be nested to any level of complexity. This makes it especially suitable for representing complex validation rules, which are sometimes best expressed as hierarchies of assertions.

The system design placed heavy emphasis on the use of metadata to specify many aspects of process behavior. This mirrors many successful systems that use metadata, especially XML, to document the structure and meaning of data and their inter-relationships, as discussed in [7], [8], [9]. The system implementation relies on XML-based metadata constructs for expressing data transformation and integration processes in heterogeneous database environments, as discussed in [10], [11], [12].

Java was chosen as the implementation language because of its use within similar RTI projects, its ability to handle complex data loading and validation, and its suitability for direct connectivity within the project web portal. Because Java is platform independent, the package can be deployed to different operating environments if needed.

3. SYSTEM DESIGN & DEVELOPMENT

Throughout development, the team followed industry-standard procedures to manage development in an environment with changing requirements. Data model development utilized Computer Associates' ERwin® Data Modeler product. The open source Concurrent Versions System (CVS) was chosen for source control, and the open source Bugzilla system was used to manage and track issues. Software testing used Selenium, JUnit, and DbUnit.

Frameworks for the data integration engine include J2SE 1.5, Spring, and Hibernate. Spring is an increasingly important Java integration technology, often the choice for stand-alone and web applications. It has many architectural benefits, including effectively organizing middle-tier objects, reducing the proliferation of singletons, eliminating the need to use a variety of custom properties file formats, and using a consistent XML-based approach. Hibernate is a high performance object-relational persistence and query service.
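As a minimal sketch of how a Spring-wired engine can be bootstrapped outside the web portal, the following assumes a bean definition file named engine-context.xml and a bean id of integrationEngine; both names, and the use of Runnable as the engine's interface, are illustrative rather than taken from the production system.

    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class EngineLauncher {

        public static void main(String[] args) {
            // Load the XML bean definitions (file name is an assumption).
            ClassPathXmlApplicationContext context =
                    new ClassPathXmlApplicationContext("engine-context.xml");
            try {
                // Look up the fully wired engine and run one integration pass.
                Runnable engine = (Runnable) context.getBean("integrationEngine");
                engine.run();
            } finally {
                context.close();
            }
        }
    }

Embedding the engine within the web portal differs only in how the application context is created; the bean definitions remain the same XML.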
Data Integration Engine Processing

Our design decouples the system's base functionality code from the specific rules concerning data loading and validation constraints. We implemented this design using XML metadata files of our own design. These files govern engine operation and provide the flexibility to handle changing requirements. The formats of data sources and targets, data file parsing information, source-to-target mappings, and the data validation and error reporting information are all defined in XML.

The engine first reads all available XML files into memory as a singleton object named XMLTemplate. As source files are read into memory, they are handled by the appropriate parser according to the extension of the input file name (.csv, .txt, or .dat). The parsed data are stored in memory as a two-dimensional array.

Methods of the XMLTemplate object are then called to perform data conversion and data validation. The data are then stored in an object and passed to the database module.

The database module acquires both the data and the loading instructions stored in the XMLTemplate object, and then uses this information to update existing records, insert new records, or mark records as no longer active in the database.

When records are inserted or updated, an MD5 checksum is generated from the new record values and stored with the record. The system uses the checksum values to determine whether a submitted record represents a change from the existing record; this also supports generation of file submission statistics.
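A simplified sketch of the extension-based parser selection described above follows. The delimiter assumptions (.csv as comma separated, .txt as tab delimited, .dat as fixed length), the method names, and the fixed-width handling are ours for illustration; in particular, the naive comma split does not handle quoted values as a production parser would.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class SourceFileReader {

        // Reads a source file and returns its rows as a two-dimensional
        // structure (one String[] of field values per input record).
        public static List<String[]> read(String fileName, int[] fixedWidths)
                throws IOException {
            List<String[]> rows = new ArrayList<String[]>();
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (fileName.endsWith(".csv")) {
                        rows.add(line.split(",", -1));           // comma separated
                    } else if (fileName.endsWith(".txt")) {
                        rows.add(line.split("\t", -1));          // tab delimited
                    } else if (fileName.endsWith(".dat")) {
                        rows.add(splitFixed(line, fixedWidths)); // fixed-length fields
                    } else {
                        throw new IOException("Unsupported file type: " + fileName);
                    }
                }
            } finally {
                reader.close();
            }
            return rows;
        }

        // Cuts a fixed-length record into fields using the supplied widths.
        private static String[] splitFixed(String line, int[] widths) {
            String[] fields = new String[widths.length];
            int pos = 0;
            for (int i = 0; i < widths.length; i++) {
                int end = Math.min(pos + widths[i], line.length());
                fields[i] = line.substring(pos, end).trim();
                pos = end;
            }
            return fields;
        }
    }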
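The checksum comparison can be sketched as follows. The method names and the field separator byte are illustrative, but the approach of hashing the submitted values and comparing the result against the checksum stored with the existing row follows the mechanism described above.

    import java.nio.charset.Charset;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class RecordChecksum {

        // Returns an MD5 checksum, as a hex string, over a record's field values.
        public static String md5Of(String[] fieldValues) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                Charset utf8 = Charset.forName("UTF-8");
                for (String value : fieldValues) {
                    md5.update((value == null ? "" : value).getBytes(utf8));
                    // A separator byte keeps adjacent values from concatenating
                    // into the same byte stream for two different records.
                    md5.update((byte) 0x1F);
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md5.digest()) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 algorithm not available", e);
            }
        }

        // True if the submitted record differs from the stored checksum.
        public static boolean hasChanged(String[] submittedValues, String storedChecksum) {
            return !md5Of(submittedValues).equals(storedChecksum);
        }
    }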