
Cognizant 20-20 Insights

From Relational Database Management to Big Data: Solutions for Data Migration Testing

A successful approach to big data migration testing requires end-to-end automation and swift verification of huge volumes of data to produce quick and lasting results.

Executive Summary

Large enterprises face numerous challenges connecting their multiple CRM applications and systems with end users across the multitude of products they offer. When their disparate data is spread across multiple systems, these enterprises cannot:

• Conduct sophisticated analytics that substantially improve business decision-making.

• Offer better search and data sharing.

• Gain a holistic view of a single individual across multiple identities; customers may have multiple accounts due to multiple locations or devices such as company or Facebook IDs.

• Unlock the power of their data to create reports using tools of their choice.

In such situations, companies lose the ability to understand customers. Overcoming these obstacles is critical to gaining the insights needed to customize user experience and personalize interactions. By applying Code Halo™1 thinking – and distilling insights from the swirl of data that surrounds people, processes, organizations and devices – companies of all shapes and sizes and across all sectors can gain a deep understanding of customers. Such insight will reveal what customers are buying, doing, saying, thinking and feeling, as well as what they need.

But this requires capturing and analyzing huge pools of interactional and transactional data. Capturing such large data sets, however, has created a double-edged sword for many companies. On the plus side, it affords companies the opportunity to make meaning from Code Halo intersections; the downside is figuring out how and where to store all this data.

Enter Hadoop, the de facto open source standard that is increasingly being used by many companies in large data migration projects. Hadoop is an open-source framework that allows for the distributed processing of large data sets. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. As data from different sources flows into Hadoop, the biggest challenge is validating the data that moves from source to Hadoop. In fact, according to a report published by IDG Enterprise, "70% of enterprises have either deployed or are planning to deploy big data projects and programs this year."2

With the huge amount of data migrated to Hadoop and other big data platforms, the challenge of validation emerges. The simple, widely used but cumbersome solution is manual validation. However, this is not scalable and may not offer any significant value-add to customers. It impacts project schedules, and testing cycle times can get squeezed.

This white paper posits a solution: a framework that can be adopted across industries to perform effective big data migration testing entirely with open-source tools.

Challenges in RDBMS to Big Data Migration Testing

Big data migration typically involves multiple source systems and large volumes of data. However, most organizations lack the open-source tools to handle this important task. The right tool should be quick to set up and offer multiple customization options. Migration generally happens in entity batches: a set of entities is selected, migrated and tested, and this cycle goes on until all application data is migrated.

An easily scalable solution can reduce consecutive testing cycles. Even minimal human intervention can hinder testing efforts. Another challenge comes when defining effective scenarios for each entity. Performing 100% field-to-field validation of data is ideal, but when the data volume is in petabytes, test execution duration increases tremendously. A proper sampling method should be adopted, and solid sampling rules should be considered in testing.

Big Data Migration Process

Hadoop as a service is offered by Amazon Web Services (AWS), a cloud computing solution that abstracts the operational challenges of running Hadoop, making medium- and large-scale data processing accessible, easy, fast and inexpensive. The typical services used include Amazon S3 (Simple Storage Service) and Amazon EMR (Elastic MapReduce). Also preferred is Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service.

The migration to the AWS Hadoop environment is a three-step process:

• Cloud service: Virtual machines/physical machines are used to connect to the source databases and extract the tables using Sqoop, which pushes them to Amazon S3 (see the sketch following this list).

• Cloud storage: Amazon S3 cloud storage is used for all the data sent by the virtual machines. It stores the data as flat files.

• Data processing: Amazon EMR processes and distributes vast amounts of data using Hadoop. The data is read from S3 and stored as Hive tables (see Glossary).
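As an illustration of the extraction step, a Sqoop import of a single source table landing in S3 as flat files might look like the following sketch. The JDBC connection string, credentials, table name and bucket path are hypothetical and are not values prescribed by this paper.

#!/bin/sh
# Hypothetical sketch: extract one RDBMS table with Sqoop and land it in S3
# as comma-delimited flat files. Connection details, table and bucket names
# are illustrative assumptions only.
sqoop import \
  --connect jdbc:oracle:thin:@//source-db-host:1521/SIEBELPRD \
  --username qa_reader \
  --password-file /user/qa/.db_password \
  --table S_CONTACT \
  --fields-terminated-by ',' \
  --target-dir s3n://migration-bucket/landing/S_CONTACT \
  --num-mappers 4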

RDBMS to Big Data Migration Testing Solution

Step 1: Define Scenarios

To test migrated data, a one-to-one comparison of all the entities is required. Since big data volumes are (as the term suggests) huge, three test scenarios are performed for each entity:

• Count reconciliation for all rows.

• Find missing primary keys for all rows.

• Compare field-to-field data for sample records.

These steps are required to, first, verify the record count in the source DB and the target DB; second, to ensure that all records from the source systems flow to the target DB, which is done by checking the primary key in the source system and the target system for all records and confirms that all records are present in the target DB; and third, and most important, to compare the source and target databases across all columns for sample records. This ensures that the data is not corrupted, date formats are maintained and data is not truncated. The number of records for sample testing can be decided according to the data volume; a basic validation can be achieved by testing 100 sample records.

Step 2: Choose the Appropriate Method of Testing

Per our analysis, we shortlisted two methods of testing:

• UNIX shell script and T-SQL-based reconciliation.

• PIG scripting.

Figure 1: Testing Approach Comparison

UNIX Shell Script and T-SQL-Based Reconciliation

• Prerequisites: Load target Hadoop data into the central QA server (SQL Server) as different entities and validate against the source tables. A SQL Server database is used to store the tables and perform the comparison using SQL queries. A preconfigured linked server in the SQL Server DB is needed to connect to all source databases (see the sketch after this comparison).

• Effort: Initial coding for five to 10 tables takes one week; consecutive additions take two days for ~10 tables.

• Automation/manual: Full automation possible.

• Performance (on Windows XP, 3 GB RAM, 1 CPU): Delivers results quickly compared with other methods. For 15 tables with an average of 100K records: ~30 minutes for count, ~20 minutes for a 100-record sample, ~1 hour for missing primary keys.

• Highlights: Full automation and job scheduling possible; fast comparison; no permission or security issues faced while accessing big data on AWS.

• Low points: Initial framework setup is time-consuming.

PIG Scripting

• Prerequisites: Migrate data from RDBMS to HDFS and compare QA HDFS files with Dev HDFS files using Pig scripting. Flat files for each entity are created using the Sqoop tool.

• Effort: Compares flat files; scripting is needed for each column in the table, so effort is proportionate to the number of tables and their columns.

• Automation/manual: No automation possible; manual.

• Performance: Needs migration of the source tables to HDFS files as a prerequisite, which is time-consuming; the processing itself can be faster than other methods.

• Highlights: Offers a lot of flexibility in coding; very useful for more complex transformations.

• Low points: Greater effort for decoding, reporting results and handling script errors.
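The linked-server prerequisite noted above can be illustrated with a short sketch. The server alias, Oracle provider, data source and table names below are hypothetical assumptions, not values from this paper.

#!/bin/sh
# Hypothetical sketch: register an Oracle source as a linked server on the
# central QA SQL Server, then pull a source row count through OPENQUERY.
# Server names, alias, provider and table names are illustrative only.
sqlcmd -S qa-sql-server -d MigrationQA -Q "
EXEC sp_addlinkedserver
     @server     = N'SIEBEL_SRC',
     @srvproduct = N'Oracle',
     @provider   = N'OraOLEDB.Oracle',
     @datasrc    = N'SIEBELPRD';

SELECT * FROM OPENQUERY(SIEBEL_SRC, 'SELECT COUNT(*) AS ROW_CNT FROM S_CONTACT');
"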

Another option is to use the Microsoft Hive ODBC Driver to access Hive data, but this approach is more appropriate for smaller volumes.

Figure 1 shows a comparison of the two methods. Hence, based on this comparison, we recommend a focus on the first approach, where full end-to-end automation is possible. If any transformations are present, they need to be performed in the staging layer – which can then be treated as the source – to implement a similar solution. According to the above analysis, PIG scripting is more appropriate for testing migration with complex transformation logic. But for this type of simple migration, the PIG scripting approach is very time-consuming and resource-intensive.

Figure 2: High-Level Validation Approach. Data from the source systems (Oracle, SQL Server, MySQL and any other RDBMS) is migrated to HDFS using Sqoop (ETL) and exposed as Hive tables on AWS Hadoop (loaded with LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename). Driven by a CSV file of Hive table names, a dynamically generated shell script extracts from Hive a CSV file with the record count per table, a CSV file with the ROW_IDs from all tables, and a CSV file with the first 100 records of all columns. WinSCP downloads these files from the UNIX server to a Windows server, where SQL batch files load their contents into QA tables on the QA DB server (SQL Server). Stored procedures then compute the count of each table from the source systems (via a linked server to the various source DBs) and compare them with the Hive results, pull ROW_IDs from all source tables to find missing or extra ones in the Hive results, and compare source column data for the sample records pulled from Hive, reporting any data mismatch. A Jenkins slave machine triggers the Windows batch scripts end to end.


Step 3: Build a Framework

Here we bring data from Hadoop into a SQL Server consolidation database and validate it against the source. Figure 2 illustrates the set of methods we recommend.

• UNIX shell scripting: In the migration process, the development team uses the Sqoop tool to migrate RDBMS tables as HDFS files; the files are stored in Amazon S3 and loaded into tables defined in the Hive metastore with the LOAD DATA INPATH command (see the sketch after this list). To fetch data from Hive to a flat file:

>> Store the table list in a CSV file on a UNIX server.

>> Write a UNIX shell script that takes the table list CSV file as input and generates another shell script to extract the Hive data into CSV files for each table.

»» This generated shell script will be executed from the Windows batch file.

>> Dynamically generating the UNIX shell script ensures that only the table list CSV file needs to be updated in each iteration/release when new tables are added.
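For context, the development-side load described in the bullet above might look like the following sketch; the table name, columns and HDFS path are hypothetical assumptions for illustration.

#!/bin/sh
# Hypothetical sketch: define a Hive table and load a Sqoop-generated HDFS
# file into it. Table name, columns and path are illustrative only.
hive -e "
CREATE TABLE IF NOT EXISTS s_contact (
  row_id     STRING,
  first_name STRING,
  last_name  STRING,
  created    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/sqoop/s_contact/part-m-00000'
INTO TABLE s_contact;
"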

Figure 3: Sample Code from Final Shell Script
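A minimal sketch of the kind of generated extraction script that Figure 3 refers to is shown below. The table-list file, output directory, Hive database name and ROW_ID column are assumptions for illustration.

#!/bin/sh
# Minimal sketch: read a table-list CSV and export, for each Hive table,
# a record count, all ROW_IDs and the first 100 sample records as CSV files.
# File names, the Hive database and the ROW_ID column are assumptions.

TABLE_LIST=/qa/config/tables.csv     # one Hive table name per line
OUT_DIR=/qa/extracts
HIVE_DB=migration_qa

mkdir -p "$OUT_DIR"
: > "$OUT_DIR/hive_counts.csv"       # start with an empty counts file

while IFS=',' read -r TABLE; do
    [ -z "$TABLE" ] && continue

    # Record count per table, written as "table_name,count"
    hive -S -e "SELECT '$TABLE', COUNT(*) FROM $HIVE_DB.$TABLE;" \
        | tr '\t' ',' >> "$OUT_DIR/hive_counts.csv"

    # All primary keys (ROW_ID) for missing-primary-key reconciliation
    hive -S -e "SELECT row_id FROM $HIVE_DB.$TABLE;" \
        > "$OUT_DIR/${TABLE}_row_ids.csv"

    # First 100 records of all columns for field-to-field sampling
    hive -S -e "SELECT * FROM $HIVE_DB.$TABLE LIMIT 100;" \
        | tr '\t' ',' > "$OUT_DIR/${TABLE}_sample100.csv"
done < "$TABLE_LIST"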

Figure 4: Results of Count Reconciliation for Migrated Hive Tables (from a Web Page)

SIEBEL HIVE COUNT CLUSTER RECON SUMMARY
AUD_ID: 153 | Execution date: 2015-04-14 21:55:27.787 | Schema: SIEBELPRD | Target DB: HIVE | Total pass: 60 | Total fail: 94 | Environment: PRD

SIEBEL HIVE COUNT CLUSTER RECON DETAIL (AUD_ID 153, executed 2015-04-14 21:55:27.787; source and Hive table names are identical for each entry)

SEQ  TABLE NAME        SOURCE_ROW_CNT  HIVE_ROW_CNT  DIFF    % DIFF  STATUS
1    S_ADDR_PER        353420          343944        9476    2.68    FAIL
2    S_PARTY           2730468         2730468       0       0       PASS
3    S_ORG_GROUP       16852           16852         0       0       PASS
4    S_LST_OF_VAL      29624           29624         0       0       PASS
5    S_GROUP_CONTACT   413912          413912        0       0       PASS
6    S_CONTACT         1257758         1257758       0       0       PASS
7    S_CON_ADDR        6220            6220          0       0       PASS
8    S_CIF_CON_MAP     28925           28925         0       0       PASS
9    S_ADDR_PER        93857           93857         0       0       PASS
10   S_PROD_LN         1114            1106          8       0.72    FAIL
11   S_ASSET_REL       696178          690958        5220    0.75    FAIL
12   S_AGREE_ITM_REL   925139          917657        7482    0.81    FAIL
13   S_REVN            131111          128949        2162    1.65    FAIL
14   S_ENTLMNT         127511          125144        2367    1.86    FAIL
15   S_ASSET_XA        5577029         5457724       119305  2.14    FAIL
16   S_BU              481             470           11      2.29    FAIL
17   S_ORG_EXT         345276          336064        9212    2.67    FAIL
18   S_ORG_BU          345670          336424        9246    2.67    FAIL

• WinSCP: The next step is to transfer the files from the Hadoop environment to the Windows server. The WinSCP batch command interface can be implemented for this. The WinSCP batch file (.SCP) connects to the Hadoop environment using an open sftp command; a simple GET command with the file name copies the file to the Windows server.

• SQL Server database usage: The SQL Server is the main database used for loading the Hive data and the final reconciliation results. A SQL script is created to load data from the .CSV file into the database table; the script uses the Bulk Insert command.

• Windows batch command: The above-mentioned process of transferring the data in .CSV files, importing the files into the SQL Server and validating the source and target data should all be done sequentially. All validation processes can be automated by creating a Windows batch file. The batch file executes the shell script from the Windows server on the Hadoop environment using the Plink command. In this way, all Hive data is loaded into the SQL Server tables. The next step is to execute the SQL Server procedures that perform the count/primary key/sample data comparisons; we use SQLCMD to execute the SQL Server procedures from the batch file (a sketch follows this list).

• Jenkins: End-to-end validation can be triggered by Jenkins. Jenkins jobs can be scheduled to execute on an hourly/daily/weekly basis without manual intervention. An auto-scheduled ANT script on Jenkins invokes a Java program that connects to the SQL Server and generates an HTML report of the latest records; Jenkins jobs can e-mail the results to predefined recipients in HTML format.
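For illustration, the bulk load and the reconciliation checks described above can be exercised from the command line with sqlcmd. The white paper implements these checks as stored procedures invoked from a Windows batch file; the server, database, file paths and table/column names below are hypothetical.

#!/bin/sh
# Hypothetical sketch: load a Hive count extract into a staging table and run
# the count, missing-primary-key and sample-data reconciliations with plain
# T-SQL. Server, database, paths and table/column names are illustrative only.

cat > /tmp/recon.sql <<'SQL'
-- Load the Hive count CSV produced on the UNIX server.
-- ROWTERMINATOR 0x0a is the linefeed (CHAR(10)) used by Unix-generated files.
BULK INSERT dbo.HIVE_COUNTS
FROM 'C:\qa\extracts\hive_counts.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a');

-- Count reconciliation: source counts (pulled via the linked server into
-- SOURCE_COUNTS) compared with the loaded Hive counts.
SELECT s.TABLE_NAME,
       s.ROW_CNT             AS SOURCE_ROW_CNT,
       h.ROW_CNT             AS HIVE_ROW_CNT,
       s.ROW_CNT - h.ROW_CNT AS DIFF,
       CASE WHEN s.ROW_CNT = h.ROW_CNT THEN 'PASS' ELSE 'FAIL' END AS STATUS
FROM dbo.SOURCE_COUNTS s
JOIN dbo.HIVE_COUNTS   h ON h.TABLE_NAME = s.TABLE_NAME;

-- Missing primary keys: ROW_IDs present in the source but absent in Hive.
SELECT s.ROW_ID
FROM dbo.SOURCE_S_CONTACT s
LEFT JOIN dbo.HIVE_S_CONTACT h ON h.ROW_ID = s.ROW_ID
WHERE h.ROW_ID IS NULL;

-- Field-to-field comparison of the 100-record sample loaded from Hive
-- (both sample tables are assumed to share the same column layout).
SELECT * FROM dbo.HIVE_S_CONTACT_SAMPLE
EXCEPT
SELECT * FROM dbo.SOURCE_S_CONTACT_SAMPLE;
SQL

sqlcmd -S qa-sql-server -d MigrationQA -i /tmp/recon.sql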

Figure 5: Overcoming Test Migration Challenges

1. Issue: Column data type mismatch errors while loading .CSV files and Hive data into the SQL Server table. Resolution: Create the tables in SQL Server by matching the Hive table data types.

2. Issue: No FTP access in the Hadoop environment to transfer files. Resolution: Use WinSCP software.

3. Issue: Column sequence mismatch between Hive tables and source tables, which results in failure to load the .CSV files into the HIVE_* tables. Resolution: Create the tables in SQL Server for target entities by matching the Hive table column order.

4. Issue: Inability to load .CSV files due to an end-of-file issue in SQL Server bulk insert. Resolution: Update the SQL statement with the appropriate row terminator, the char(10) linefeed, which allows import of .CSV files generated on a Unix server.

5. Issue: Performance issues on primary key validations. Resolution: Performance tuning of the SQL Server stored procedures, increasing the temp DB space of SQL Server, etc.

6. Issue: Handling commas in the column values. Resolution: Create a TSV file so that embedded commas do not cause issues while the data is loading; remove NULL/null values from the TSV and generate a .txt file; finally, convert from UTF-8 to UTF-16 and generate an XLS file, which can be loaded into the SQL Server database (see the sketch after this table).
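Resolution 6 might be scripted along these lines; the file names, Hive table and exact cleanup rules are illustrative assumptions rather than the paper's implementation.

#!/bin/sh
# Hypothetical sketch for resolution 6: export tab-separated data, strip NULL
# markers with a simplified global replacement, and convert the encoding so
# the file loads cleanly downstream. Names and paths are illustrative only.
hive -S -e "SELECT * FROM migration_qa.s_contact" > s_contact.tsv   # Hive output is tab-delimited
sed -e 's/NULL//g' -e 's/null//g' s_contact.tsv > s_contact.txt     # remove NULL/null markers
iconv -f UTF-8 -t UTF-16LE s_contact.txt > s_contact_utf16.txt      # UTF-8 to UTF-16 conversion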

Figure 6: The Advantages of a Big Data Migration Testing Framework*

Scenario               Manual (mins.)   Framework (mins.)   Gain (mins.)   % Gain
Count                  20               2                   18             90.00%
Sample 100 Records     100              1.3                 98.7           98.70%
Missing Primary Key    40               4                   36             90.00%

* Effort calculated for one table with around 500K records, including summary report generation.

Implementation Issues and Resolutions

Organizations may face a number of implementation issues. Figure 5 provides probable resolutions.

Impact of Testing

Figure 6 summarizes the gains over manual Excel-based testing when using our framework for one customer's CRM applications, based on Oracle and SQL Server databases.

Looking Forward

More and more organizations are using big data tools and techniques to quickly and effectively analyze data for improved customer understanding and product/service delivery. This white paper presents a framework to help organizations conduct big data migration testing more quickly, efficiently and accurately. As your organization moves forward, here are key points to consider before implementing a framework like the one presented in this white paper:

• Think big when it comes to big data testing. Choose an optimum data subset for testing; sampling should be based on geographies, priority customers, customer types, product types and product mix.

• Create an environment that can accommodate huge data sets. Cloud setups are recommended.

• Be aware of the Agile/Scrum cadence mismatch. Break up data into smaller incremental blocks as a work-around.

• Get smart about open-source capabilities. Spend a good amount of time up front understanding the tools and techniques that drive success.

Glossary

• AWS: Amazon Web Services is a collection of remote computing services, also called Web services, that make up a cloud computing platform from Amazon.com.

• Amazon EMR: Amazon Elastic MapReduce is a Web service that makes it easy to quickly and cost-effectively process vast amounts of data.

• Amazon S3: Amazon Simple Storage Service provides developers and IT teams with secure, durable, highly scalable object storage.

• Hadoop: Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.

• Hive: Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. Amazon maintains a software fork of Apache Hive that is included in Amazon EMR on AWS.

• Jenkins: Jenkins is an open-source, continuous-integration tool written in Java. Jenkins provides continuous integration services for software development.

• PIG scripting: Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to that of SQL for RDBMS systems.

• RDBMS: A relational database management system is a database management system (DBMS) based on the relational model as invented by E.F. Codd of IBM's San Jose Research Laboratory.

• Sqoop: Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop and then load the data into HDFS. This process is briefly called extract, transform and load (ETL).

• WinSCP: Windows Secure Copy is a free and open-source SFTP, SCP and FTP client for Microsoft Windows. Its main function is secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager and file synchronization functionality.

• UNIX shell scripting: A shell script is a computer program designed to be run by the Unix shell, a command-line interpreter.

• T-SQL: Transact-SQL is Microsoft's and Sybase's proprietary extension to Structured Query Language (SQL).

Footnotes

1 For more on Code Halos and innovation, read "Code Rules: A Playbook for Managing at the Crossroads," Cognizant Technology Solutions, June 2013, http://www.cognizant.com/Futureofwork/Documents/code-rules.pdf, and the book, Code Halos: How the Digital Lives of People, Things, and Organizations Are Changing the Rules of Business, by Malcolm Frank, Paul Roehrig and Ben Pring, published by John Wiley & Sons, April 2014, http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118862074.html.

2 2014 IDG Enterprise Big Data Report, http://www.idgenterprise.com/report/big-data-2.

About the Author Rashmi Khanolkar is a Senior Architect within Cognizant’s Comms-Tech Business Unit. Proficient in appli- cation architecture, data architecture and technical design, Rashmi has 15-plus years of experience in the software industry. She has managed multiple data migration quality projects involving large volumes of data. Rashmi also has extensive experience on multiple development projects on .Net and Moss 2007, and has broad knowledge within the CRM, insurance and banking domains. She can be reached at [email protected].

About Cognizant Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world’s leading companies build stronger busi- nesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfac- tion, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters: 500 Frank W. Burr Blvd., Teaneck, NJ 07666 USA. Phone: +1 201 801 0233. Fax: +1 201 801 0243. Toll Free: +1 888 937 3277. Email: [email protected]

European Headquarters: 1 Kingdom Street, Paddington Central, London W2 6BD. Phone: +44 (0) 20 7297 7600. Fax: +44 (0) 20 7121 0102. Email: [email protected]

India Operations Headquarters: #5/535, Old Mahabalipuram Road, Okkiyam Pettai, Thoraipakkam, Chennai, 600 096 India. Phone: +91 (0) 44 4209 6000. Fax: +91 (0) 44 4209 6060. Email: [email protected]

­­© Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners. Codex 1439