
Cognizant 20-20 Insights

From Relational Database Management to Big Data: Solutions for Data Migration Testing

A successful approach to big data migration testing requires end-to-end automation and swift verification of huge volumes of data to produce quick and lasting results.

Executive Summary

Large enterprises face numerous challenges connecting their multiple CRM applications and systems with end users across the multitude of products they offer. When their disparate data is spread across multiple systems, these enterprises cannot:

• Conduct sophisticated analytics that substantially improve business decision-making.

• Offer better search and data sharing.

• Gain a holistic view of a single individual across multiple identities; customers may have multiple accounts due to multiple locations or devices such as company or Facebook IDs.

• Unlock the power of their data to create reports using tools of their choice.

In such situations, companies lose the ability to understand customers. Overcoming these obstacles is critical to gaining the insights needed to customize user experience and personalize interactions. By applying Code Halo™1 thinking – and distilling insights from the swirl of data that surrounds people, processes, organizations and devices – companies of all shapes and sizes and across all sectors can gain a deep understanding of customers. Such insight will reveal what customers are buying, doing, saying, thinking and feeling, as well as what they need.

But this requires capturing and analyzing huge pools of interactional and transactional data. Capturing such large data sets, however, has created a double-edged sword for many companies. On the plus side, it affords companies the opportunity to make meaning from Code Halo intersections; the downside is figuring out how and where to store all this data.

Enter Hadoop, the de facto open source standard that is increasingly being used by many companies in large data migration projects. Hadoop is an open-source framework that allows for the distributed processing of large data sets. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. As data from different sources flows into Hadoop, the biggest challenge is validating the data that moves from source to Hadoop. In fact, according to a report published by IDG Enterprise, "70% of enterprises have either deployed or are planning to deploy big data projects and programs this year."2

With the huge amount of data migrated to Hadoop and other big data platforms, the challenge of validation emerges. The simple, widely used but cumbersome solution is manual validation. However, this is not scalable and may not offer any significant value-add to customers. It impacts project schedules, and testing cycle times can get squeezed.

This white paper posits a solution: a framework that can be adopted across industries to perform effective big data migration testing entirely with open-source tools.

Challenges in RDBMS to Big Data Migration Testing

Big data migration typically involves multiple source systems and large volumes of data. However, most organizations lack the open-source tools to handle this important task. The right tool should be quick to set up and offer multiple customization options. Migration generally happens in entity batches: a set of entities is selected, migrated and tested, and this cycle goes on until all application data is migrated.

An easily scalable solution can reduce consecutive testing cycles. Even minimal human intervention can hinder testing efforts. Another challenge comes when defining effective scenarios for each entity. Performing 100% field-to-field validation of data is ideal, but when the data volume is in petabytes, test execution duration increases tremendously. A proper sampling method should be adopted, and solid sampling rules should be considered in testing.

Big Data Migration Process

Hadoop as a service is offered by Amazon Web Services (AWS), a cloud computing solution that abstracts the operational challenges of running Hadoop, making medium- and large-scale data processing accessible, easy, fast and inexpensive. The typical services used include Amazon S3 (Simple Storage Service) and Amazon EMR (Elastic MapReduce). Also preferred is Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service.

The migration to the AWS Hadoop environment is a three-step process:

• Cloud service: Virtual machines/physical machines are used to connect to the source databases and extract the tables using Sqoop, which pushes them to Amazon S3 (see the sketch following this list).

• Cloud storage: Amazon S3 cloud storage is used for all the data sent by the virtual machines. It stores the data as flat files.

• Data processing: Amazon EMR processes and distributes vast amounts of data using Hadoop. The data is read from S3 and stored as Hive tables (see Glossary).
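As an illustration of the extraction step, a Sqoop import of a single source table landing in S3 as flat files might look like the following sketch. The JDBC connection string, credentials, table name and bucket path are hypothetical and are not values prescribed by this paper.

#!/bin/sh
# Hypothetical sketch: extract one RDBMS table with Sqoop and land it in S3
# as comma-delimited flat files. Connection details, table and bucket names
# are illustrative assumptions only.
sqoop import \
  --connect jdbc:oracle:thin:@//source-db-host:1521/SIEBELPRD \
  --username qa_reader \
  --password-file /user/qa/.db_password \
  --table S_CONTACT \
  --fields-terminated-by ',' \
  --target-dir s3n://migration-bucket/landing/S_CONTACT \
  --num-mappers 4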

RDBMS to Big Data Migration Testing Solution

Step 1: Define Scenarios

To test migrated data, a one-to-one comparison of all the entities is required. Since big data volumes are (as the term suggests) huge, three test scenarios are performed for each entity:

• Count reconciliation for all rows.

• Find missing primary keys for all rows.

• Compare field-to-field data for sample records.

These steps are required to, first, verify the record count in the source DB and the target DB; second, to ensure that all records from the source systems flow to the target DB, which is done by checking the primary key in the source system and the target system for all records and confirms that all records are present in the target DB; and third, and most important, to compare the source and target databases across all columns for sample records. This ensures that the data is not corrupted, date formats are maintained and data is not truncated. The number of records for sample testing can be decided according to the data volume; a basic validation can be achieved by testing 100 sample records.

Step 2: Choose the Appropriate Method of Testing

Per our analysis, we shortlisted two methods of testing:

• UNIX shell script and T-SQL-based reconciliation.

• PIG scripting.

Figure 1: Testing Approach Comparison

UNIX Shell Script and T-SQL-Based Reconciliation

• Prerequisites: Load target Hadoop data into the central QA server (SQL Server) as different entities and validate against the source tables. A SQL Server database is used to store the tables and perform the comparison using SQL queries. A preconfigured linked server in the SQL Server DB is needed to connect to all source databases (see the sketch after this comparison).

• Effort: Initial coding for five to 10 tables takes one week; consecutive additions take two days for ~10 tables.

• Automation/manual: Full automation possible.

• Performance (on Windows XP, 3 GB RAM, 1 CPU): Delivers results quickly compared with other methods. For 15 tables with an average of 100K records: ~30 minutes for count, ~20 minutes for a 100-record sample, ~1 hour for missing primary keys.

• Highlights: Full automation and job scheduling possible; fast comparison; no permission or security issues faced while accessing big data on AWS.

• Low points: Initial framework setup is time-consuming.

PIG Scripting

• Prerequisites: Migrate data from RDBMS to HDFS and compare QA HDFS files with Dev HDFS files using Pig scripting. Flat files for each entity are created using the Sqoop tool.

• Effort: Compares flat files; scripting is needed for each column in the table, so effort is proportionate to the number of tables and their columns.

• Automation/manual: No automation possible; manual.

• Performance: Needs migration of the source tables to HDFS files as a prerequisite, which is time-consuming; the processing itself can be faster than other methods.

• Highlights: Offers a lot of flexibility in coding; very useful for more complex transformations.

• Low points: Greater effort for decoding, reporting results and handling script errors.
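The linked-server prerequisite noted above can be illustrated with a short sketch. The server alias, Oracle provider, data source and table names below are hypothetical assumptions, not values from this paper.

#!/bin/sh
# Hypothetical sketch: register an Oracle source as a linked server on the
# central QA SQL Server, then pull a source row count through OPENQUERY.
# Server names, alias, provider and table names are illustrative only.
sqlcmd -S qa-sql-server -d MigrationQA -Q "
EXEC sp_addlinkedserver
     @server     = N'SIEBEL_SRC',
     @srvproduct = N'Oracle',
     @provider   = N'OraOLEDB.Oracle',
     @datasrc    = N'SIEBELPRD';

SELECT * FROM OPENQUERY(SIEBEL_SRC, 'SELECT COUNT(*) AS ROW_CNT FROM S_CONTACT');
"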

Another option is to use the Microsoft Hive ODBC Driver to access Hive data, but this approach is more appropriate for smaller volumes.

Figure 1 shows a comparison of the two methods. Hence, based on this comparison, we recommend a focus on the first approach, where full end-to-end automation is possible. If any transformations are present, they need to be performed in the staging layer – which can then be treated as the source – to implement a similar solution. According to the above analysis, PIG scripting is more appropriate for testing migration with complex transformation logic. But for this type of simple migration, the PIG scripting approach is very time-consuming and resource-intensive.

Figure 2: High-Level Validation Approach. Data from the source systems (Oracle, SQL Server, MySQL and any other RDBMS) is migrated to HDFS using Sqoop (ETL) and exposed as Hive tables on AWS Hadoop (loaded with LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename). Driven by a CSV file of Hive table names, a dynamically generated shell script extracts from Hive a CSV file with the record count per table, a CSV file with the ROW_IDs from all tables, and a CSV file with the first 100 records of all columns. WinSCP downloads these files from the UNIX server to a Windows server, where SQL batch files load their contents into QA tables on the QA DB server (SQL Server). Stored procedures then compute the count of each table from the source systems (via a linked server to the various source DBs) and compare them with the Hive results, pull ROW_IDs from all source tables to find missing or extra ones in the Hive results, and compare source column data for the sample records pulled from Hive, reporting any data mismatch. A Jenkins slave machine triggers the Windows batch scripts end to end.


Step 3: Build a Framework

Here we bring data from Hadoop into a SQL Server consolidation database and validate it against the source. Figure 2 illustrates the set of methods we recommend.

• UNIX shell scripting: In the migration process, the development team uses the Sqoop tool to migrate RDBMS tables as HDFS files; the files are stored in Amazon S3 and loaded into tables defined in the Hive metastore with the LOAD DATA INPATH command (see the sketch after this list). To fetch data from Hive to a flat file:

>> Store the table list in a CSV file on a UNIX server.

>> Write a UNIX shell script that takes the table list CSV file as input and generates another shell script to extract the Hive data into CSV files for each table.

»» This generated shell script will be executed from the Windows batch file.

>> Dynamically generating the UNIX shell script ensures that only the table list CSV file needs to be updated in each iteration/release when new tables are added.
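For context, the development-side load described in the bullet above might look like the following sketch; the table name, columns and HDFS path are hypothetical assumptions for illustration.

#!/bin/sh
# Hypothetical sketch: define a Hive table and load a Sqoop-generated HDFS
# file into it. Table name, columns and path are illustrative only.
hive -e "
CREATE TABLE IF NOT EXISTS s_contact (
  row_id     STRING,
  first_name STRING,
  last_name  STRING,
  created    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/sqoop/s_contact/part-m-00000'
INTO TABLE s_contact;
"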

Figure 3: Sample Code from Final Shell Script
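A minimal sketch of the kind of generated extraction script that Figure 3 refers to is shown below. The table-list file, output directory, Hive database name and ROW_ID column are assumptions for illustration.

#!/bin/sh
# Minimal sketch: read a table-list CSV and export, for each Hive table,
# a record count, all ROW_IDs and the first 100 sample records as CSV files.
# File names, the Hive database and the ROW_ID column are assumptions.

TABLE_LIST=/qa/config/tables.csv     # one Hive table name per line
OUT_DIR=/qa/extracts
HIVE_DB=migration_qa

mkdir -p "$OUT_DIR"
: > "$OUT_DIR/hive_counts.csv"       # start with an empty counts file

while IFS=',' read -r TABLE; do
    [ -z "$TABLE" ] && continue

    # Record count per table, written as "table_name,count"
    hive -S -e "SELECT '$TABLE', COUNT(*) FROM $HIVE_DB.$TABLE;" \
        | tr '\t' ',' >> "$OUT_DIR/hive_counts.csv"

    # All primary keys (ROW_ID) for missing-primary-key reconciliation
    hive -S -e "SELECT row_id FROM $HIVE_DB.$TABLE;" \
        > "$OUT_DIR/${TABLE}_row_ids.csv"

    # First 100 records of all columns for field-to-field sampling
    hive -S -e "SELECT * FROM $HIVE_DB.$TABLE LIMIT 100;" \
        | tr '\t' ',' > "$OUT_DIR/${TABLE}_sample100.csv"
done < "$TABLE_LIST"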

Figure 4: Results of Count Reconciliation for Migrated Hive Tables (from a Web Page)

SIEBEL HIVE COUNT CLUSTER RECON SUMMARY
AUD_ID: 153 | Execution date: 2015-04-14 21:55:27.787 | Schema: SIEBELPRD | Target DB: HIVE | Total pass: 60 | Total fail: 94 | Environment: PRD

SIEBEL HIVE COUNT CLUSTER RECON DETAIL (AUD_ID 153, executed 2015-04-14 21:55:27.787; source and Hive table names are identical for each entry)

SEQ  TABLE NAME        SOURCE_ROW_CNT  HIVE_ROW_CNT  DIFF    % DIFF  STATUS
1    S_ADDR_PER        353420          343944        9476    2.68    FAIL
2    S_PARTY           2730468         2730468       0       0       PASS
3    S_ORG_GROUP       16852           16852         0       0       PASS
4    S_LST_OF_VAL      29624           29624         0       0       PASS
5    S_GROUP_CONTACT   413912          413912        0       0       PASS
6    S_CONTACT         1257758         1257758       0       0       PASS
7    S_CON_ADDR        6220            6220          0       0       PASS
8    S_CIF_CON_MAP     28925           28925         0       0       PASS
9    S_ADDR_PER        93857           93857         0       0       PASS
10   S_PROD_LN         1114            1106          8       0.72    FAIL
11   S_ASSET_REL       696178          690958        5220    0.75    FAIL
12   S_AGREE_ITM_REL   925139          917657        7482    0.81    FAIL
13   S_REVN            131111          128949        2162    1.65    FAIL
14   S_ENTLMNT         127511          125144        2367    1.86    FAIL
15   S_ASSET_XA        5577029         5457724       119305  2.14    FAIL
16   S_BU              481             470           11      2.29    FAIL
17   S_ORG_EXT         345276          336064        9212    2.67    FAIL
18   S_ORG_BU          345670          336424        9246    2.67    FAIL

• WinSCP: The next step is to transfer the files from the Hadoop environment to the Windows server. The WinSCP batch command interface can be implemented for this. The WinSCP batch file (.SCP) connects to the Hadoop environment using an open sftp command; a simple GET command with the file name copies the file to the Windows server.

• SQL Server database usage: The SQL Server is the main database used for loading the Hive data and the final reconciliation results. A SQL script is created to load data from the .CSV file into the database table; the script uses the Bulk Insert command.

• Windows batch command: The above-mentioned process of transferring the data in .CSV files, importing the files into the SQL Server and validating the source and target data should all be done sequentially. All validation processes can be automated by creating a Windows batch file. The batch file executes the shell script from the Windows server on the Hadoop environment using the Plink command. In this way, all Hive data is loaded into the SQL Server tables. The next step is to execute the SQL Server procedures that perform the count/primary key/sample data comparisons; we use SQLCMD to execute the SQL Server procedures from the batch file (a sketch follows this list).

• Jenkins: End-to-end validation can be triggered by Jenkins. Jenkins jobs can be scheduled to execute on an hourly/daily/weekly basis without manual intervention. An auto-scheduled ANT script on Jenkins invokes a Java program that connects to the SQL Server and generates an HTML report of the latest records; Jenkins jobs can e-mail the results to predefined recipients in HTML format.
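For illustration, the bulk load and the reconciliation checks described above can be exercised from the command line with sqlcmd. The white paper implements these checks as stored procedures invoked from a Windows batch file; the server, database, file paths and table/column names below are hypothetical.

#!/bin/sh
# Hypothetical sketch: load a Hive count extract into a staging table and run
# the count, missing-primary-key and sample-data reconciliations with plain
# T-SQL. Server, database, paths and table/column names are illustrative only.

cat > /tmp/recon.sql <<'SQL'
-- Load the Hive count CSV produced on the UNIX server.
-- ROWTERMINATOR 0x0a is the linefeed (CHAR(10)) used by Unix-generated files.
BULK INSERT dbo.HIVE_COUNTS
FROM 'C:\qa\extracts\hive_counts.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a');

-- Count reconciliation: source counts (pulled via the linked server into
-- SOURCE_COUNTS) compared with the loaded Hive counts.
SELECT s.TABLE_NAME,
       s.ROW_CNT             AS SOURCE_ROW_CNT,
       h.ROW_CNT             AS HIVE_ROW_CNT,
       s.ROW_CNT - h.ROW_CNT AS DIFF,
       CASE WHEN s.ROW_CNT = h.ROW_CNT THEN 'PASS' ELSE 'FAIL' END AS STATUS
FROM dbo.SOURCE_COUNTS s
JOIN dbo.HIVE_COUNTS   h ON h.TABLE_NAME = s.TABLE_NAME;

-- Missing primary keys: ROW_IDs present in the source but absent in Hive.
SELECT s.ROW_ID
FROM dbo.SOURCE_S_CONTACT s
LEFT JOIN dbo.HIVE_S_CONTACT h ON h.ROW_ID = s.ROW_ID
WHERE h.ROW_ID IS NULL;

-- Field-to-field comparison of the 100-record sample loaded from Hive
-- (both sample tables are assumed to share the same column layout).
SELECT * FROM dbo.HIVE_S_CONTACT_SAMPLE
EXCEPT
SELECT * FROM dbo.SOURCE_S_CONTACT_SAMPLE;
SQL

sqlcmd -S qa-sql-server -d MigrationQA -i /tmp/recon.sql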

Figure 5: Overcoming Test Migration Challenges

1. Issue: Column data type mismatch errors while loading .CSV files and Hive data into the SQL Server table. Resolution: Create the tables in SQL Server by matching the Hive table data types.

2. Issue: No FTP access in the Hadoop environment to transfer files. Resolution: Use WinSCP software.

3. Issue: Column sequence mismatch between Hive tables and source tables, which results in failure to load the .CSV files into the HIVE_* tables. Resolution: Create the tables in SQL Server for target entities by matching the Hive table column order.

4. Issue: Inability to load .CSV files due to an end-of-file issue in SQL Server bulk insert. Resolution: Update the SQL statement with the appropriate row terminator, the char(10) linefeed, which allows import of .CSV files generated on a Unix server.

5. Issue: Performance issues on primary key validations. Resolution: Performance tuning of the SQL Server stored procedures, increasing the temp DB space of SQL Server, etc.

6. Issue: Handling commas in the column values. Resolution: Create a TSV file so that embedded commas do not cause issues while the data is loading; remove NULL/null values from the TSV and generate a .txt file; finally, convert from UTF-8 to UTF-16 and generate an XLS file, which can be loaded into the SQL Server database (see the sketch after this table).
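Resolution 6 might be scripted along these lines; the file names, Hive table and exact cleanup rules are illustrative assumptions rather than the paper's implementation.

#!/bin/sh
# Hypothetical sketch for resolution 6: export tab-separated data, strip NULL
# markers with a simplified global replacement, and convert the encoding so
# the file loads cleanly downstream. Names and paths are illustrative only.
hive -S -e "SELECT * FROM migration_qa.s_contact" > s_contact.tsv   # Hive output is tab-delimited
sed -e 's/NULL//g' -e 's/null//g' s_contact.tsv > s_contact.txt     # remove NULL/null markers
iconv -f UTF-8 -t UTF-16LE s_contact.txt > s_contact_utf16.txt      # UTF-8 to UTF-16 conversion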

Figure 6: The Advantages of a Big Data Migration Testing Framework*

Scenario               Manual (mins.)   Framework (mins.)   Gain (mins.)   % Gain
Count                  20               2                   18             90.00%
Sample 100 Records     100              1.3                 98.7           98.70%
Missing Primary Key    40               4                   36             90.00%

* Effort calculated for one table with around 500K records, including summary report generation.

Implementation Issues and Resolutions

Organizations may face a number of implementation issues. Figure 5 provides probable resolutions.

Impact of Testing

Figure 6 summarizes the gains over manual Excel-based testing when using our framework for one customer's CRM applications, based on Oracle and SQL Server databases.

Looking Forward

More and more organizations are using big data tools and techniques to quickly and effectively analyze data for improved customer understanding and product/service delivery. This white paper presents a framework to help organizations conduct big data migration testing more quickly, efficiently and accurately. As your organization moves forward, here are key points to consider before implementing a framework like the one presented in this white paper:

• Think big when it comes to big data testing. Choose an optimum data subset for testing; sampling should be based on geographies, priority customers, customer types, product types and product mix.

• Create an environment that can accommodate huge data sets. Cloud setups are recommended.

• Be aware of the Agile/Scrum cadence mismatch. Break up data into smaller incremental blocks as a work-around.

• Get smart about open-source capabilities. Spend a good amount of time up front understanding the tools and techniques that drive success.

Glossary

• AWS: Amazon Web Services is a collection of remote computing services, also called Web services, that make up a cloud computing platform from Amazon.com.

• Amazon EMR: Amazon Elastic MapReduce is a Web service that makes it easy to quickly and cost-effectively process vast amounts of data.

• Amazon S3: Amazon Simple Storage Service provides developers and IT teams with secure, durable, highly scalable object storage.

• Hadoop: Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.

• Hive: Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. Amazon maintains a software fork of Apache Hive that is included in Amazon EMR on AWS.

• Jenkins: Jenkins is an open-source, continuous-integration tool written in Java. Jenkins provides continuous integration services for software development.

• PIG scripting: Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to that of SQL for RDBMS systems.

• RDBMS: A relational database management system is a database management system (DBMS) based on the relational model as invented by E.F. Codd of IBM's San Jose Research Laboratory.

• Sqoop: Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop and then load the data into HDFS. This process is briefly called extract, transform and load (ETL).

• WinSCP: Windows Secure Copy is a free and open-source SFTP, SCP and FTP client for Microsoft Windows. Its main function is secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager and file synchronization functionality.

• UNIX shell scripting: A shell script is a computer program designed to be run by the Unix shell, a command-line interpreter.

• T-SQL: Transact-SQL is Microsoft's and Sybase's proprietary extension to Structured Query Language (SQL).

Footnotes

1 For more on Code Halos and innovation, read "Code Rules: A Playbook for Managing at the Crossroads," Cognizant Technology Solutions, June 2013, http://www.cognizant.com/Futureofwork/Documents/code-rules.pdf, and the book, Code Halos: How the Digital Lives of People, Things, and Organizations Are Changing the Rules of Business, by Malcolm Frank, Paul Roehrig and Ben Pring, published by John Wiley & Sons, April 2014, http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118862074.html.

2 2014 IDG Enterprise Big Data Report, http://www.idgenterprise.com/report/big-data-2.

About the Author Rashmi Khanolkar is a Senior Architect within Cognizant’s Comms-Tech Business Unit. Proficient in appli- cation architecture, data architecture and technical design, Rashmi has 15-plus years of experience in the software industry. She has managed multiple data migration quality projects involving large volumes of data. Rashmi also has extensive experience on multiple development projects on .Net and Moss 2007, and has broad knowledge within the CRM, insurance and banking domains. She can be reached at [email protected].

About Cognizant Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world’s leading companies build stronger busi- nesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfac- tion, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters: 500 Frank W. Burr Blvd., Teaneck, NJ 07666 USA. Phone: +1 201 801 0233. Fax: +1 201 801 0243. Toll Free: +1 888 937 3277. Email: [email protected]

European Headquarters: 1 Kingdom Street, Paddington Central, London W2 6BD. Phone: +44 (0) 20 7297 7600. Fax: +44 (0) 20 7121 0102. Email: [email protected]

India Operations Headquarters: #5/535, Old Mahabalipuram Road, Okkiyam Pettai, Thoraipakkam, Chennai, 600 096 India. Phone: +91 (0) 44 4209 6000. Fax: +91 (0) 44 4209 6060. Email: [email protected]

­­© Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners. Codex 1439