Whitepaper
Testing Big Data Using the Hadoop Ecosystem
Table of contents

What is Big Data?
Hadoop and Big Data
Hadoop Explained
Hadoop Architecture
Hadoop Eco System Testing
Testing Hadoop in Cloud Environment
How Atos Is Using Hadoop
Testing Accomplishments
Conclusion
Appendices

What is Big Data?

Big data is the term for a collection of large datasets that cannot be processed using traditional computing techniques. Enterprise systems generate huge amounts of data, from terabytes to even petabytes of information. Big data is not merely data; it has become a complete subject, which involves various tools, techniques and frameworks. Specifically, big data relates to data creation, storage, retrieval and analysis that is remarkable in terms of volume, velocity and variety.

Hadoop and Big Data

Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to interpret or parse the results of big data searches through specific proprietary algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. Apache Hadoop is 100% open source and pioneered a fundamentally new way of storing and processing data: instead of relying on expensive, proprietary hardware and different systems to store and process data, it distributes both storage and processing across clusters of inexpensive, industry-standard servers.
Author profile

Padma Samvaba Panda is a test manager in the Atos Testing Practice. He has 10 years of IT experience encompassing software quality control, testing, requirement analysis and professional services. During his diversified career, he has delivered multifaceted software projects in a wide array of domains and in specialized testing areas such as big data, CRM, ERP and data migration. He is responsible for the quality and testing processes and for strategizing tools for big data implementations. He is a CSM (Certified Scrum Master) and is ISTQB certified. He can be reached at [email protected]

Hadoop Explained

Apache Hadoop runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.

The Apache Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding DataNodes, and a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
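To make the HDFS APIs concrete, here is a minimal sketch of writing and reading a file through the org.apache.hadoop.fs.FileSystem API. The NameNode address, path and replication factor are illustrative assumptions, not values taken from this paper; in practice fs.defaultFS comes from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode endpoint for illustration only.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/qa/sample.txt");

        // Write a file; HDFS splits it into blocks behind the scenes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask for three replicas per block, matching the replication story above.
        fs.setReplication(path, (short) 3);

        // Read the file back in the same way a MapReduce task would.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```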
Hadoop Architecture

As Google, Facebook, Twitter and other companies extended their services to web scale, the amount of data they collected routinely from user interactions online would have overwhelmed the capabilities of traditional IT architectures. So they built their own, and they released the code for many of the components into open source. Of these components, Apache Hadoop has rapidly emerged as the de facto standard for managing large volumes of unstructured data.

Apache Hadoop is an open-source distributed software platform for storing and processing data. Its MapReduce engine shuffles and sorts the outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to schedule tasks, monitor them, and restart any that fail.
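The canonical word-count job illustrates this flow: map tasks emit (word, 1) pairs, the framework shuffles and sorts them, and reduce tasks sum the grouped values. This is a standard Hadoop example sketch, with illustrative class and path names, not code from this paper.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                ctx.write(word, ONE); // emit an intermediate (key, value) pair
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // grouped final result
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The shuffle-and-sort step between map() and reduce() is what groups all (word, 1) pairs for the same word onto one reducer call.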
Apache Hadoop is not actually a single product but a collection of several components. The table below details the Hadoop ecosystem.

Elements and their components:

Distributed Filesystem: Apache HDFS, Ceph File System
Distributed Programming: MapReduce, Pig, Spark
NoSQL Databases: Cassandra, Apache HBase, MongoDB
SQL-on-Hadoop: Apache Hive, Cloudera Impala
Data Ingestion: Apache Flume, Apache Sqoop
Service Programming: Apache ZooKeeper
Scheduling: Apache Oozie
Machine Learning: MLlib, Mahout
Benchmarking: Apache Hadoop benchmarking
Security: Apache Ranger, Apache Knox
System Deployment: Apache Ambari, Cloudera Hue
Applications: PivotalR, Apache Nutch
Development Frameworks: Jumbune
BI Tools: BIRT
ETL: Talend

Hadoop Eco System Testing

Test Approach
Hadoop testers have to learn the components of the Hadoop ecosystem from scratch. Until the market evolves and fully automated testing tools become available for Hadoop validation, the tester has no option but to acquire the same skill set as the Hadoop developer in order to leverage these technologies. When it comes to validation at the map-reduce stage, it definitely helps if the tester has good experience with programming languages, because unlike SQL, where queries can be constructed to work through the data, the MapReduce framework transforms a list of key-value pairs into a list of values.

Testing in the Hadoop ecosystem can be categorized as below:

»» Core components testing (HDFS, MapReduce)
»» Data ingestion testing (Sqoop, Flume)
»» Essential components testing (Hive, Cassandra)

In the first stage, which is the pre-Hadoop process validation, major testing activities include comparing the input files with source-system data to ensure extraction has happened correctly, and confirming that files are loaded correctly into HDFS (the Hadoop Distributed File System). There is a lot of unstructured or semi-structured data at this stage.

The next stage in line is the map-reduce process, which involves running the map-reduce programs to process the incoming data from different sources. The key areas of testing in this stage include validating the business logic on every node and then validating it after running against multiple nodes, making sure that the map-reduce program or process works correctly, that key-value pairs are generated correctly, and validating the data after the map-reduce process. The last step in this stage is to make sure that the output data files are generated correctly and are in the right format.

The third and final stage is the output validation phase. The data output files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on the requirement. Here, the tester needs to ensure that the transformation rules are applied correctly, check the data load in the target system including data integrity, and confirm that there is no data corruption by comparing the target data with the HDFS file-system data.

Challenges, Testing Types and Best Practices

In the traditional approach, there are several challenges in terms of validating data traversal and load testing, because Hadoop involves distributed NoSQL database instances. With the combination of Talend (an open-source big data tool), we can explore the workflow of big data tasks. Following this, you can develop a framework to validate and verify that the workflow and its tasks complete, and identify the testing tool to be used for this operation. Test automation can be a good approach to testing big data implementations: identifying the requirements and building a robust automation framework can help in doing comprehensive testing. However, a lot depends on the skills of the tester and on how the big data environment is set up. In addition to functional testing of big data applications using approaches such as test automation, given the large size of the data there is a definite need for performance and load testing in big data implementations.

In functional testing with Hadoop, testers need to check that data files are correctly processed and loaded into the database, and that after the data is processed the output report is generated properly. They also need to check the business logic on a standalone node and then on multiple nodes, validate the load in the target system, and validate the aggregation of data and data integrity.

Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS. Testing MapReduce therefore requires some skills in white-box testing: QA teams need to validate whether transformation and aggregation are handled correctly by the MapReduce code, and testers need to begin thinking like developers. A good unit testing framework like JUnit or PyUnit can help validate the individual parts of the MapReduce job, but such tests do not exercise the job as a whole, as sketched below.
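As one illustration of JUnit-style unit testing of map and reduce logic, the sketch below uses the Apache MRUnit library to drive the word-count mapper and reducer from the earlier example in isolation. MRUnit and the test data here are illustrative choices on our part, not tools mandated by this paper.

```java
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mrunit.mapreduce.MapDriver;
import org.apache.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountUnitTest {

    // Verifies the mapper alone: one input line should emit one (word, 1) pair per token.
    @Test
    public void mapperEmitsOnePairPerWord() throws Exception {
        MapDriver.newMapDriver(new WordCount.TokenizerMapper())
                .withInput(new LongWritable(0), new Text("big data big"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .withOutput(new Text("big"), new IntWritable(1))
                .runTest();
    }

    // Verifies the reducer alone: grouped values for a key should be summed.
    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new WordCount.IntSumReducer())
                .withInput(new Text("big"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("big"), new IntWritable(2))
                .runTest();
    }
}
```

Note that these tests validate map and reduce logic in isolation; the shuffle, partitioning and multi-node behavior still need end-to-end job runs on a cluster.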
Building a test automation framework using a programming language like Java can help here. The automation framework can focus on the bigger picture pertaining to MapReduce jobs while encompassing the unit tests as well. Hooking the automation framework up to a continuous integration server like Jenkins can be even more helpful. However, building the right framework for big data applications depends on how the test environment is set up, since the processing happens in a distributed manner; there could be a cluster of machines on the QA server where the testing of MapReduce jobs should happen.

Some Important Hadoop Component-Level Test Approaches

YARN

YARN is a cluster and resource management technology. It enables Hadoop clusters to run interactive querying and streaming-data applications simultaneously with MapReduce batch jobs. Testing YARN involves validating whether MapReduce jobs are being distributed across all the data nodes in the cluster.

Apache Hive

Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. Since Hive is recommended for the analysis of terabytes of data, the volume and velocity of big data are extensively covered in Hive testing. From a functional standpoint, Hive testing requires the tester to know HQL (Hive Query Language). It incorporates validation of the successful setup of the Hive metastore database; data integrity between HDFS and Hive, and between Hive and MySQL (the metastore); correctness of the query and data transformation logic; checks on the number of MapReduce jobs triggered by each piece of business logic; export and import of data from and to Hive; and data integrity and redundancy checks when MapReduce jobs fail. A small integrity check of this kind is sketched below.
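For instance, a count-based integrity check between a source system and Hive can be scripted over HiveServer2's JDBC interface. Only the driver class and URL scheme are standard Hive JDBC; the host, credentials, table name and expected count below are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCountCheck {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver; hostname and table are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "qa_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM customer_orders")) {

            rs.next();
            long hiveCount = rs.getLong(1);

            // In a real test this would be fetched from the source RDBMS.
            long sourceCount = 1_000_000L;

            if (hiveCount != sourceCount) {
                throw new AssertionError(
                        "Count mismatch: Hive=" + hiveCount + " source=" + sourceCount);
            }
            System.out.println("Row counts match: " + hiveCount);
        }
    }
}
```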
Apache Hue

Hadoop provides a web interface to make it easy to work with Hadoop data. Hue provides a centralized point of access for components like Hive, Oozie, HBase and HDFS. From a testing point of view, this involves checking whether a user is able to work with all the aforementioned components after logging in to Hue.

Ambari

All administrative tasks (e.g. configuration, starting and stopping services) are done from Ambari Web. The tester needs to check that Ambari is integrated with all other applications, e.g. Nagios, Sqoop, WebHDFS, Pig, Hive, YARN etc.

Flume and Sqoop

Big data platforms are equipped with data ingestion tools such as Flume and Sqoop, which can be used to move data into and out of Hadoop. Instead of writing a stand-alone application to move data to HDFS, these tools can be considered for ingesting data, for example from an RDBMS, since they offer most of the common functions. General QA checkpoints include successfully generating streaming data from web sources using Flume, and checks over data propagation from conventional data storage into Hive and HBase, and vice versa.

Apache Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark is an in-memory data processing framework in which data is divided into smaller RDDs (resilient distributed datasets); Spark's performance is up to 100 times faster than Hadoop MapReduce for some applications. From a QA standpoint, testing involves validating whether the Spark worker nodes are working and processing the streaming data supplied by the Spark job running on the name node. Since Spark integrates with other stores (e.g. Cassandra), it should have appropriate failure-handling capability. Performance is also an important benchmark of a Spark job, as Spark is used as an enhancement over existing MapReduce operations.

Talend

Talend is an ETL tool that simplifies the integration of big data without the need to write or maintain complicated Apache Hadoop code, enabling existing developers to start working with Hadoop and NoSQL databases. Using Talend, data can be transferred between Cassandra, HDFS and Hive. Validating Talend activities involves checking that data loading happens as per the business rules, that counts match, and that the tool appropriately rejects invalid data, replaces it with default values and reports it. The time taken and overall performance are also important while validating the above scenarios.

Apache Oozie

Oozie is a scalable and reliable job-scheduling solution in the Hadoop ecosystem. MapReduce jobs as well as Hive and Pig scripts can be scheduled, along with the job duration. QA activities involve validating an Oozie workflow, including the execution of a workflow job based on a user-defined timeline.

Apache Cassandra

Cassandra is a non-relational, distributed, open-source and horizontally scalable database. It is independent of any specific application or schema and can operate on a variety of platforms and operating systems. A NoSQL database tester will need to acquire knowledge of CQL (Cassandra Query Language) in order to perform quality testing. QA areas for Cassandra include data type checks, count checks, CRUD operation checks, timestamp and timestamp-format checks, checks related to cluster failure handling, and data integrity and redundancy checks on task failure; a few of these are sketched below.
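As a sketch of such CQL-level checks, the snippet below uses the DataStax Java driver (3.x API) to run a count check and a small CRUD-and-timestamp probe. The contact point, keyspace and events table are hypothetical assumptions for illustration.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class CassandraQaChecks {
    public static void main(String[] args) {
        // Hypothetical contact point and keyspace for the QA cluster.
        try (Cluster cluster = Cluster.builder().addContactPoint("cassandra-qa").build();
             Session session = cluster.connect("qa_keyspace")) {

            // Count check, to be compared against the source-side total.
            long rows = session.execute("SELECT COUNT(*) FROM events").one().getLong(0);
            System.out.println("events row count: " + rows);

            // CRUD probe: insert a row, then read a timestamp back to verify its format.
            session.execute("INSERT INTO events (id, payload, created_at) "
                    + "VALUES (uuid(), 'qa-probe', toTimestamp(now()))");
            ResultSet rs = session.execute("SELECT created_at FROM events LIMIT 1");
            System.out.println("sample created_at: " + rs.one().getTimestamp("created_at"));
        }
    }
}
```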
Jasper Report

Jasper Report is integrated with the data lake layer to fetch the required data (from Hive). Reports designed using Jaspersoft Studio are deployed on the Jasper Report Server, and analytical and transactional data coming from the Hive database is used by the Jasper report designer to generate complex reports. The testing comprises the following:

»» Jaspersoft Studio is installed properly.
»» Hive is properly integrated with Jaspersoft Studio and the Jasper Report Server via a JDBC connection.
»» Reports are exported correctly in the specified format.
»» Auto-complete on the login form, password expiration days and the allow-user-password-change criteria behave as configured.
»» A user without the admin role cannot create a new user or a new role.

KNIME Testing

KNIME (the Konstanz Information Miner) is an open-source data analytics, reporting and integration platform. We have integrated and tested KNIME with various components (e.g. R, Hive, Jasper Report Server). Testing includes:

»» Proper KNIME installation and configuration.
»» (KNIME + R) integration with Hive.
»» Testing KNIME analytical results using R scripts.
»» Checking that reports exported from KNIME analytical results, drawn from the Hive database, appear in the specified format in Jasper Report.

Nagios Testing

Nagios enables health checks of Hadoop and other components, and checks whether the platform is running according to normal behavior. We need to configure all the services in Nagios so that we can check their health and performance from the Nagios web portal. The health checks include (a minimal connectivity probe is sketched after this list):

»» whether a process is running
»» whether a service is running
»» whether the service is accepting connections
»» storage capacity on the data nodes
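The "service accepting connections" check, for example, boils down to a TCP connect probe. The sketch below shows that idea in plain Java; in practice Nagios plugins perform this probe, and the host names and ports here are illustrative.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ServiceConnectivityCheck {
    // Returns true if host:port accepts a TCP connection within the timeout,
    // mirroring the "is the service accepting connections" health check.
    static boolean accepting(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoints for a NameNode RPC port and HiveServer2.
        System.out.println("NameNode RPC: " + accepting("namenode", 8020, 2000));
        System.out.println("HiveServer2:  " + accepting("hive-server", 10000, 2000));
    }
}
```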
Testing Hadoop in Cloud Environment
Before Testing Hadoop in the Cloud:
1. Document the high-level cloud test infrastructure (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the data security plan
4. Document the high-level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third-party tools

How Atos Is Using Hadoop

Information culture is changing, leading to increased volume, variety and velocity.