Whitepaper
Testing Big Data Using the Hadoop Ecosystem
Table of contents

What is Big Data?
Hadoop and Big Data
Hadoop Explained
Hadoop Architecture
Hadoop Eco System Testing
Testing Hadoop in Cloud Environment
How Atos Is Using Hadoop
Testing Accomplishments
Conclusion
Appendices

What is Big Data?

Big data is the term for a collection of large datasets that cannot be processed using traditional computing techniques. Enterprise systems generate huge amounts of data, from terabytes to even petabytes of information. Big data is not merely data; it has become a complete subject, which involves various tools, techniques and frameworks. Specifically, big data relates to data creation, storage, retrieval and analysis that is remarkable in terms of volume, velocity and variety.

Hadoop and Big Data

Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to interpret or parse the results of big data searches through specific proprietary algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. Apache Hadoop is 100% open source and pioneered a fundamentally new way of storing and processing data: instead of relying on expensive, proprietary hardware and different systems to store and process data, it distributes both storage and processing across clusters of inexpensive, industry-standard servers.
Author profile

Padma Samvaba Panda is a test manager in the Atos Testing Practice. He has 10 years of IT experience encompassing software quality control, testing, requirement analysis and professional services. During his diversified career, he has delivered multifaceted software projects in a wide array of domains and in specialized testing areas such as big data, CRM, ERP and data migration. He is responsible for the quality and testing processes and for strategizing tools for big data implementations. He is a CSM (Certified Scrum Master) and is ISTQB certified. He can be reached at [email protected]

Hadoop Explained

Apache Hadoop runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.

The Apache Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding DataNodes, and a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
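To make the HDFS APIs concrete, here is a minimal sketch of writing and reading a file through the org.apache.hadoop.fs.FileSystem API. The NameNode address, path and replication factor are illustrative assumptions, not values taken from this paper; in practice fs.defaultFS comes from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode endpoint for illustration only.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/qa/sample.txt");

        // Write a file; HDFS splits it into blocks behind the scenes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask for three replicas per block, matching the replication story above.
        fs.setReplication(path, (short) 3);

        // Read the file back in the same way a MapReduce task would.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```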
Hadoop Architecture

As Google, Facebook, Twitter and other companies extended their services to web scale, the amount of data they collected routinely from user interactions online would have overwhelmed the capabilities of traditional IT architectures. So they built their own, and they released the code for many of the components into open source. Of these components, Apache Hadoop has rapidly emerged as the de facto standard for managing large volumes of unstructured data.

Apache Hadoop is an open-source distributed software platform for storing and processing data. Its MapReduce engine shuffles and sorts the outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to schedule tasks, monitor them, and restart any that fail.
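The canonical word-count job illustrates this flow: map tasks emit (word, 1) pairs, the framework shuffles and sorts them, and reduce tasks sum the grouped values. This is a standard Hadoop example sketch, with illustrative class and path names, not code from this paper.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                ctx.write(word, ONE); // emit an intermediate (key, value) pair
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // grouped final result
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The shuffle-and-sort step between map() and reduce() is what groups all (word, 1) pairs for the same word onto one reducer call.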
Apache Hadoop is not actually a single product but a collection of several components. The table below details the Hadoop ecosystem.

Elements and their components:

Distributed Filesystem: Apache HDFS, Ceph File System
Distributed Programming: MapReduce, Pig, Spark
NoSQL Databases: Cassandra, Apache HBase, MongoDB
SQL-on-Hadoop: Apache Hive, Cloudera Impala
Data Ingestion: Apache Flume, Apache Sqoop
Service Programming: Apache ZooKeeper
Scheduling: Apache Oozie
Machine Learning: MLlib, Mahout
Benchmarking: Apache Hadoop benchmarking
Security: Apache Ranger, Apache Knox
System Deployment: Apache Ambari, Cloudera Hue
Applications: PivotalR, Apache Nutch
Development Frameworks: Jumbune
BI Tools: BIRT
ETL: Talend

Hadoop Eco System Testing

Test Approach
Hadoop testers have to learn the components of the Hadoop ecosystem from scratch. Until the market evolves and fully automated testing tools become available for Hadoop validation, the tester has no option but to acquire the same skill set as the Hadoop developer in order to leverage these technologies. When it comes to validation at the map-reduce stage, it definitely helps if the tester has good experience with programming languages, because unlike SQL, where queries can be constructed to work through the data, the MapReduce framework transforms a list of key-value pairs into a list of values.

Testing in the Hadoop ecosystem can be categorized as below:

»» Core components testing (HDFS, MapReduce)
»» Data ingestion testing (Sqoop, Flume)
»» Essential components testing (Hive, Cassandra)

In the first stage, which is the pre-Hadoop process validation, major testing activities include comparing the input files with source-system data to ensure extraction has happened correctly, and confirming that files are loaded correctly into HDFS (the Hadoop Distributed File System). There is a lot of unstructured or semi-structured data at this stage.

The next stage in line is the map-reduce process, which involves running the map-reduce programs to process the incoming data from different sources. The key areas of testing in this stage include validating the business logic on every node and then validating it after running against multiple nodes, making sure that the map-reduce program or process works correctly, that key-value pairs are generated correctly, and validating the data after the map-reduce process. The last step in this stage is to make sure that the output data files are generated correctly and are in the right format.

The third and final stage is the output validation phase. The data output files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on the requirement. Here, the tester needs to ensure that the transformation rules are applied correctly, check the data load in the target system including data integrity, and confirm that there is no data corruption by comparing the target data with the HDFS file-system data.

Challenges, Testing Types and Best Practices

In the traditional approach, there are several challenges in terms of validating data traversal and load testing, because Hadoop involves distributed NoSQL database instances. With the combination of Talend (an open-source big data tool), we can explore the workflow of big data tasks. Following this, you can develop a framework to validate and verify that the workflow and its tasks complete, and identify the testing tool to be used for this operation. Test automation can be a good approach to testing big data implementations: identifying the requirements and building a robust automation framework can help in doing comprehensive testing. However, a lot depends on the skills of the tester and on how the big data environment is set up. In addition to functional testing of big data applications using approaches such as test automation, given the large size of the data there is a definite need for performance and load testing in big data implementations.

In functional testing with Hadoop, testers need to check that data files are correctly processed and loaded into the database, and that after the data is processed the output report is generated properly. They also need to check the business logic on a standalone node and then on multiple nodes, validate the load in the target system, and validate the aggregation of data and data integrity.

Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS. Testing MapReduce therefore requires some skills in white-box testing: QA teams need to validate whether transformation and aggregation are handled correctly by the MapReduce code, and testers need to begin thinking like developers. A good unit testing framework like JUnit or PyUnit can help validate the individual parts of the MapReduce job, but such tests do not exercise the job as a whole, as sketched below.
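As one illustration of JUnit-style unit testing of map and reduce logic, the sketch below uses the Apache MRUnit library to drive the word-count mapper and reducer from the earlier example in isolation. MRUnit and the test data here are illustrative choices on our part, not tools mandated by this paper.

```java
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mrunit.mapreduce.MapDriver;
import org.apache.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountUnitTest {

    // Verifies the mapper alone: one input line should emit one (word, 1) pair per token.
    @Test
    public void mapperEmitsOnePairPerWord() throws Exception {
        MapDriver.newMapDriver(new WordCount.TokenizerMapper())
                .withInput(new LongWritable(0), new Text("big data big"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .withOutput(new Text("big"), new IntWritable(1))
                .runTest();
    }

    // Verifies the reducer alone: grouped values for a key should be summed.
    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new WordCount.IntSumReducer())
                .withInput(new Text("big"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("big"), new IntWritable(2))
                .runTest();
    }
}
```

Note that these tests validate map and reduce logic in isolation; the shuffle, partitioning and multi-node behavior still need end-to-end job runs on a cluster.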
Building a test automation framework using a programming language like Java can help here. The automation framework can focus on the bigger picture pertaining to MapReduce jobs while encompassing the unit tests as well. Hooking the automation framework up to a continuous integration server like Jenkins can be even more helpful. However, building the right framework for big data applications depends on how the test environment is set up, since the processing happens in a distributed manner; there could be a cluster of machines on the QA server where the testing of MapReduce jobs should happen.

Some Important Hadoop Component-Level Test Approaches

YARN

YARN is a cluster and resource management technology. It enables Hadoop clusters to run interactive querying and streaming-data applications simultaneously with MapReduce batch jobs. Testing YARN involves validating whether MapReduce jobs are being distributed across all the data nodes in the cluster.

Apache Hive

Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. Since Hive is recommended for the analysis of terabytes of data, the volume and velocity of big data are extensively covered in Hive testing. From a functional standpoint, Hive testing requires the tester to know HQL (Hive Query Language). It incorporates validation of the successful setup of the Hive metastore database; data integrity between HDFS and Hive, and between Hive and MySQL (the metastore); correctness of the query and data transformation logic; checks on the number of MapReduce jobs triggered by each piece of business logic; export and import of data from and to Hive; and data integrity and redundancy checks when MapReduce jobs fail. A small integrity check of this kind is sketched below.
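For instance, a count-based integrity check between a source system and Hive can be scripted over HiveServer2's JDBC interface. Only the driver class and URL scheme are standard Hive JDBC; the host, credentials, table name and expected count below are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCountCheck {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver; hostname and table are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "qa_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM customer_orders")) {

            rs.next();
            long hiveCount = rs.getLong(1);

            // In a real test this would be fetched from the source RDBMS.
            long sourceCount = 1_000_000L;

            if (hiveCount != sourceCount) {
                throw new AssertionError(
                        "Count mismatch: Hive=" + hiveCount + " source=" + sourceCount);
            }
            System.out.println("Row counts match: " + hiveCount);
        }
    }
}
```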
Apache Hue

Hadoop provides a web interface to make it easy to work with Hadoop data. Hue provides a centralized point of access for components like Hive, Oozie, HBase and HDFS. From a testing point of view, this involves checking whether a user is able to work with all the aforementioned components after logging in to Hue.

Ambari

All administrative tasks (e.g. configuration, starting and stopping services) are done from Ambari Web. The tester needs to check that Ambari is integrated with all other applications, e.g. Nagios, Sqoop, WebHDFS, Pig, Hive, YARN etc.

Flume and Sqoop

Big data platforms are equipped with data ingestion tools such as Flume and Sqoop, which can be used to move data into and out of Hadoop. Instead of writing a stand-alone application to move data to HDFS, these tools can be considered for ingesting data, for example from an RDBMS, since they offer most of the common functions. General QA checkpoints include successfully generating streaming data from web sources using Flume, and checks over data propagation from conventional data storage into Hive and HBase, and vice versa.

Apache Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark is an in-memory data processing framework in which data is divided into smaller RDDs (resilient distributed datasets); Spark's performance is up to 100 times faster than Hadoop MapReduce for some applications. From a QA standpoint, testing involves validating whether the Spark worker nodes are working and processing the streaming data supplied by the Spark job running on the name node. Since Spark integrates with other stores (e.g. Cassandra), it should have appropriate failure-handling capability. Performance is also an important benchmark of a Spark job, as Spark is used as an enhancement over existing MapReduce operations.

Talend

Talend is an ETL tool that simplifies the integration of big data without the need to write or maintain complicated Apache Hadoop code, enabling existing developers to start working with Hadoop and NoSQL databases. Using Talend, data can be transferred between Cassandra, HDFS and Hive. Validating Talend activities involves checking that data loading happens as per the business rules, that counts match, and that the tool appropriately rejects invalid data, replaces it with default values and reports it. The time taken and overall performance are also important while validating the above scenarios.

Apache Oozie

Oozie is a scalable and reliable job-scheduling solution in the Hadoop ecosystem. MapReduce jobs as well as Hive and Pig scripts can be scheduled, along with the job duration. QA activities involve validating an Oozie workflow, including the execution of a workflow job based on a user-defined timeline.

Apache Cassandra

Cassandra is a non-relational, distributed, open-source and horizontally scalable database. It is independent of any specific application or schema and can operate on a variety of platforms and operating systems. A NoSQL database tester will need to acquire knowledge of CQL (Cassandra Query Language) in order to perform quality testing. QA areas for Cassandra include data type checks, count checks, CRUD operation checks, timestamp and timestamp-format checks, checks related to cluster failure handling, and data integrity and redundancy checks on task failure; a few of these are sketched below.
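As a sketch of such CQL-level checks, the snippet below uses the DataStax Java driver (3.x API) to run a count check and a small CRUD-and-timestamp probe. The contact point, keyspace and events table are hypothetical assumptions for illustration.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class CassandraQaChecks {
    public static void main(String[] args) {
        // Hypothetical contact point and keyspace for the QA cluster.
        try (Cluster cluster = Cluster.builder().addContactPoint("cassandra-qa").build();
             Session session = cluster.connect("qa_keyspace")) {

            // Count check, to be compared against the source-side total.
            long rows = session.execute("SELECT COUNT(*) FROM events").one().getLong(0);
            System.out.println("events row count: " + rows);

            // CRUD probe: insert a row, then read a timestamp back to verify its format.
            session.execute("INSERT INTO events (id, payload, created_at) "
                    + "VALUES (uuid(), 'qa-probe', toTimestamp(now()))");
            ResultSet rs = session.execute("SELECT created_at FROM events LIMIT 1");
            System.out.println("sample created_at: " + rs.one().getTimestamp("created_at"));
        }
    }
}
```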
Jasper Report

Jasper Report is integrated with the data lake layer to fetch the required data (from Hive). Reports designed using Jaspersoft Studio are deployed on the Jasper Report Server, and analytical and transactional data coming from the Hive database is used by the Jasper report designer to generate complex reports. The testing comprises the following:

»» Jaspersoft Studio is installed properly.
»» Hive is properly integrated with Jaspersoft Studio and the Jasper Report Server via a JDBC connection.
»» Reports are exported correctly in the specified format.
»» Auto-complete on the login form, password expiration days and the allow-user-password-change criteria behave as configured.
»» A user without the admin role cannot create a new user or a new role.

KNIME Testing

KNIME (the Konstanz Information Miner) is an open-source data analytics, reporting and integration platform. We have integrated and tested KNIME with various components (e.g. R, Hive, Jasper Report Server). Testing includes:

»» Proper KNIME installation and configuration.
»» (KNIME + R) integration with Hive.
»» Testing KNIME analytical results using R scripts.
»» Checking that reports exported from KNIME analytical results, drawn from the Hive database, appear in the specified format in Jasper Report.

Nagios Testing

Nagios enables health checks of Hadoop and other components, and checks whether the platform is running according to normal behavior. We need to configure all the services in Nagios so that we can check their health and performance from the Nagios web portal. The health checks include (a minimal connectivity probe is sketched after this list):

»» whether a process is running
»» whether a service is running
»» whether the service is accepting connections
»» storage capacity on the data nodes
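The "service accepting connections" check, for example, boils down to a TCP connect probe. The sketch below shows that idea in plain Java; in practice Nagios plugins perform this probe, and the host names and ports here are illustrative.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ServiceConnectivityCheck {
    // Returns true if host:port accepts a TCP connection within the timeout,
    // mirroring the "is the service accepting connections" health check.
    static boolean accepting(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoints for a NameNode RPC port and HiveServer2.
        System.out.println("NameNode RPC: " + accepting("namenode", 8020, 2000));
        System.out.println("HiveServer2:  " + accepting("hive-server", 10000, 2000));
    }
}
```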
Testing Hadoop in Cloud Environment
Before Testing Hadoop in the Cloud:
1. Document the high-level cloud test infrastructure (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the data security plan
4. Document the high-level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third-party tools

How Atos Is Using Hadoop

Information culture is changing, leading to increased volume, variety and velocity.