J.B. INSTITUTE OF ENGINEERING AND TECHNOLOGY (UGC AUTONOMOUS) Bhaskar Nagar, Moinabad Mandal, R.R. District, Hyderabad -500075 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BIG DATA ANALYTICS

LECTURE NOTES

R16

B. TECH IV YEAR – I SEM (Sec- A&B) Academic Year 2020-21

Prepared & Compiled by

DR.G. ARUN SAMPAUL THOMAS, ASSOCIATE PROFESSOR, DEPARTMENT OF CSE J.B.I.E.T Bhaskar Nagar, Yenkapally(V), Moinabad(M), Ranga Reddy(D), Hyderabad – 500 075, Telangana, India.

J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY (UGC AUTONOMOUS) BIG DATA ANALYTICS (Professional Elective-III) B.Tech CSE, IV Year – I Semester. L: 4, T-P-D: 0-0-0, C: 4

Course Objectives: • To understand big data • To learn the analytics of big data • To understand MapReduce fundamentals

UNIT-I Big Data Analytics: What is big data, History of Data Management; Structuring Big Data ; Elements of Big Data ; Big Data Analytics; Distributed and Parallel Computing for Big Data; Big Data Analytics: What is Big Data Analytics, What Big Data Analytics Isn’t, Why this sudden Hype Around Big Data Analytics, Classification of Analytics, Greatest Challenges that Prevent Business from Capitalizing Big Data; Top Challenges Facing Big Data; Why Big Data Analytics Important; Data Science; Data Scientist; Terminologies used in Big Data Environments; Basically Available Soft State Eventual Consistency (BASE); Open source Analytics Tools

UNIT-II: Understanding Analytics and Big Data: Comparing Reporting and Analysis, Types of Analytics; Points to Consider during Analysis; Developing an Analytic Team; Understanding Text Analytics; Analytical Approach and Tools to Analyze Data: Analytical Approaches; History of Analytical Tools; Introducing Popular Analytical Tools; Comparing Various Analytical Tools.

UNIT-III: Understanding MapReduce Fundamentals and HBase: The MapReduce Framework; Techniques to Optimize MapReduce Jobs; Uses of MapReduce; Role of HBase in Big Data Processing; Storing Data in Hadoop; Introduction of HDFS: Architecture, HDFS Files, File system types, commands, org.apache.hadoop.io package, HDFS High Availability; Introducing HBase, Architecture, Storing Big Data with HBase, Interacting with the Hadoop Ecosystem; HBase in Operation – Programming with HBase; Installation, Combining HBase and HDFS

UNIT-IV: Big Data Technology Landscape and Hadoop: NoSQL, Hadoop; RDBMS versus Hadoop; Distributed Computing Challenges; History of Hadoop; Hadoop Overview; Use Case of Hadoop; Hadoop Distributors; HDFS (Hadoop Distributed File System): HDFS Daemons, read, write, replica; Processing of Data with Hadoop; Managing Resources and Applications with Hadoop YARN

UNIT-V: Social Media Analytics and Text Mining: Introducing Social Media; Key Elements of Social Media; Text Mining; Understanding the Text Mining Process; Sentiment Analysis, Performing Social Media Analytics and Opinion Mining on Tweets;


Mobile Analytics: Introducing Mobile Analytics; Define Mobile Analytics; Mobile Analytics and Web Analytics; Types of Results from Mobile Analytics; Types of Applications for Mobile Analytics; Introducing Mobile Analytics Tools

TEXT BOOKS: 1. Big Data and Analytics, Seema Acharya and Subhashini Chellappan, Wiley Publications. 2. Big Data, Black Book, DreamTech Press, 2015 Edition. 3. Business Analytics, 5th Edition, by Albright and Winston.

REFERENCE BOOKS: 1. Rajiv Sabherwal, Irma Becerra-Fernandez, "Business Intelligence – Practices, Technologies and Management", John Wiley, 2011. 2. Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap", Addison-Wesley IT Service. 3. Yuli Vasiliev, "Oracle Business Intelligence: The Condensed Guide to Analysis and Reporting", SPD Shroff, 2012.


BDA - UNIT-I

Topics Covered:

1.1. What is Big Data? 1.2. History of data management 1.3. Structuring big data 1.4. Elements of big data 1.5. Capitalizing on big data; distributed and parallel computing as applicable to big data 1.6. What is Big Data Analytics? 1.7. Classifications of analytics 1.8. Challenges facing big data 1.9. Data science and data scientists 1.10. BASE 1.11. Open source tools used in data analytics

1.1. What is Big Data?

Big data is not a single technology but a combination of old and new technologies that helps companies gain actionable insight. Therefore, big data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction. As we note earlier in this chapter, big data is typically broken down by three characteristics:

✓ Volume: How much data ✓ Velocity: How fast that data is processed ✓ Variety: The various types of data

Although it’s convenient to simplify big data into the three Vs, it can be misleading and overly simplistic. For example, you may be managing a relatively small amount of very disparate, complex data or you may be processing a huge volume of very simple data. That simple data may be all structured or all unstructured. Even more important is the fourth V: veracity. How accurate is that data in predicting business value? Do the results of a big data analysis actually make sense? It is critical that you don’t underestimate the task at hand. Data must be able to be verified based on both accuracy and context.

An innovative business may want to be able to analyze massive amounts of data in real time to quickly assess the value of a customer and the potential to provide additional offers to that customer. It is necessary to identify the right amount and types of data that can be analyzed to impact business outcomes. Big data incorporates all data, including structured data and unstructured data from e-mail, social media, text streams, and more. This kind of data management requires that companies leverage both their structured and unstructured data.

• 1. Volume: giga (10^9) > tera (10^12) > peta (10^15) > exa (10^18) > zetta (10^21) > yotta (10^24) bytes
• 2. Velocity: batch processing > periodic > real-time processing (Mbps)
• 3. Variety: structured + semi-structured + unstructured data
• 4. Veracity: all the data may not be relevant to the problem
• 5. Validity: all the data may not be accurate
• 6. Volatility: the data may not be valid for long periods
• 7. Variability: the rate of data flow may not be constant

• Velocity means that data is generated extremely fast and often continuously processed, like live streaming social media data. • Volume simply means large amounts that cannot be processed fast enough by one’s existing computing system, like gigabytes and terabytes of data. • Variety means different types of data, like a large dataset in an Excel sheet, text, videos from CCTV cameras, energy data, internet data, email, Facebook posts, etc.

1.2. History of data management
• Before the 1970s: only storage of primitive and structured data; storage-intensive management was involved. Mainframes were used.
• 1980s and 1990s: structured relational databases evolved; storage- and data-intensive application management was required.
• 2000s and beyond: the web and IoT caused the evolution of unstructured multimedia data.
• A database is a collection of information. A Database Management System (DBMS) can access the data and pull out specific information.
• In 1890: Herman Hollerith is given credit for adapting punch cards to act as memory.
• In 1960: Charles W. Bachman designed the Integrated Data Store, the “first” DBMS. IBM created a database system of its own, known as IMS.
• In 1971: the CODASYL Database Task Group published a standard for database languages built on COBOL (Common Business Oriented Language).
• In 1974: IBM developed SQL, which was more advanced.
• In the 1980s-90s: RDBM systems like Oracle, MS SQL Server, DB2, MySQL and Teradata became very popular, leading to the development of enterprise resource planning (ERP) and CRM systems.
• RDBMS were efficient at storing and processing structured data.


• In the 2000s and beyond: due to the explosion of the internet, processing speeds were required to be faster, and “unstructured” data (art, photographs, music, etc.) became much more commonplace. Unstructured data is both non-relational and schema-less, and relational database management systems simply were not designed to handle this kind of data.
• NoSQL databases are primarily called non-relational or distributed databases.
• SQL databases are table-based and represent data (schema) in the form of rows and columns, whereas NoSQL databases are collections of documents, key-value pairs, graph databases or wide-column stores, which do not have such standard schema definitions to adhere to but have a dynamic schema for unstructured data.
• NoSQL (“Not only” Structured Query Language) came about as a response to the Internet and the need for faster speed and the processing of unstructured data.
• NoSQL databases are preferable in certain use cases to relational databases because of their speed and flexibility.
• The NoSQL model is non-relational and uses a “distributed” database system.
• This non-relational system is fast, uses an ad-hoc method of organizing data, and processes high volumes of different kinds of data.
• “Not only” does it handle structured and unstructured data, it can also process unstructured big data very quickly.
• NoSQL is not faster than SQL, nor is SQL faster than NoSQL. They are each different technologies suited to different work. No RDBMS (whether we are discussing SQL/relational vs distributed/NoSQL) is “magic”: in effect, all of them work with files.
• The widespread use of NoSQL can be connected to the services offered by companies such as Twitter and LinkedIn.
• NoSQL databases are designed with a distributed architecture that includes redundant backup storage of both data and functions. They do this by using multiple nodes (database servers). If one or more of the nodes goes down, the other nodes can continue with normal operations and suffer no data loss. When used correctly, NoSQL databases can provide high performance at an extremely large scale, and never shut down.
• Types of NoSQL databases. There are 4 basic types of NoSQL databases (a small sketch after the advantages/disadvantages list below illustrates each one):
  - Key-Value Store: a big hash table of keys and values {Example: Amazon S3 (Dynamo)}
  - Document-based Store: stores documents made up of tagged elements {Example: CouchDB}
  - Column-based Store: each storage block contains data from only one column {Example: HBase, Cassandra}
  - Graph-based: a network database that uses edges and nodes to represent and store data {Example: Neo4j}
Advantages and disadvantages of NoSQL over SQL and RDBM systems:

Ø Higher scalability
Ø A distributed computing system
Ø Lower costs
Ø A flexible schema
Ø Can process unstructured and semi-structured data
Ø No complex relationships

Disadvantages of NoSQL databases:
Ø It is resource intensive, demanding high RAM and CPU allocations.
Ø It can also be difficult to find tech support if your open source NoSQL system goes down.
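The following is an illustrative sketch only (plain Python structures, no real database client; the record fields are hypothetical) showing how the same customer record might be modelled in each of the four NoSQL types:

```python
# Illustrative sketch: one customer record in each of the four NoSQL data models.
import json

# 1. Key-value store: an opaque value looked up by a single key.
kv_store = {"customer:1001": json.dumps({"name": "Asha", "city": "Hyderabad"})}

# 2. Document store: a self-describing JSON document with a flexible schema.
document = {
    "_id": "1001",
    "name": "Asha",
    "city": "Hyderabad",
    "orders": [{"item": "laptop", "qty": 1}],   # nested data, no fixed columns
}

# 3. Wide-column store: rows addressed by a row key, values grouped in column families.
wide_column_row = {
    "row_key": "1001",
    "personal": {"name": "Asha", "city": "Hyderabad"},   # column family 1
    "activity": {"last_login": "2020-08-15"},            # column family 2
}

# 4. Graph store: entities as nodes, relationships as edges.
nodes = {"1001": {"type": "customer"}, "p55": {"type": "product"}}
edges = [("1001", "PURCHASED", "p55")]

print(json.loads(kv_store["customer:1001"])["name"])   # Asha
```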


THE PROBLEM WITH LATENCY FOR BIG DATA One of the perennial problems with managing data — especially large quantities of data — has been the impact of latency. Latency is the delay within a system based on delays in execution of a task. Latency is an issue in every aspect of computing, including communications, data management, system performance, and more.

If you have ever used a wireless phone, you have experienced latency firsthand. It is the delay in the transmissions between you and your caller. At times, latency has little impact on the task at hand; for big data applications that depend on real-time analysis and reaction, however, even small delays matter.

1.3. Big data structuring

Comparison of structured, semi-structured and unstructured data:

1. Data model: Structured data conforms to a data model, and relationships exist between data items; an RDBMS conforms to the relational model, wherein data is stored in rows and columns. Semi-structured data does not conform to a data model, but some structure exists; it uses tags to segregate semantic elements. Unstructured data does not conform to any data model.

2. Ease of use: Structured data can be easily used by a computer. Semi-structured and unstructured data cannot be easily used by a computer.

3. Examples: Structured: data stored in databases. Semi-structured: emails, XML, HTML. Unstructured: memos, chats, PowerPoint presentations, images, videos, letters, research papers, the body of an email.

4. Sources: Structured: online transaction processing systems. Semi-structured: XML, JSON.

5. Other characteristics: Structured data offers ease of input/output, security, indexing/searching, scalability and transaction processing. Semi-structured data has an inconsistent structure, label/value pairs, and schema information blended with data values.

q Data structures are the programmatic way of storing data so that data can be used efficiently. q Almost every enterprise application uses various types of data structures in one way or the other. q A data structure is a systematic way to organize data in order to use it efficiently. q The following are the foundational terms of a data structure. Ø Interface (functions) ü Each data structure has an interface. ü The interface represents the set of operations that a data structure supports. ü An interface only provides o the list of supported operations, o the type of parameters they can accept, o the return type of these operations. Ø Implementation ü The implementation provides the internal representation of a data structure. ü The implementation also provides the definition of the algorithms used in the operations of the data structure.
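A minimal sketch of the interface/implementation distinction, using a hypothetical Stack example (not from the syllabus, purely illustrative):

```python
from abc import ABC, abstractmethod

class Stack(ABC):
    """Interface: lists the supported operations, their parameters and return types."""

    @abstractmethod
    def push(self, item) -> None: ...

    @abstractmethod
    def pop(self): ...

class ListStack(Stack):
    """Implementation: the internal representation (a Python list) and its algorithms."""

    def __init__(self):
        self._items = []

    def push(self, item) -> None:
        self._items.append(item)      # append at the end: O(1)

    def pop(self):
        return self._items.pop()      # remove from the end: O(1)

s = ListStack()
s.push(10); s.push(20)
print(s.pop())   # 20
```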

Structuring big data • As applied to big data, the idea, therefore, is to take unstructured information, process it according to requirements and then store it in a suitable data structure as structured data. • This is where the necessity of developing a new framework for structuring big data comes in. • Hadoop is one such platform that facilitates the distribution and storage of unstructured data.

Need for Data Structures q As applications are getting complex and data rich, there are three common problems that applications face nowadays. Ø Data search ü Consider an inventory of 1 million (10^6) items in a store. ü If the application has to search for an item, it has to search among 1 million (10^6) items every time, slowing down the search. ü As data grows, search becomes slower. Ø Processor speed ü Processor speed, although very high, falls short if the data grows to billions of records. Ø Multiple requests ü As thousands of users can search data simultaneously on a web server, even a fast server fails while searching the data. q To solve the above-mentioned problems, data structures come to the rescue. q Data can be organized in a data structure in such a way that not all items need to be searched, and the required data can be found almost instantly.
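A rough sketch of why organizing data matters for search, assuming a made-up inventory of 1 million items: a linear scan touches every record, while a dictionary (hash index) finds the same record almost instantly.

```python
import time, random

# Hypothetical inventory of 1 million items, each identified by an item id.
inventory = [{"id": i, "name": f"item-{i}"} for i in range(1_000_000)]

def linear_search(items, item_id):
    # Scans every record until a match is found: O(n) per query.
    for record in items:
        if record["id"] == item_id:
            return record
    return None

# Organizing the same data in a dictionary (hash index) makes lookups O(1) on average.
index = {record["id"]: record for record in inventory}

target = random.randrange(1_000_000)

t0 = time.perf_counter(); linear_search(inventory, target); t1 = time.perf_counter()
t2 = time.perf_counter(); index[target];                    t3 = time.perf_counter()

print(f"linear scan: {t1 - t0:.4f}s, indexed lookup: {t3 - t2:.6f}s")
```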

Characteristics of a Data Structure q Correctness Ø The data structure implementation should implement its interface correctly. q Time complexity Ø The running time or execution time of the operations of the data structure must be as small as possible. It is expressed as a function f(n) of the input size n. q Space complexity Ø Memory usage of a data structure operation should be as little as possible.

Basic Terminology q Data Ø Data are values or set of values. q Data Item Ø Data item refers to single unit of values. q Group Items Ø Data items that are divided into sub items are called as Group Items. q Elementary Items Ø Data items that cannot be divided are called as Elementary Items. q Attribute and Entity Ø An entity is that which contains certain attributes or properties, which may be assigned values. q Entity Set Ø Entities of similar attributes form an entity set. q Field Ø Field is a single elementary unit of information representing an attribute of an entity. q Record Ø Record is a collection of field values of a given entity. q File Ø File is a collection of records of the entities in a given entity set.

Data Structures - Algorithms Basics q Algorithm is a step-by-step procedure, which defines a set of instructions to be executed in a certain order to get the desired output. q Algorithms are generally created independent of underlying languages, i.e. an algorithm can be implemented in more than one programming language. q From the data structure point of view, following are some important categories of algorithms Ø Search


ü Algorithm to search an item in a data structure. Ø Sort ü Algorithm to sort items in a certain order. Ø Insert ü Algorithm to insert item in a data structure.

Ø Update ü Algorithm to update an existing item in a data structure. Ø Delete ü Algorithm to delete an existing item from a data structure.
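A tiny sketch of these five algorithm categories applied to a simple sorted list (an illustration only, not a production data structure):

```python
# The five operation categories on a simple sorted list of values.
import bisect

records = [3, 8, 15, 42]        # kept sorted so search can use binary search

# Search: O(log n) binary search on the sorted list.
pos = bisect.bisect_left(records, 15)
found = pos < len(records) and records[pos] == 15

# Insert: place a new item at its sorted position.
bisect.insort(records, 10)      # records -> [3, 8, 10, 15, 42]

# Update: replace an existing item in place.
records[records.index(8)] = 9

# Delete: remove an existing item.
records.remove(42)

# Sort: re-establish order if updates disturbed it.
records.sort()
print(found, records)           # True [3, 9, 10, 15]
```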


Technologies used in big data environments 1. In-memory analytics: data is pre-processed and the relevant data is stored in memory (RAM) so it can be queried without going to disk. 2. In-database processing 3. Symmetric multiprocessing (SMP) systems 4. Massively Parallel Processing (MPP): processing of applications by segmenting the programs and allocating the segments to a number of processors in parallel, as sketched below. Each processor may have its own OS and dedicated memory. Segments of a program communicate using a messaging interface.
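The sketch below only imitates the massively-parallel idea on one machine: the input is split into segments and each segment is handed to a separate worker process with its own memory. Real MPP systems do this across many machines with their own operating systems; Python's multiprocessing module is used here purely for illustration.

```python
from multiprocessing import Pool

def process_segment(segment):
    # Each worker handles its own segment independently, e.g. summing its values.
    return sum(segment)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    size = len(data) // n_workers
    segments = [data[i * size:(i + 1) * size] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partial_sums = pool.map(process_segment, segments)   # parallel "map" over segments

    print(sum(partial_sums))   # combining the partial results is the "reduce" step
```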

1.4. Elements of big data

These 4 elements of big data reflect the tasks involved in using big data for business intelligence. 1. Data collection: deals with how to collect such big data (with its characteristic 5 Vs) from multiple, geographically separated sources. 2. Data storage: where and how to store and retrieve such data, which cannot be accommodated on one server/memory. 3. Data analysis: how to process such data when it is not stored in one place (BDA). 4. Data visualization/output. q VARIETY Ø Data can be sourced from emails, audio players, video recorders, watches, personal devices, computers, health monitoring systems, satellites, etc. Ø Each device that is recording data is recording and encoding it in a different format and pattern. Ø Additionally, the data generated from these devices can also vary by granularity, timing, pattern and schema. Ø Much of the data generated is based on object structures that vary depending on an event, individual, transaction or location. Ø Data collection from varied sources and in varied forms means that traditional relational databases and structures cannot be used to interpret and store this information. Ø NoSQL technologies are the solution to move us forward because of the flexible approach they bring to storing and reading data without imposing strict relational bindings.


Ø NoSQL systems such as Document Stores and Column Stores already provide a good replacement to OLTP/relational database technologies as well as read/write speeds that are much faster.

Velocity q The velocity of data streaming is extremely fast paced. q Every millisecond, systems all around the world are generating data based on events and interactions. q Devices like heart monitors, televisions, RFID scanners and traffic monitors generate data at the millisecond. Servers, weather devices, and social networks generate data at the second. q As technology advances, it would not be surprising to see devices that generate data even at the nanosecond. q The reward that this data velocity provides is information in real time that can be harnessed to make near-real-time decisions or actions. q Most of the traditional insights we have are based on aggregations of actuals over days and months. Having data at the grain of seconds or milliseconds provides more detailed and vivid information. q With the speed at which data is generated, it demands equally quick, if not quicker, tools and technology to be able to extract, process and analyze the data. q This limitation has led to the emergence of Big Data architectures and technologies: NoSQL, distributed and service-oriented systems. q NoSQL systems replace traditional OLTP/relational database technologies because they place less importance on ACID (Atomicity, Consistency, Isolation, Durability) principles and are able to read/write records at much faster speeds. q Distributed and load-balancing systems have now become a standard in all organizations to split and distribute the load of extracting, processing and analyzing data across a series of servers. q This allows large amounts of data to be processed at high speed, which eliminates bottlenecks. q Enterprise Service Bus (ESB) systems replace traditional integration frameworks written in custom code. q These distributed and easily scalable systems allow for serialization across large workloads and applications to process large amounts of data for a variety of different applications and systems.

Volume q If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute. q billions of touch points generate Petabytes and Zettabytes of data. q On social media and telecommunication sites alone, billions of messages, clicks and uploads take place everyday. q We now have information for every interaction, perspective and alternate. Having this diverse data allows us to more effectively analyze, predict, test and ultimately prescribe to our customers. q Large collections of data coupled with the challenges of Variety (different formats) and Velocity (near real time generation) pose significant managing costs to organizations. q Despite the pace of Moore's Law, the challenge to store large data sets can no longer be met with traditional databases or data stores.


q Distributed storage systems like SAN (Storage Area Network), as well as NoSQL data stores, are able to effectively divide, compress and store large amounts of data with improved read/write performance.

Veracity q In this context, a fourth V, Veracity, is often referenced. q Veracity concerns data quality risks and accuracy, as data is generated at such a high and distributed frequency. q In solving the challenge of the 3 Vs, organizations put little emphasis or work into cleaning up the data and filtering out what is unnecessary, and as a result the credibility and reliability of data have suffered.

Differences between traditional and big data handling for business intelligence • Data collection: in traditional practice the data is collected from one enterprise, whereas big data is collected from different sources across the internet. • Data storage: traditional data can be accommodated on one server's storage, whereas big data cannot and has to be distributed across different storage nodes. Also, big data is scaled horizontally by adding more servers and storage space rather than scaling on the same server, whereas traditional data is scaled up vertically. • Data analysis: since big data is distributed, it also has to be processed in parallel, both offline and in real time, while traditional data can be analyzed offline. Also, in the traditional approach the data is structured and the data is moved to the processing functions, whereas with big data it is difficult to move large volumes of data, so the processing functions must be moved to the data instead. • Data visualization/output: to steer the business to excellence by understanding customers', vendors' and suppliers' requirements and preferences.

1.5. Parallel and distributed systems for Big data.

WHY DISTRIBUTED COMPUTING IS NEEDED FOR BIG DATA Not all problems require distributed computing. If a big time constraint doesn’t exist, complex processing can be done via a specialized service remotely. When companies needed to do complex data analysis, IT would move data to an external service or entity where lots of spare resources were available for processing.

It wasn’t that companies wanted to wait to get the results they needed; it just wasn’t economically feasible to buy enough computing resources to handle these emerging requirements. In many situations, organizations would capture only selections of data rather than try to capture all the data because of costs. Analysts wanted all the data but had to settle for snapshots, hoping to capture the right data at the right time.

Key hardware and software breakthroughs revolutionized the data management industry. First, innovation and demand increased the power and decreased the price of hardware. New software emerged that understood how to take advantage of this hardware by automating processes like load balancing and optimization across a huge cluster of nodes.


THE CHANGING ECONOMICS OF COMPUTING AND BIG DATA

Fast-forward and a lot has changed. Over the last several years, the cost to purchase computing and storage resources has decreased dramatically. Aided by virtualization, commodity servers that could be clustered and blades that could be networked in a rack changed the economics of computing. This change coincided with innovation in software automation solutions that dramatically improved the manageability of these systems.

The capability to leverage distributed computing and parallel processing techniques dramatically transformed the landscape and dramatically reduced latency. There are special cases, such as High Frequency Trading (HFT), in which low latency can only be achieved by physically locating servers in a single location.

1.6.BIG DATA ANALYTICS

Big Data Analytics Existing analytics tools and techniques will be very helpful in making sense of big data. However, there is a catch. The algorithms that are part of these tools have to be able to work with large amounts of potentially real-time and disparate data. The infrastructure that we cover earlier in the chapter will need to be in place to support this. And, vendors providing analytics tools will also need to ensure that their algorithms work across distributed implementations. Because of these complexities, we also expect a new class of tools to help make sense of big data. We list three classes of tools in this layer of our reference architecture. They can be used independently or collectively by decision makers to help steer the business. The three classes of tools are as follows:

✓ Reporting and dashboards: These tools provide a “user-friendly” representation of the information from various sources. Although a mainstay in the traditional data world, this area is still evolving for big data. Some of the tools that are being used are traditional ones that can now access the new kinds of databases collectively called NoSQL (Not Only SQL). We explore NoSQL databases in Chapter 7. ✓ Visualization: These tools are the next step in the evolution of reporting. The output tends to be highly interactive and dynamic in nature. Another important distinction between reports and visualized output is animation. Business users can watch the changes in the data utilizing a variety of different visualization techniques, including mind maps, heat maps, infographics, and connection diagrams. Often, reporting and visualization occur at the end of the business activity. Although the data may be imported into another tool for further computation or examination, this is the final step. ✓ Analytics and advanced analytics: These tools reach into the data warehouse and process the data for human consumption. Advanced analytics should explicate trends or events that are transformative, unique, or revolutionary to existing business practice. Predictive analytics and sentiment analytics are good examples of this science


What is BDA? 1. Working with data sets whose volume, variety and velocity exceed the present storage and computing capabilities. 2. Steering the business to excellence by understanding customers', vendors' and suppliers' requirements and preferences. 3. Quicker and better decision making. 4. Better collaboration between IT, business users and data scientists. 5. Writing the code for distributed processing to achieve the above tasks.

What isn't BDA?

Data Analytics • Data Analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. • The data that is captured by any data collection agent or tool or software is in its raw form, i.e., unformatted or unstructured or unclean with noises/errors or redundant or inconsistent. • Hence, analytics covers a spectrum of activities starting from data collection till visualization. • data analytics is generally divided into three broad categories: • (i) Exploratory Data Analysis (EDA) • (ii) Confirmatory Data Analysis (CDA) • (iii) Qualitative Data Analysis (QDA)

Difference in analysis of data

Traditional Analytics • It is structured and repeatable in nature • Structure is built to store data • Business users determine the questions which shall be answered by building systems by IT experts

Big Data Analytics • Iterative and exploratory in nature • Data itself is a structure • IT team and data experts deliver the data on flexible platform for any exploration and querying by the business users

1.7. Classification of analytics • Classification 1 • 1. Basic analytics • 2. Operationalized analytics • 3. Advanced analytics • 4. Monetized analytics • Classification 2 • 1. Analytics 1.0 • 2. Analytics 2.0 • 3. Analytics 3.0 • Classification 3 • (i) Exploratory Data Analysis (EDA) • (ii) Confirmatory Data Analysis (CDA) • (iii) Qualitative Data Analysis (QDA)



1. Basic analytics: slicing and dicing of historical data to generate reporting, basic visualization, etc.
2. Operationalized analytics: where the analysis is woven into the business processes of an enterprise.
3. Advanced analytics: using predictive and prescriptive modeling to forecast the future.
4. Monetized analytics: used to derive direct revenue.

• Past: • Reports/dashboards: what happened? • Data mining: why did it happen? • Present: • Real-time analytics: what is happening? • Real-time data mining: why is it happening? • Future: • Predictive analytics: what is likely to happen? • Prescriptive analytics: how to leverage it to one’s own advantage?

1.8. Challenges facing big data

• 1. Scale: storage that can handle elastically scaling data, vertically or horizontally. • 2. Security: NoSQL platforms have poor security mechanisms. • 3. Schema: dynamic. • 4. Continuous availability: available 24/7 without downtime; how much downtime is built into NoSQL and RDBMS platforms? • 5. Consistency: always getting the latest updated data; should we opt for consistency or eventual consistency? • 6. Partition tolerance: if the network is partitioned, the system should still be able to handle hardware and software problems. • 7. Data quality: how to maintain accuracy and timeliness? Is there metadata in place?

CAP Theorem • Only 2 of the 3 properties, Consistency (C), Availability (A) and Partition tolerance (P), can be guaranteed at the same time. • CA: traditional RDBMS, MySQL, etc. • CP: HBase, MongoDB, etc. • AP: Riak, Cassandra, etc.

Why BDA is important ?

Because BDA has various approaches that lead to: 1. Reactive – business intelligence: this approach analyzes historical, static data sets and generates reports. It enables a business to take better decisions by providing the right information to the right person at the right time in the right format. 2. Reactive – BDA: this approach also analyzes static data only, but here the data is huge.


3. Proactive – analytics: this approach applies traditional data mining, predictive modeling, text mining and statistical analysis to big data; therefore it has limitations on storage and processing capacity. 4. Proactive – BDA: this approach filters relevant data from big data and analyzes it using high-performance analytics to solve complex problems using more data.

What to do with this data? • Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. • Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independently or together with their existing enterprise data to gain new insights, resulting in significantly better and faster decisions. • Aggregation and statistics: information is gathered based on specific variables such as age, profession, or income and expressed in a summary form for statistical analysis (a short sketch follows this list). Data aggregation is common in data warehouses and OLAP operations. • Indexing, searching, and querying: indexing based on keys is suitable for keyword-based search and pattern matching applications – pattern matching (XML/RDF). • Knowledge discovery: knowledge discovery by applying various data mining and statistical modeling techniques on such data has become strategically important – data mining – statistical modeling. • Companies now use an increasing array of tools to develop a 360-degree view of the customer (figure 3): • social media listening tools to gather what customers are saying on sites like Facebook and Twitter, • predictive analytics tools to determine what customers may research or purchase next, • customer relationship management suites and marketing automation software. • Companies can get a complete view of customers by aggregating data from the various touch points that a customer may use.
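A small aggregation sketch using pandas; the column names and values are hypothetical, chosen only to mirror the "age, profession, income" example above:

```python
import pandas as pd

customers = pd.DataFrame({
    "profession": ["engineer", "teacher", "engineer", "doctor", "teacher"],
    "age":        [29, 41, 35, 50, 38],
    "spend":      [1200, 300, 900, 2500, 450],
})

# Aggregate by a specific variable (profession) and express it in summary form.
summary = customers.groupby("profession").agg(
    customers=("age", "size"),
    avg_age=("age", "mean"),
    total_spend=("spend", "sum"),
)
print(summary)
```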

Terminologies • In-memory analytics: technology to query data in RAM rather than data stored on disk • In-database processing • Symmetric multiprocessing (SMP) systems • Massively parallel processing: coordinated processing of a program by multiple processors, each working on different parts of the program and using its own OS and memory • Distributed and parallel computing

1.9. Data science and data scientists • Data science is the science of extracting knowledge from data. It is the science of recognizing hidden patterns among the data using techniques drawn from statistics, mathematics, IT, machine learning, data engineering, probability models, statistical learning, pattern recognition, etc. • It is multi-disciplinary. • It explores massive data sets for weather prediction, oil drilling, seismic activity, etc.


Big data use cases

1.10. BASE • It is used in distributed computing. • Why? To achieve high availability. • How is it achieved? • BASE is a data system design philosophy that prefers availability over consistency of operations. • BASE was developed as an alternative for: - producing more scalable and affordable data architectures, - providing more options to expanding enterprises/IT clients, - and simply acquiring more hardware to expand data operations. • BASE is an acronym for Basically Available, Soft state, Eventual consistency. • Basically Available: The system is guaranteed to be available for querying by all users. • Soft State: The values stored in the system may change because of the eventual consistency model, as described in the next bullet. • Eventually Consistent: As data is added to the system, the system’s state is gradually replicated across all nodes. For the short period before the blocks are replicated, the state of the file system isn’t consistent.

1.11. Analytics tools • MS EXCEL • SAS • IBM SPSS Modeler • Statistica • Salford systems • WPS


Main open source analytics tools • R analytics • Weka

Other open source analytics tools • MongoDB • R Programming Environment • Neo4j • Apache SAMOA

Extra tools: • 1. R tool • 2 Weka • 3. Pandas • 4.Tanagra • 5 Gephi • 6.MOA( Massive Online Analysis) • 7.Orange • 8.Rapid Miner • 9.Root packages • 10.Encog, • 11.NodeXL • 12.Waffles

Businesses and Big Data Analytics Big Data analytics tools and techniques are rising in demand due to the use of Big Data in businesses. Organizations can find new opportunities and gain new insights to run their business efficiently. These tools help in providing meaningful information for making better business decisions. Companies can improve their strategies by keeping the customer focus in mind. Big data analytics efficiently helps operations become more effective. This helps in improving the profits of the company. Big data analytics tools like Hadoop help in reducing the cost of storage. This further increases the efficiency of the business. With the latest analytics tools, analysis of data becomes easier and quicker. This, in turn, leads to faster decision-making, saving time and energy.

Real-time Benefits of Big Data Analytics There has been an enormous growth in the field of Big Data analytics with the benefits of the technology. This has led to the use of big data in multiple industries ranging from • Banking • Healthcare • Energy • Technology • Consumer • Manufacturing There are many other industries which use big data analytics. Banking is seen as the field making the maximum use of Big Data Analytics.


BDA - UNIT-II

Topics Covered:

• 2.1. Compare Reporting and Analysis; • 2.2. Types of Analytics; • 2.3. Points to Consider during Analysis; • 2.4. Developing an Analytic Team; • 2.5. Text Analytics: understanding text analytics; • 2.6. Analytical Approach and Tools to Analyze Data and history of analytical tools; • 2.7. Introduction to popular Analytical Tools and their comparison

2.1. Compare Reporting and analysis


2.2. Types of Analytics

Data Analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. The data that is captured by any data collection agent or tool or software is in its raw form, i.e., unformatted or unstructured or unclean with noises/errors or redundant or inconsistent. Hence, analytics covers a spectrum of activities starting from data collection till visualization.

There are 3 types of analytics

Exploratory Data Analysis (EDA)


Confirmatory Data Analysis (CDA)

Qualitative Data Analysis (QDA)

2.3. Points to consider during Analysis • Essential factors that drives analysis are “G.R.E.A.T” criteria: Guided, Relevant, Explainable, Actionable and Timely • Guided : • A good analysis is one that starts through the identification of a specific business problem. Once identified, the analysis is guided by what is required to solve that problem. Every step of the analysis should be guided by the needs of the problem. • Relevant : • Any great analysis has to be relevant to the business. The problem needs to be one that the business feels needs a solution, and it has to be a problem that the business has an ability to address. • Explainable


• A good analysis will be explainable: it should make it easy for the decision makers to make a decision, rather than confronting them with formulas, algorithms and statistics. Technical details may be the proof required to show that an analysis is valid, but the results need to be explained in terms that decision makers can understand. • Actionable • A great analysis will be actionable. It will point to specific steps that can be taken to improve a business. An analysis becomes useless if it does not provide the ability to act upon it. • Timely • Time is critical because the analysis is needed in time for decision making. It is possible for an analysis to be great in every other aspect, but if it cannot be completed in time for the decision it supports, it is not useful. A late analysis is not a great analysis.

2.4. Developing an analytics team • Analytics team consists of • Data Analysts: for taking data, using it to answer questions, and communicating the results to help make business decisions • Data Scientists: Specialist that applies expertise in statistics and building machine learning models to make predictions and answer key business questions • Discover hidden insights in data by leveraging supervised and unsupervised machine learning models • Data engineer: Build and optimize the systems that allow data scientist and analysts to perform their work. Ensures data is properly received, transformed, stored, and made accessible to other users • Leans heavier in software development skillset

Project team & Roles • Project Lead → Project plan with scope & timeline • Data Architect → Data model and queries • Product Developer → Implementation of tracking • Analyst(s) → Generation of new business questions • Reporting Developer → Reports for your business

• It’s important to have a team of people that can build data connections and warehouses, and get to know the data. • Analytics teams should include people with an understanding of areas such as relational tables, dimensional models, cubes, JavaScript Object Notation (JSON), Extensible Markup Language (XML), and comma-separated values (CSV). • The team needs at least one expert for each type of database, including SQL, NoSQL document, and NoSQL wide column. • The analytics team needs secure, reliable access to resources such as data hubs, data lakes, and data warehouses. • The team may have to handle different projects at the same time. This requires setting up project teams.


2.5.Text Analytics: understanding text analytics

• In a customer experience context, text analytics means examining text that was written by, or about, customers. • You find patterns and topics of interest, and then take practical action based on what you learn. • Text analytics can be performed manually, but it is an inefficient process. • Therefore, text analytics software has been created that uses text mining and natural language processing algorithms to find meaning in huge amounts of text. • Also known as Natural Language Processing, text analytics is the science of turning the text portion of unstructured data into structured data. • It has moved from university research into real-world products that can be used by any business. • Text mining, or text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. • The text data, i.e., keywords, concepts, verbs, nouns, adjectives, etc., is extracted through the text mining process. • It is then used in the text analysis step to extract insight from the data by discovering patterns and trends through statistical pattern learning. • NLP addresses tasks such as identifying sentence boundaries in documents, extracting relationships from documents, and searching and retrieving documents, among others. • NLP is a necessary means to facilitate text analytics by establishing structure in unstructured text to enable further analysis. • Emails, online reviews, tweets, call center agent notes, survey results, and other types of written feedback all hold insight into your customers. • There is also a wealth of information in recorded interactions that can easily be turned into text. • Text analytics is the way to unlock the meaning from all of this unstructured text. It lets you uncover patterns and themes, so you know what customers are thinking about. It reveals their wants and needs. • In addition, text analytics software can provide an early warning of trouble, because it shows what customers are complaining about. Using text analytics tools gives you valuable information from data that isn’t easily quantified in any other way. It turns the unstructured thoughts of customers into structured data that can be used by the business.

Text Analytics process • High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. • Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. ‘ • High quality' in text mining usually refers to some combination of relevance, novelty, and interest. • Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). • Text analysis involves • information retrieval, • lexical analysis to study word frequency distributions, • pattern recognition, tagging/annotation, information extraction,


• data mining techniques including link and association analysis, visualization, and predictive analytics. • A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted
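A minimal sketch of the first steps of this process, lexical analysis and word-frequency counting, on a few made-up customer feedback strings (the stop-word list is illustrative, not a standard resource):

```python
import re
from collections import Counter

feedback = [
    "Delivery was late and the support team never replied",
    "Great product, fast delivery, very happy",
    "Support was slow but the product quality is great",
]

stop_words = {"was", "and", "the", "but", "is", "very", "a"}

tokens = []
for text in feedback:
    words = re.findall(r"[a-z]+", text.lower())          # lexical analysis / tokenization
    tokens.extend(w for w in words if w not in stop_words)

# Word frequency distribution: a first step toward finding patterns and topics.
print(Counter(tokens).most_common(5))
# e.g. [('delivery', 2), ('support', 2), ('great', 2), ('product', 2), ('late', 1)]
```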

2.6. Analytical Approaches

Traditionally, the business expected that data would be used to answer questions about what to do and when to do it. Data was often integrated as fields into general-purpose business applications. With the advent of big data, applications are now being designed specifically to take advantage of the unique characteristics of big data.


Advanced Analytics • The main goal of advanced analytics is to quantify the cause of events, predict when they might happen again, and identify how to influence those events in the future.

2.7. Tools to analyze data + history of Analytical tools+ Introduction to popular Analytical tools + comparing various analytical tools.

Data Visualization open source tools (free under the GNU General Public License) • Data visualization describes the presentation of abstract information in graphical form. • Data visualization allows us to spot patterns, trends, and correlations that otherwise might go unnoticed in traditional reports, tables, or spreadsheets. • Data analysis is the process of inspecting, cleaning, transforming and modelling the data with the goal of discovering useful information, suggestions and conclusions.

1. R • R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. This programming language was named R, based on the name of the two authors and is currently developed by the R Development Core Team. The current R is the result of a collaborative effort with contributions from all over the world. It is highly extensible and flexible. • R is an interpreted language; users typically access it through a command-line interpreter. Pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

2. Weka • The original non-Java version of WEKA primarily was developed for analyzing data from the agricultural domain. With the Java-based version, the tool is very sophisticated and used in many different applications including visualization and algorithms for data analysis and predictive modeling. The users can customize it however they please. • WEKA supports several standard data mining tasks, including data preprocessing, clustering, classification, regression, visualization and feature selection. Sequence modeling is currently not included • Weka uses the Attribute Relation File Format for data analysis, by default. But listed below are some formats that Weka supports, from where data can be imported: • Ø CSV Ø ARFF Ø Database using ODBC

3. Pandas • pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. pandas is well suited for many different kinds of data: • Ø Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet • Ø Ordered and unordered (not necessarily fixed-frequency) time series data • Ø Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels • Ø Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


• The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
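A quick sketch of the two primary structures; the values are made up for illustration:

```python
import pandas as pd

# Series: a 1-dimensional labelled array.
temperatures = pd.Series([31.5, 33.0, 29.8], index=["Mon", "Tue", "Wed"])

# DataFrame: a 2-dimensional table with labelled rows and heterogeneously-typed columns.
sales = pd.DataFrame({
    "region": ["North", "South", "North"],
    "units":  [120, 95, 150],
    "price":  [9.99, 12.50, 9.99],
})

print(temperatures.mean())                       # 31.43...
print(sales.groupby("region")["units"].sum())    # North 270, South 95
```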

4. Tanagra • TANAGRA is free data mining software for academic and research purposes. It proposes several data mining methods from exploratory data analysis, statistical learning, machine learning and the database area. • This project is the successor of SIPINA, which implements various supervised learning algorithms, especially an interactive and visual construction of decision trees. TANAGRA is more powerful: it contains some supervised learning but also other paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, feature selection and construction algorithms. • TANAGRA is an "open source project", as every researcher can access the source code and add his own algorithms, as far as he agrees and conforms to the software distribution license. The main purpose of the Tanagra project is to give researchers and students easy-to-use data mining software, conforming to the present norms of software development in this domain (especially in the design of its GUI and the way to use it), and allowing the analysis of either real or synthetic data. • The second purpose of TANAGRA is to propose to researchers an architecture allowing them to easily add their own data mining methods and to compare their performance. TANAGRA acts more as an experimental platform, letting them get to the essentials of their work and sparing them the unpleasant part of programming this kind of tool: the data management. • The third and last purpose, aimed at novice developers, consists in diffusing a possible methodology for building this kind of software. They should take advantage of free access to the source code, to look at how this sort of software is built, the problems to avoid, the main steps of the project, and which tools and code libraries to use. In this way, Tanagra can be considered a pedagogical tool for learning programming techniques.

5. Gephi • Gephi is an open-source network analysis and visualization software package written in Java on the NetBeans platform. Gephi is an open source tool designed for the interactive exploration and visualization of networks, built to facilitate the user's exploratory process through real-time analysis and visualization. The visualization module uses a 3D render engine that uses the computer's graphics card while leaving the CPU free for computing. It is highly scalable (can handle over 20,000 nodes) and built on a multi-task model to take advantage of multi-core processors. It runs on Windows, Mac OS X and Linux.

6. MOA (Massive Online Analysis) • Massive Online Analysis (MOA) contains several collections of machine learning algorithms. It includes tools for evaluation and a collection of machine learning algorithms specific to data stream mining with concept drift. It is written in Java and developed at the University of Waikato, New Zealand. MOA is framework software that allows building and running experiments of machine learning or data mining on evolving data streams. It includes a set of learners and stream generators that can be used from the Graphical User Interface (GUI), the command line, and the Java API.


• MOA supports bi-directional interaction with Weka (machine learning). Related to the WEKA project, it is also written in Java, while scaling to more demanding problems. • MOA currently supports stream classification, stream clustering, outlier detection, change detection, concept drift and recommender systems.
7. Orange • Orange is an open source data mining tool with very strong data visualization capabilities. It allows you to use a GUI (Orange Canvas) to drag and drop modules and connect them to evaluate and test various machine learning algorithms on your data. • Orange is a component-based visual programming software package for data visualization, machine learning, data mining and data analysis. Orange components are called widgets and they range from simple data visualization, subset selection and preprocessing, to empirical evaluation of learning algorithms and predictive modeling. • Visual programming is implemented through an interface in which workflows are created by linking predefined or user-designed widgets, while advanced users can use Orange as a Python library for data manipulation and widget alteration.
8. RapidMiner • Written in the Java programming language, this tool offers advanced analytics through template-based frameworks. Users hardly have to write any code. Offered as a service, rather than a piece of local software, this tool holds a top position on the list of data mining tools. • In addition to data mining, RapidMiner also provides functionality like data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. What makes it even more powerful is that it provides learning schemes, models and algorithms from WEKA and R scripts. • RapidMiner, formerly known as YALE (Yet Another Learning Environment), was developed starting in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the Technical University of Dortmund. Starting in 2006, its development was driven by Rapid-I, a company founded by Ingo Mierswa and Ralf Klinkenberg in the same year. In 2007, the name of the software was changed from YALE to RapidMiner. In 2013, the company rebranded from Rapid-I to RapidMiner. • RapidMiner uses a client/server model with the server offered either on-premise or in public or private cloud infrastructures. • According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code.
9. ROOT • ROOT is an object-oriented framework. It has a C/C++ interpreter (CINT) and a C/C++ compiler (ACLiC). ROOT is used extensively in high energy physics for data analysis: for reading and writing data files and for calculations to produce plots, numbers and fits. It is a modular scientific software framework that provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++ but integrates with other languages such as Python and R. It can handle large files (in GB) containing N-tuples and histograms. It is multiplatform, free software based on the widely known programming language C++. • The ROOT graphical framework provides support for many different functions including basic graphics, high-level visualization techniques, output to files, 3D viewing, etc. It uses well-known world standards to render graphics on screen, to produce high-quality output files, and to generate images for web publishing. Many techniques allow visualization of all the basic ROOT data types, but the graphical framework was still a bit weak in the visualization of multiple-variable data sets.
• 10. Encog, 11. NodeXL, 12. Waffles
They use well-known world standards to render graphics on screen, to produce high-quality output files, and to generate images for Web publishing. Many techniques allow visualization of all the basic ROOT data types, but the graphical framework was still a bit weak in the visualization of multiple variables data sets • 10.Encog, 11.NodeXL; 12.Waffles


BDA - UNIT-III

Understanding MR fundamentals and HBase

Topics Covered

3.1. MapReduce: 3.1.1. The MapReduce framework, 3.1.2. Techniques to optimize MR jobs, 3.1.3. Uses of MR.

3.2.HBase: 3.2.1.Role of HBase in Big data processing , 3.2.2.introducing HBase architecture 3.2.3. storing big data in HBase, 3.2.4.HBase operations-programming with HBase, Installation

3.3. Hadoop: 3.3.1. Storing data in Hadoop 3.3.2. Introduction of HDFS architecture 3.3.3. HDFS file system types, commands, 3.3.4. org.apache.hadoop.io package, 3.3.5. HDFS high availability, 3.3.6. Interacting with the Hadoop ecosystem

3.4. Combining HBase and HDFS

______

3.1.Map Reduce framework

1. MapReduce (MR) is a software framework which helps process massive amounts of data in parallel. 2. In MR the input data set is split into independent chunks. 3. MR involves two tasks: the Map task and the Reduce task. 4. The Map task processes the independent chunks in parallel; it converts input data into key-value pairs. The Reduce task combines the outputs of the mappers and produces a reduced data set. 5. The output of the mappers is automatically shuffled and sorted by the framework and stored as intermediate data on the local disk of that server. 6. The MR framework sorts the output of the mappers based on keys. 7. The sorted output becomes the input to the Reduce task. 8. The Reduce task combines the output of the various mappers and produces a reduced output. 9. The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks. 10. For a given job, the inputs and outputs are stored in a file system (here HDFS is used). 11. HDFS and the MR framework run on the same set of nodes. 12. The paradigm shift is that the scheduling of tasks is done on the nodes where the data is present: from a "data to compute" model to a "compute to data" model, i.e. data processing is co-located with data storage (data locality). This achieves high throughput.

MR daemons • There are two daemons associated with MR: 1. Job Tracker: a master daemon; there is a single Job Tracker on the master node per cluster. 2. Task Trackers: one slave Task Tracker on each node.


Job Tracker: • Responsible for scheduling tasks to the Task Trackers, monitoring the tasks, and re-executing a task if the Task Tracker fails. • It provides connectivity between Hadoop and our MR application. • The MR functions and the input/output locations are supplied by our MR application program through the job configuration. • In Hadoop, the job client submits the job (jar/executable, etc.) to the Job Tracker. • The Job Tracker creates the execution plan and decides which task to assign to which node. • The Job Tracker monitors the tasks, and if a task fails it will automatically reschedule it to a different node after a predetermined number of tries.

Task Trackers • This daemon is present on every node and is responsible for executing the tasks assigned to it by the Job Tracker of the cluster. • There is a single Task Tracker per slave, which spawns multiple JVMs to handle multiple map or reduce tasks in parallel. • The Task Tracker continuously sends heartbeat messages to the Job Tracker.

MapReduce features • Simplicity: programmers can easily design parallel and distributed applications. • Manageability: data and computation are allocated to the same slave (data) node, so there is no need to move data to the computation. • Scalability: adding data nodes increases job capacity with minimal losses. • Fault tolerance: any node with a hardware failure can be removed and a new node installed. • Reliability: tasks that were in progress on failed nodes are re-run automatically.

Map Reduce framework • Job Tracker is the master node (runs with the namenode) • Receives the user’s job • Decides on how many tasks will run (number of mappers) • Decides on where to run each mapper (concept of locality)

• Task Tracker is the slave node (runs on each data node) • Receives the task from Job Tracker • Runs the task until completion (either map or reduce task) • Always in communication with the Job Tracker reporting progress



Applications - It is used in Machine Learning, Graphic programming, and multi core programming

MR programming • Requires three things: • 1. driver class: it specifies job configuration details • 2. mapper class: it overrides map function based on the problem statement • 3. reducer class: this class overrides the Reduce function based on the problem statement
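
A minimal driver class sketch, illustrative only, assuming the classic Hadoop MapReduce Java API; the WordCountMapper and WordCountReducer classes it references are sketched after the programming model below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");    // job configuration details
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);         // mapper class
            job.setReducerClass(WordCountReducer.class);       // reducer class
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }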

Implementations of MR • Many implementations of MR have been developed in different languages for different purposes. 1. Hadoop: the most popular open source implementation is Hadoop, developed at Yahoo, which runs on top of HDFS. It is now used by Facebook, Amazon, etc. In this implementation it processes hundreds of terabytes of data on at least 10,000 cores. 2. Google implementation: it runs on top of the Google File System, within which data is loaded, partitioned into chunks, and each chunk replicated. It processes 20 petabytes of data per day.

MR programming model

• The MapReduce model is inspired by the map and reduce functions of functional languages like Lisp. • The Map function, written by the user, processes a key/value pair to generate a list of intermediate key/value pairs: map(key1, value1) -> list(key2, value2)


• The Reduce function, also written by the user, merges all intermediate values associated with a particular intermediate key: reduce(key2, list(value2)) -> list(value2) • The Reduce function is invoked once for each unique key in the sorted intermediate output; for example, in word count the Reduce function sums all the counts emitted for a particular key.
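
A minimal sketch of these two functions for word count, assuming the Hadoop MapReduce Java API (the class names are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(key1, value1) -> list(key2, value2): emits (word, 1) for every word in a line
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(key2, list(value2)) -> list(value2): sums all counts emitted for a word
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }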

Example 1: Color Count

Example 2: Color Count


Example 3: Color Filter

Example 4: Word Count


Introduction Hadoop, MR and HBase

• Since the 1970s, RDBMSs have been the solution for data storage and maintenance related problems. • After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. • Hadoop uses the distributed file system HDFS for storing big data, and MapReduce to process it. • Hadoop excels at storing and processing huge volumes of data in various formats: structured, semi-structured, and unstructured.

3.2. HBase HBase is a distributed, column-oriented database built on top of the Hadoop Distributed File System (HDFS).


3.2.3. Storing big data in HBase • HBase is a column-oriented database and the tables in it are sorted by row. • The table schema defines only column families; the columns within a family are stored as key-value pairs. • A table can have multiple column families and each column family can have any number of columns. • Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. • In short, in HBase: • A table is a collection of rows. • A row is a collection of column families. • A column family is a collection of columns. • A column is a collection of key-value pairs.
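
A minimal sketch of writing and reading one cell with the HBase Java client API; the table name "customer", column family "personal" and row key are illustrative assumptions, and a running HBase cluster with the hbase-client library on the classpath is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCellExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("customer"))) {

                // Write: row key -> column family -> column -> value (timestamp is added automatically)
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Robert"));
                table.put(put);

                // Read the same cell back
                Get get = new Get(Bytes.toBytes("row1"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(value));
            }
        }
    }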


3.2.4.HBase operations-programming with HBase, Installation • Installing Hbase: • We can install HBase in any of the three modes: Standalone mode, Pseudo Distributed mode, and Fully Distributed mode. • Installing HBase in Standalone Mode


• Download the latest stable version of HBase from http://www.interior-dsgn.com/apache/hbase/stable/ using the "wget" command, and extract it using the "tar zxvf" command. • Before proceeding with HBase, you have to edit the following files to configure HBase: • hbase-env.sh • hbase-site.xml

3.3.Hadoop

• Problem 1: storing exponentially growing datasets. • Solution: the Hadoop Distributed File System divides input data files into chunks of data and stores them across the cluster. • Problem 2: storing unstructured data. • Solution: Hadoop allows storing of unstructured, semi-structured and structured data. • It follows WORM (Write Once Read Many). • No schema validation is required while dumping data. • It is designed to run on clusters of commodity machines and is scalable as per requirements. • Problem 3: processing the data faster. • Solution: MapReduce in Hadoop provides parallel processing of the data present in HDFS; each data node processes the part of the data stored within that node.

Why is Hadoop able to compete with conventional DBMSs?

What is the Hadoop architecture?

• Hadoop is a framework consisting of clusters. Each cluster has two main layers: • HDFS layer: the Hadoop Distributed File System layer, which consists of one name node and multiple data nodes


• MapReduce layer: the execution engine layer, which consists of one job tracker and multiple task trackers

Developed by Yahoo

Main components of Hadoop • HDFS (Hadoop Distributed File System): for big data storage in a distributed environment; allows dumping of any kind of data across the cluster. • MapReduce: for faster data processing; allows parallel processing of the data stored in HDFS, with processing done at the data nodes instead of the data going to the processor (NameNode). • It is an Apache project built and used by a community of contributors. • Premier web players such as Google, Yahoo, Microsoft and Facebook use it as an engine to power the cloud. • The project is a collection of various subprojects: Apache Hadoop Common, Avro, Chukwa, HBase, HDFS, Hive, MapReduce, Pig, ZooKeeper.

Hadoop ecosystem (total tools) • Sqoop and Flume: to ingest data into HDFS. • HDFS: a distributed file system that allows storage of all three types of data. • YARN (Yet Another Resource Negotiator): the brain of Hadoop; it allocates resources, schedules jobs, and manages all processing activities. • Pig: a platform used to analyze large data sets by representing them as data flows; introduced by Yahoo; its language is Pig Latin. • Hive: a data warehousing tool that allows us to perform big data analytics using Hive Query Language, which is similar to SQL; introduced by Facebook. • MapReduce (Java): provides parallel processing of data sets. • HBase: a NoSQL database on top of HDFS that enables us to store unstructured and semi-structured data with ease and provides real-time read/write access. • Apache Spark: an in-memory data processing engine that allows efficient execution of streaming, machine learning or SQL workloads that require fast iterative access to data sets.


Hadoop Master/Slave Architecture • Hadoop is designed as a master-slave, shared-nothing architecture

Design Principles of Hadoop • Need to process big data • Need to parallelize computation across thousands of nodes • Support commodity hardware – a large number of low-end, cheap machines working in parallel to solve a computing problem – This is in contrast to conventional DBMSs, where a small number of high-end, expensive machines are used • Automatic parallelization & distribution – hidden from the end user • Fault tolerance and automatic recovery – nodes/tasks will fail and will recover automatically • Clean and simple programming abstraction – users only provide two functions, “map” and “reduce”

Hadoop: How it Works Hadoop Distributed File System (HDFS)


Limitations of Hadoop • Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. • A huge dataset when processed results in another huge data set, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access). • Hadoop Random Access Databases: • Applications such as HBase, Cassandra, couchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.

Hadoop vs. Other Systems


• Cloud Computing • A computing model where any computing infrastructure can run on the cloud • Hardware & Software are provided as remote services • Elastic: grows and shrinks based on the user’s demand • Example: Amazon EC2

3.4. combining HBase with HDFS.


BDA - UNIT- IV

Big data technology landscape Two important technologies: NoSQL and Hadoop

Topics Covered:

1.Distributed computing challenges 2.NoSQL 3.Hadoop: consisting of HDFS and MapReduce. 3.1.history of hadoop, 3.2.hadoop overview 3.3. use case of hadoop, 3.4.hadoop distributors, 4. HDFS: 4.1.HDFS daemons: Namenode, datanode, secondary namenode 4.2. file read, file write, Replica processing of data with hadoop 4.3.Managing resources and applications with Hadoop YARN

4.1. Distributed computing challenges

1. In a distributed system, since several servers are networked together, there can be hardware failures; for example, a hard disk failure creates a data retrieval problem. 2. In a distributed system the data is spread across several machines. How do we integrate it prior to processing? Solution: two important technologies, NoSQL and Hadoop, which we study in this Unit 4.

4.2.NoSQL

RDBMSs • MySQL is the world's most used RDBMS, and runs as a server providing multi-user access to a number of databases. • The Oracle Database is an object-relational database management system (ORDBMS). • The main difference between Oracle and MySQL is the fact that MySQL is open source, while Oracle is not. • SQL stands for Structured Query Language. It is a standard language for accessing and manipulating databases. • SQL Server, Oracle, Informix, Postgres, etc. are RDBMSs.

2.1. Introduction to NoSQL • NoSQL is a distributed database model, while Hadoop is not a database (Hadoop is a framework). • NoSQL is open source, non-relational, and scalable. • There are several databases which follow this NoSQL model. • NoSQL databases are used in big data and real-time web applications and social media. • They do not restrict the data to adhere to any schema at the time of storage.


• They structure the unstructured input data into different formats, viz. key-value pairs, document oriented, column oriented, and graph based data, besides structured data. • They adhere to the CAP theorem and compromise on C (consistency) in favor of A (availability) and P (partition tolerance). • They do not support the ACID properties of transactions (Atomicity, Consistency, Isolation, and Durability).

2.2.Types of NoSQL data bases

They can be broadly classified into: 1. Key-value or big hash table type: they maintain a big hash table of keys and values. Sample key-value pairs:
  key          value
  First name   Robert
  Last name    Williams
2. Document type: maintain data as a collection of documents. A document is the equivalent of a record in an RDBMS and a collection is the equivalent of a table in an RDBMS. Sample document: {“Book Name”: “Fundamentals .. “, “Publisher”: “Wiley India”, “year”: “2011” }

3. Column type: each storage block has data from only one column. 4. Graph type: also called a network database. A graph stores data in nodes; for example, ID, name and age are stored in each node, while the arrows (edges) carry labels like “member”, “member since 2002”, “knows since 2002”, etc.

2.3. popular NoSQL data bases

NoSQL databases fall into two broad groups: 1. Key-value or big hash table, and 2. Schema-less. 1. Key-value or big hash table type NoSQL databases (some schema is followed): Amazon S3 (Dynamo), Scalaris, Redis, Riak. 2. Schema-less (no schema, not even key/value): 2.1 Column based: Cassandra, HBase. 2.2 Document based: Apache CouchDB, MongoDB, MarkLogic. 2.3 Graph based: Neo4j, HyperGraphDB.

2.4. Advantages of NoSQL • Dynamic schema: since it allows insertion of data without a predefined schema, it facilitates application changes in real time, i.e. faster code development and integration and less database administration. • Auto-sharding: it automatically spreads data across an arbitrary number of servers while balancing the load and queries on the servers; if a server fails it is replaced without disruption. • Replication: multiple copies of data are stored across the cluster and even across data centers. This promises high availability and fault tolerance. • Rapid and elastic scalability: allows scaling out to the cloud with the following capacities:


• Cluster scale: allows distribution of the database across more than 100 nodes in multiple data centers. • Performance scale: supports more than 100,000 database read and write operations per second. • Data scale: supports storing more than 1 billion documents in the database. • Cheap and easy to implement. • Adheres to CAP and relaxes the consistency requirement.

2.5.Disadvantages of NoSQL • Does not support joins • No support for ACID • No standard query language interface except in case of MongoDB and Cassandra(CQL) • No easy integration with other applications that support SQL

2.6. NoSQL applications in industry • Key-value type databases: used for shopping carts and web user data analysis (Amazon, LinkedIn). • Column type databases: used by Facebook, Twitter, eBay. • Document type databases: used for logging and archives management. • Graph type databases: used in network modeling (e.g. Walmart). • NoSQL vendors: 1. Amazon (Dynamo): used by LinkedIn, Mozilla. 2. Facebook (Cassandra): used by Netflix, Twitter, eBay, i.e. a column type database. 3. Google (BigTable): used by Adobe Photoshop.

2.7. NewSQL • A database that has the same scalable performance as NoSQL, supports OLTP, and maintains the ACID guarantees of a traditional database. • It is a new RDBMS supporting the relational data model and using SQL as its interface.

2.8.Comparison

ACID • In databases, a transaction is a very small unit of a program and it may contain several low-level tasks. • A transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability − commonly known as the ACID properties − in order to ensure accuracy, completeness, and data integrity. • For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.


• Atomicity Consistency Isolation Durability (ACID) is a concept referring to a database system's four transaction properties: atomicity, consistency, isolation and durability. • These four properties describe the major guarantees of the transaction paradigm, which has influenced many aspects of development in database systems.

Atomicity • An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs. • Transactions are often composed of multiple statements. • A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright. • Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely or fails completely: if any of the statements in a transaction fails to complete, the entire transaction fails and the database is left unchanged. • An atomic system must guarantee atomicity in each and every situation, including power failures, errors and crashes.

Consistency • Consistency ensures that a transaction can only bring the database from one valid state to another valid state, maintaining database invariants: • any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. • This prevents database corruption by an illegal transaction, but does not guarantee that a transaction is correct.

Isolation • Transactions are often executed concurrently (e.g., reading and writing to multiple tables at the same time) • Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. • Isolation is the main goal of concurrency control; • depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions.

Durability • Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). • This usually means that completed transactions (or their effects) are recorded in non-volatile memory
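
A small sketch of the funds-transfer example above as a single transaction using JDBC; the connection URL and the accounts table are illustrative assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class TransferFunds {
        public static void transfer(String url, int fromId, int toId, double amount) throws SQLException {
            try (Connection conn = DriverManager.getConnection(url)) {
                conn.setAutoCommit(false);                        // start a transaction
                try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                     PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                    debit.setDouble(1, amount);
                    debit.setInt(2, fromId);
                    debit.executeUpdate();
                    credit.setDouble(1, amount);
                    credit.setInt(2, toId);
                    credit.executeUpdate();
                    conn.commit();                                // both updates become durable together
                } catch (SQLException e) {
                    conn.rollback();                              // atomicity: any partial update is undone
                    throw e;
                }
            }
        }
    }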


4.3.Hadoop: 3.1.history of hadoop, 3.2.hadoop overview 3.3. use case of hadoop, 3.4.hadoop distributors, 4. HDFS: 4.1.HDFS daemons: Namenode, datanode, secondary namenode 4.2. file read, file write, Replica processing of data with hadoop 4.3.Managing resources and applications with Hadoop YARN

1. Hadoop overview

• Hadoop is used for 1. massive data storage and 2. faster data processing. Key aspects: 1. Open source software. 2. Framework: programs, tools, etc. are provided to develop and execute applications; it is not a database like NoSQL. 3. Distributed: data is distributed across multiple computers and processed in parallel. 4. Massive data and faster processing.

Hadoop distributors • The following companies supply Hadoop products: • Cloudera, MapR, Apache Hadoop

4. HDFS HDFS is one of the two core components of Hadoop, the second being MapReduce. 4.1. HDFS daemons: NameNode, DataNode, Secondary NameNode 4.2. File read, file write, replica processing of data with Hadoop 4.3. Managing resources and applications with Hadoop YARN

4.1. HDFS daemons 1. NameNode: • There is a single NameNode per cluster. • It manages file-related operations like read, write, create and delete. • The NameNode stores the HDFS namespace. • It manages the file system namespace, which is a collection of files in the cluster. • The file system namespace includes the mapping of blocks to files and file properties, and is stored in a file called FsImage. • It uses an EditLog to record every transaction. • A rack is a collection of data nodes within a cluster; the NameNode uses a rackID to identify data nodes in the rack. • When the NameNode starts, it reads the FsImage and EditLog from disk and applies all transactions from the EditLog to its in-memory representation of the FsImage. • Then it flushes out a new version of the FsImage to disk and truncates the old EditLog, because the changes are now reflected in the FsImage.


2. DataNode • There are multiple DataNodes per cluster. • During pipeline reads and writes, DataNodes communicate with each other. • A DataNode also sends heartbeat messages to the NameNode to ensure connectivity between the NameNode and the DataNodes. • If no heartbeat is received, the NameNode replicates that DataNode's blocks onto other DataNodes within the cluster and keeps running.

3. Secondary NameNode • It takes a snapshot of HDFS metadata at intervals specified in the configuration. • It occupies the same memory size as the NameNode; therefore they are run on different machines. • In the event of NameNode failure, the Secondary NameNode can be configured to take over.

4.2. File read, file write, replica processing of data with Hadoop • File read: • 1. The client opens the file it wants to read by calling open() on the DFS. • 2. The DFS communicates with the NameNode to get the locations of the data blocks. • 3. The NameNode returns the addresses of the DataNodes containing the data blocks. • 4. The DFS returns an FSDataInputStream to the client. • 5. The client calls read() on the FSDataInputStream, which holds the addresses of the DataNodes for the first few blocks of the file, and connects to the nearest DataNode for the first block in the file. • 6. The client calls read() repeatedly to stream the data from the DataNode. • 7. When the end of a block is reached, the FSDataInputStream closes the connection with that DataNode. • 8. It repeats these steps to find the best DataNode for the next block. • 9. The client calls close().

File write • 1. The client calls create() to create the file. • 2. An RPC call is initiated to the NameNode. • 3. The NameNode creates the file after a few checks. • 4. An FSDataOutputStream is returned for the client to write to. • 5. As the client writes data, the data is split into packets, which are then written to a data queue. • 6. The DataStreamer asks the NameNode to allocate blocks by selecting a list of suitable DataNodes for storing the replicas (by default 3). • 7. This list of DataNodes forms a pipeline, with 3 nodes in the pipeline for the first block. • 8. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores them and forwards them to the other DataNodes in the pipeline. • 9. The DFSOutputStream manages an "ack queue" of packets that are waiting for acknowledgement; a packet is removed from the queue only when it has been acknowledged by all the DataNodes in the pipeline. • 10. When the client finishes writing the file, it calls close() on the stream. • 11. This flushes all the remaining packets to the DataNode pipeline and waits for acknowledgements before communicating with the NameNode to inform it that the creation of the file is complete. A code sketch of this flow follows.
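
A minimal sketch of the same write and read flow using the HDFS Java FileSystem API; the path is illustrative, and Hadoop client libraries plus a reachable HDFS configuration are assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);               // the HDFS client file system
            Path path = new Path("/user/demo/sample.txt");      // illustrative path

            // Write: create() contacts the NameNode, data is streamed to the DataNode pipeline
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read: open() asks the NameNode for block locations, read() streams from DataNodes
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }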

Replica processing of data with Hadoop • Replica placement strategy: • By default 3 replicas are created for each data block. • The 1st replica is placed on the same node as the client.


• The 2nd replica is placed on a node in a different rack. • The 3rd replica is placed on the same rack as the second but on a different node in the rack. • Then a data pipeline is built. The client application writes a block to the 1st DataNode in the pipeline; this DataNode then takes over and forwards the data to the next node in the pipeline. • This process continues for all the data blocks. • Subsequently all the data blocks are written to disk. • The client application does not need to track all blocks of data; HDFS directs the client to the nearest replica.

Why Hadoop 2.x? • Because of the following limitations of Hadoop 1.0: • In Hadoop 1.0, HDFS and MR are the core components, while other components are built around them. • 1. A single NameNode for the entire namespace of a cluster; it keeps all its file metadata in main memory, which puts a limit on the number of objects stored in the NameNode. • 2. Restricted to processing batch-oriented MapReduce jobs. • 3. MR is used for both cluster resource management and data processing, which is not suitable for interactive analysis. • 4. Hadoop 1.0 is not suitable for machine learning, graphs and other memory-intensive algorithms. • 5. Map slots may become full while reduce slots are empty, and vice versa, giving inefficient resource utilization. HDFS 2, used in Hadoop 2.0, consists of two major components: 1. Namespace service: takes care of file-related operations (create, read, write). 2. Block storage service: handles data node cluster management and replication. HDFS 2 uses: 1. Multiple independent NameNodes: the DataNodes are common storage shared by all NameNodes, and all DataNodes register with every NameNode in the cluster. 2. A passive standby NameNode.

4.3. Managing resources and applications with Hadoop YARN • YARN is a sub-project of Hadoop 2.x. • It is a general processing platform. • YARN is not constrained to MR alone. • Multiple applications can run in Hadoop 2.x, with all the applications sharing a common resource management layer (memory, CPU, network, etc.). • With YARN, Hadoop can do not only batch processing but also interactive, online, streaming, graph and other types of processing.

Daemons of YARN 1. Global Resource Manager: distributes resources among the various applications. It has two components: 1.1. Scheduler: decides the allocation of resources to running applications; it does no monitoring. 1.2. ApplicationsManager: accepts jobs and negotiates resources for executing the ApplicationMaster, which is specific to an application. • 2. NodeManager: it monitors the usage of resources and reports the usage to the Global Resource Manager. It launches 'application containers' for the execution of applications. • Every machine has one NodeManager.


• 3. Per-application ApplicationMaster: every application has one. It negotiates the required resources for execution from the Resource Manager and works along with the NodeManager to execute and monitor the component tasks.

• An application is a job submitted to the framework, e.g. a MapReduce job. • A container is the basic unit of allocation across multiple resource types, e.g. container_0 = 2 GB, 1 CPU; container_1 = 1 GB, 6 CPUs. Containers replace the fixed map/reduce slots of Hadoop 1.x, as sketched below.
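
A small sketch, assuming the Hadoop YARN client library, of how an ApplicationMaster could describe a container request such as container_0 above; the resource sizes and priority are illustrative:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;

    public class ContainerRequestExample {
        public static void main(String[] args) {
            // container_0 = 2 GB of memory and 1 virtual core
            Resource capability = Resource.newInstance(2048, 1);
            Priority priority = Priority.newInstance(0);

            // The request an ApplicationMaster sends to the ResourceManager's Scheduler;
            // null node and rack lists mean "run anywhere in the cluster"
            AMRMClient.ContainerRequest request =
                    new AMRMClient.ContainerRequest(capability, null, null, priority);
            System.out.println("Requested: " + request.getCapability());
        }
    }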

YARN architecture: steps • 1. The client program submits the application, which contains the specifications to launch the application-specific ApplicationMaster. • 2. The ResourceManager launches the ApplicationMaster by assigning it a container. • 3. The ApplicationMaster registers with the ResourceManager so that the client can query the ResourceManager for details. • 4. The ApplicationMaster negotiates appropriate resource containers via the resource-request protocol. • 5. After container allocation, the ApplicationMaster launches the container by providing the specifications to the NodeManager. • 6. The NodeManager executes the application code and provides status to the ApplicationMaster via an application-specific protocol. • 7. On completion of the application, the ApplicationMaster deregisters with the ResourceManager and shuts down; its containers can then be reused.


BDA - UNIT- V

SOCIAL MEDIA ANALYTICS AND TEXT MINING Topics Covered:

• 5.1. Social media analytics: 5.1.1. introduction to social media 5.1.2. key elements of social media 5.1.3. performing social media analytics • 5.2. Text mining: 5.2.1. understanding text mining process 5.2.2. sentiment analysis 5.2.3. opinion mining on tweets • 5.3. Mobile analytics: 5.3.1. introduction to mobile analytics 5.3.2. definition of mobile analytics 5.3.3. types of results from mobile analytics 5.3.4. types of applications for mobile analytics 5.3.5. introduction to mobile analytics tools • 5.4. Web analytics: 5.4.1. introduction to web analytics 5.4.2. web analytics & mobile analytics

5.1. Social media analytics 5.1.1.introduction to social media 5.1.2.key elements of social media 5.1.3.performing social media analytics

5.1.1. introduction to Social media

What is social media? Collection of different platforms. Definition.

Facebook • It is the biggest social media site, with more than two billion people using it every month. • That’s almost a third of the world’s population! • There are more than 65 million businesses using Facebook Pages and more than six million advertisers actively promoting their business on Facebook • It’s easy to get started on Facebook because almost all content format works great on Facebook — text, images, videos, live videos, and Stories. • But note that the Facebook algorithm prioritizes content that sparks conversations and meaningful interactions between people, especially those from family and friends. • 94 percent of Facebook’s users access Facebook via the mobile app.

YouTube • It is a video-sharing platform where users watch a billion hours of videos every day. • To get started, you can create a YouTube channel for your brand where you can upload videos for your subscribers to view, like, comment on, and share. • Besides being the second biggest social media site, YouTube (owned by Google) is also often known as the second largest search engine after Google.


WhatsApp • WhatsApp is a messaging app used by people in over 180 countries. • Initially, WhatsApp was only used by people to communicate with their family and friends. Gradually, people started communicating with businesses via WhatsApp. • WhatsApp has been building out its business platform to allow businesses to have a proper business profile, to provide customer support, and to share updates with customers about their purchases. • For small businesses, it has built the WhatsApp Business app, while for medium and large businesses there is the WhatsApp Business API.

Messenger • Messenger used to be a messaging feature within Facebook. • Since 2011, Facebook has made Messenger an independent app by itself and greatly expanded its features. • Businesses can now advertise, create chatbots, send newsletters, and more on Messenger. • These features have given businesses a lot of new ways to engage and connect with their customers.

Instagram • Instagram is a photo and video sharing social media app. • It allows you to share a wide range of content such as photos, videos, Stories, and live videos. • It has also recently launched IGTV for longer-form videos. • As a brand, you can have an Instagram business profile, which will provide you with rich analytics of your profile and posts and • the ability to schedule Instagram posts using third-party tools

Twitter: • It is a social media site for news, entertainment, sports, politics, and more. • it has a strong emphasis on real-time information — things that are happening right now. • Another unique characteristic of Twitter is that it only allows 280 characters in a tweet (140 for Japanese, Korean, and Chinese), unlike most social media sites that have a much higher limit. • Twitter collects personally identifiable information about its users and shares it with third parties as specified in its privacy policy. • The service also reserves the right to sell this information as an asset if the company changes hands. • While Twitter displays no advertising, advertisers can target users based on their history of tweets and may quote tweets in ads directed specifically to the user.

LinkedIn • LinkedIn is now more than just a resume and job search site. • It has evolved into a professional social media site where industry experts share content, network with one another, and build their personal brand. • It has also become a place for businesses to establish their thought leadership and authority in their industry and attract talent to their company. • LinkedIn also offers advertising opportunities, such as boosting your content, sending personalized ads to LinkedIn inboxes, and displaying ads by the side of the site.


BLOGS • A blog is a type of online personal space or website where an individual (or organization) posts content (text, images, videos, and links to other sites) and expresses opinions on matters of personal (or organizational) interest on a regular basis. • The most popular blogging platforms are http://www.wordpress.com and http://www.bloggers.com. • Mostly, blogging does not require technical know-how or programming skills, so ordinary users can easily build and manage a professional-looking blog. • Approximate monthly active users (MAUs) of the major platforms: Facebook – 2.23 billion MAUs; YouTube – 1.9 billion MAUs; WhatsApp – 1.5 billion MAUs; Messenger – 1.3 billion MAUs; WeChat – 1.06 billion MAUs; Instagram – 1 ...

What is Social Media analytics? How is it important for business intelligence? So far we have seen Social media platforms.

5.1.1. Introduction to social media analytics: social media for marketing • Social media marketing refers to the process of gaining the attention of potential consumers through social media sites. • Measuring the success of social media marketing campaigns is the only way to know the effectiveness of a campaign, so it is essential to have a tool that will help measure that effectiveness. • Social media analytics enables businesses to extract information from social media such as: how the public perceives their brand, what kinds of products consumers like and dislike, and generally what the market trend is. • This information from social media is in the form of free text and natural language, i.e. unstructured data. • Social media analytics is the practice of: 1. gathering data from social media websites and 2. analyzing that data using social media analytics tools to enable making intelligent business decisions. • The most common use of social media analytics is to assess customer sentiment to support marketing and customer service activities. • Tools are available to assess customer sentiment under text analytics (Sentrix).

5.1.2. Key elements of social media • Social media are a collection of technologies that enable people to listen, create, share, connect, amplify and measure content with one another. • As a result, five inter-related elements of social media are defined. • The five key elements of social media are: 1. Listening (Research): the process of searching and monitoring public conversations and shared content for mentions of brands, products, services, questions, or other keywords with the intention of identifying and understanding trends. Listening can be done manually by examining the news feeds of various sites or individuals to uncover valuable intelligence based on what they share. Alternatively, tools are available for listening; they provide trend analysis from popular conversations.


2. Content: Content marketing is the use of media, such as written text, pictures, videos, slideshows, etc., to explain the product and position a company or individual as knowledgeable and trusted. • Content marketing provides valuable information to the target audience, i.e. children, adults, students, sportspersons, etc. • Content can either be created or curated (collected and organized from the web).

3. Engagement: Engagement is the process of using a mixture of listening, content marketing and conversation skills to connect with individuals and solve problems directly, thereby building trust and loyalty. • Effective engagement management results in compassionate and responsive communication.

4. Promotion (Advertising): Promotion describes any activities that amplify content or solicit feedback or a response about a product, service, website or marketplace. • Promotion can involve: • offline activities such as business cards or posters, • or online activities such as display advertisements, social media ads, and paid search campaigns.

5.Measurement :(Metrics, KPIs, and Analytics) • Measurement and analytics should be applied to all of the other four elements of social media in order to understand and improve the effectiveness of each activity. • This can include • user demographic data or • interest data, • website traffic and behavior, • interactions and impressions on advertisements, and • any other activity that generates data. • these five elements of social media are incorporated into business strategy to accomplish business goal


5.1.3. Performing social media analytics

There are three main steps in analyzing social media: 1.data identification, 2. data analysis, 3.information interpretation.

Data identification • It is the process of identifying the subsets of available data to focus on for analysis. • To derive wisdom from unprocessed data, we need to start processing it, refine the dataset by including the data that we want to focus on, and organize the data to identify information. • In the context of social media analytics, data identification means deciding "what" content we are interested in; in addition to the text of the content, we want to know: who wrote the text? Where was it found, or on which social media venue did it appear? Are we interested in information from a specific locale? When did someone say something on social media?[5] • Type of content: text; photos (drawings, simple sketches, or photographs); audio (recordings of books, articles, talks, or discussions); videos (recordings, live streams). • Venue: a variety of venues such as news sites and social networking sites (e.g. Facebook, Twitter). Depending on the type of project, the venue becomes significant. • Time: it is important to collect data that was posted in the time frame being analyzed. • Ownership of data: is the data private or publicly available? Is there any copyright? Check before collecting data.

Data analysis • Data analysis is the set of activities that assist in transforming raw data into insight, • In other words, data analysis is the phase that takes filtered data as input and transforms that into information of value to the analysts. • Many different types of analysis can be performed with social media data.


• The data analysis step begins once we know what problem we want to solve and know that we have data sufficient to generate a meaningful result. • If, while analyzing, we find the data isn't sufficient, we modify the question. • If the data is sufficient for analysis, we build a data model.[5] • Developing a data model is a process or method that we use to organize data elements and standardize how the individual data elements relate to each other.

Examples – How many people mentioned Wikipedia in their tweets? – Which politician had the highest number of likes during the debate? – Which competitor is gathering the most mentions in the context of social business? – Machine Capacity: This analysis could be performed as real-time, near real-time, ad hoc exploration and deep analysis. – Real-time analysis in social media is an important tool when trying to understand the public's perception of a certain topic. – Ad hoc analysis is a process designed to answer a single specific question. The product of ad hoc analysis is typically a report or data summary. – A deep analysis implies an analysis that spans a long time and involves a large amount of data.

Information interpretation • At this stage, the form of presenting the data becomes important. • Visualization (graphics) of the information is preferred • The visualizations expose the underlying patterns and relationships contained in the data. • Exposure of the patterns play a key role in decision making process. • Visualization should package information into a structure that is presented as a narrative and easily remembered.

5.1.3. Performing social media analytics: the process,

• Analysts may define a questionnaire to be answered. • The important questions for analysis are: "Who? What? Where? When? Why? and How?" • These questions help in determining the proper data sources to evaluate, and affect the type of analysis that can be performed.[5]

Social Media Analytics Tools • Viralheat: It supports all major social media platforms like Facebook, Google, LinkedIn, Pinterest, YouTube, etc. • It is not free. It compares search terms across the web and displays the information in a graph or a pie chart. • Spreadfast: It supports all major social platforms. It is a scalable platform and can organize content for larger groups. • Sysomos: It supports all social platforms, blogs, and forums. It is a real-time monitoring tool that collects online conversations about your business and provides an insights report on them.


SproutSocial • It supports Facebook, Twitter, LinkedIn, and YouTube. • It allows you to find opportunities to engage in social conversation, publish your message on social media, and measure the performance of your social efforts. • UberVU • It supports Facebook, Twitter, and all other major platforms. You can use it to measure the social buzz about your business. • It provides actionable insights from the data. • It keeps track of all your audience in real time and lets you engage with your audience.

• Sentiment Analyser is a technology framework in the field of Social BI that leverages Informatica products. • It is designed to reflect and suggest the focus shift of businesses from transactional data to behavioral analytics models. • Sentiment Analyser enables businesses to understand customer experience and ideates ways to enhance customer satisfaction.[12]

Impacts on business intelligence • Recent research on social media analytics has emphasized the need to adopt a BI-based approach to collecting, analyzing and interpreting social media data.[13] • Social media presents a promising, though challenging, source of data for business intelligence. • Customers voluntarily discuss products and companies, giving a real-time pulse of brand sentiment and adoption. • Firms have created specialized positions to handle their social media marketing. • Social media activities are interrelated and influence each other.[16]

5.2. Text mining 5.2.1. understanding text mining process 5.2.2. sentiment analysis 5.2.3. opinion mining on tweets (Text analytics, already discussed in Unit 2, is presented first)

• Emails, online reviews, tweets, call center agent notes, survey results, and other types of written feedback hold insight into customers- they are all in textual form. • There is also a wealth of information in recorded interactions written in natural language that can easily be turned into text. • Text analytics unlocks the meaning from all of this unstructured text. It uncovers patterns and themes that reveal what the customers are thinking about, their wants and needs. • In addition, text analytics software provides an early warning of trouble, because it shows what customers are complaining about. • text analytics tools turn the unstructured thoughts of customers into structured data that can be used by business.

5.2.1. understanding text mining process • Text posted on social media is dynamic, huge, diverse, multilingual, • Text mining is the process of deriving high-quality information from text. • High-quality information is derived from patterns and trends created by means of statistical pattern learning etc • A typical application is to scan a set of documents written in a natural language and


i) model the document set for prediction purposes ii) or develop a database or search index with the information extracted.

Text Analytics process • Text mining involves structuring the input text by parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database; • deriving patterns within the structured data; and finally evaluation and interpretation of the output. • Typical text mining tasks include: • text categorization, • text clustering, • concept/entity extraction, • sentiment analysis, • document summarization, • learning relations between entities.

Understanding text mining process • finding the right source for the purpose of text analytics is very crucial for gaining useful business insights. • The genre of the source text also will determine the type of tool used • For example tweets require different tools and approaches than analyzing a document or website text. • Analyzing tweets requires API-based searching and extraction of data from the Twitter timeline based on criteria that you specify. • You can choose to extract tweets that include specific keywords, such as your company name. • The desired business question that needs to be answered with text analytics will serve as a good starting point.
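
A minimal sketch of keyword-based tweet extraction, assuming the third-party Twitter4J library and valid Twitter API credentials configured in twitter4j.properties; the search term is illustrative:

    import java.util.List;
    import twitter4j.Query;
    import twitter4j.QueryResult;
    import twitter4j.Status;
    import twitter4j.Twitter;
    import twitter4j.TwitterFactory;

    public class TweetExtractor {
        public static void main(String[] args) throws Exception {
            Twitter twitter = TwitterFactory.getSingleton();  // reads credentials from twitter4j.properties
            Query query = new Query("your company name");       // the keyword criteria you specify
            query.setCount(100);                                 // tweets per page of results
            QueryResult result = twitter.search(query);
            List<Status> tweets = result.getTweets();
            for (Status tweet : tweets) {
                System.out.println("@" + tweet.getUser().getScreenName() + ": " + tweet.getText());
            }
        }
    }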

TEXT ANALYSIS TOOLS • Discovertext: Discovertext (http://discovertext.com/) is a powerful platform for collecting, cleaning, and analyzing text and social media data streams. • Lexalytics: Lexalytics (http://www.lexalytics.com/) is a social media text and semantic analysis tool for social media platforms, including Twitter, Facebook, blogs, etc. • Tweet Archivist: Tweet Archivist (https://www.tweetarchivist.com/) is focused on searching, archiving, analyzing, and visualizing tweets based on a search term or hashtag (#). • Twitonomy: Twitonomy (https://www.twitonomy.com/) is a Twitter analytics tool for getting detailed and visual analytics on tweets, retweets, replies, mentions, hashtags, followers, etc. • Netlytic: Netlytic (https://netlytic.org) is a cloud-based text and social network analytics platform for social media text that discovers social networks from online conversations on social media sites. • LIWC: Linguistic Inquiry and Word Count (LIWC) is a text analysis tool for analyzing emotional, cognitive, structural, and process components present in individuals’ verbal and written speech samples: http://www.liwc.net/ • Voyant: Voyant (http://voyant-tools.org/) is a web-based text reading and analysis tool. With Voyant, a body of text can be read from a file or directly exported from a website.


Performing SMA and opinion mining on tweets • Twitter, described earlier, places a strong emphasis on real-time information and allows only 280 characters per tweet, which makes it a rich source for opinion mining. Sentiment analysis • Opinion mining, also called sentiment analysis, focuses on dynamic text. • Sentiment analysis is used to determine how customers feel about a particular product, service, or issue. • For example, a product manager might be interested to know how his customers on Twitter feel about his product/service that was recently launched. • Analyzing the tweets or Facebook comments may provide an answer to the question. • Using sentiment analysis, we will be able to extract the wording of the comments and determine whether they are positive, negative, or neutral. • Several analytical tools for sentiment (semantic) analysis are listed below.

Semantria • Semantria is a text sentiment analysis tool. • It will go through the following steps to extract sentiments from a document: • Step 1: It breaks the document into its basic parts of speech, called POS tags, which identify the structural elements of a sentence (e.g. nouns, adjectives, verbs, and adverbs). • Step 2: Algorithms identify sentiment-bearing phrases like “terrible service” or “cool atmosphere.” • Step 3: Each sentiment-bearing phrase earns a score based on a logarithmic scale ranging from negative ten to positive ten. • Step 4: Next, the scores are combined to determine the overall sentiment of the document or sentence. Document scores range between negative two and positive two.

• For example, to calculate the sentiment of a phrase such as “terrible service,” • Semantria uses search engine queries similar to the following: • “(Terrible service) near (good, wonderful, spectacular)” • “(Terrible service) near (bad, horrible, awful)” • Each response is added to a hit count; • these are then combined using a mathematical operation called “log odds ratio” to determine the final score of a given phrase.
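
A toy illustration (not Semantria's actual implementation) of how positive and negative hit counts can be combined with a log odds ratio to score a phrase:

    public class PhraseScore {
        // Combine hit counts for a phrase into a single sentiment score using a log odds ratio.
        // Add-one smoothing avoids division by zero; the exact scaling here is purely illustrative.
        static double logOddsScore(int positiveHits, int negativeHits) {
            double odds = (positiveHits + 1.0) / (negativeHits + 1.0);
            return Math.log(odds);
        }

        public static void main(String[] args) {
            // "terrible service" found near far more negative words than positive ones
            System.out.println(logOddsScore(2, 40));   // strongly negative (score < 0)
            System.out.println(logOddsScore(35, 3));   // strongly positive (score > 0)
        }
    }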


Opinion mining on tweets

• Opinion mining, which is also called sentiment analysis, involves building a system to collect and examine opinions expressed in blog posts, comments, reviews or tweets about a product. • The system performs classification on the corpus collected from Twitter. • Classification is done based on the features extracted, labeling each tweet as POSITIVE, NEGATIVE or NEUTRAL.
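
A minimal lexicon-based sketch (a simplification of real feature-based classifiers) that labels a tweet as POSITIVE, NEGATIVE, or NEUTRAL by counting opinion words; the word lists are illustrative:

    import java.util.Arrays;
    import java.util.List;

    public class TweetSentiment {
        // Tiny illustrative lexicons; real systems learn features from a labeled corpus.
        private static final List<String> POSITIVE = Arrays.asList("good", "great", "love", "excellent", "cool");
        private static final List<String> NEGATIVE = Arrays.asList("bad", "terrible", "hate", "awful", "poor");

        static String classify(String tweet) {
            int score = 0;
            for (String token : tweet.toLowerCase().split("\\W+")) {
                if (POSITIVE.contains(token)) score++;
                if (NEGATIVE.contains(token)) score--;
            }
            if (score > 0) return "POSITIVE";
            if (score < 0) return "NEGATIVE";
            return "NEUTRAL";
        }

        public static void main(String[] args) {
            System.out.println(classify("I love the new phone, great battery"));   // POSITIVE
            System.out.println(classify("Terrible service and awful support"));    // NEGATIVE
            System.out.println(classify("Just installed the app"));                // NEUTRAL
        }
    }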

• INTENTION MINING: Intention or intent mining aims to discover users' intentions such as buy, sell, recommend, quit, desire, or wish from natural language text such as user comments, product reviews, tweets, and blog posts. • The Semantria analytics tool can be used to mine intentions from tweets by detecting the presence of words like "buy", "purchase" or "quit". • Trend mining: exploits patterns in data by using statistical techniques, including machine learning, data mining, and social network analysis. • Predictive analysis is used in a variety of domains, including marketing, banking, telecommunication, and healthcare. • Concept mining: unlike text mining, which is focused on extracting information, concept mining extracts ideas from large document sets, such as wiki content, web pages, Word documents, and news transcripts. • Concept mining can be employed to classify, cluster, and rank ideas.


5.3. Mobile analytics

5.3.1.introduction to mobile analytics 5.3.2.definition of mobile analytics 5.3.3.types of results from mobile analytics 5.3.4. types of applications for mobile analytics 5.3.5. introduction to mobile analytics tools

MOBILE APPLICATIONS

• MOBILE APPLICATIONS are special-purpose software developed to perform certain tasks on the go. • Each app has a specific function and runs on specific mobile devices, such as smartphones, tablet computers, and smart watches. • Mobile devices use a special type of operating system called a mobile operating system (or mobile OS). Popular mobile OSes are Android (from Google), iOS (from Apple), Windows Phone (from Microsoft), and BlackBerry 10 (from BlackBerry). • Specific apps are developed for each mobile OS. • Most apps (but not all) are made available online for download through application distributors (or app stores), such as the Apple Store, Google Play, and the Amazon app store. • According to http://www.statista.com/, as of July 2014 there were 2.5 million apps available for download in the Apple Store and Google Play alone. • App stores also provide opportunities for users to comment on and rate apps. • WHAT IS MOBILE ANALYTICS? • Mobile analytics refers to two things: • 1) mobile web analytics and • 2) apps analytics. • 1. MOBILE WEB ANALYTICS • Mobile web analytics is mostly focused on the characteristics, actions, and behaviors of mobile website visitors; that is, the visitors to the mobile version of a company's website. • Companies collect and analyze a variety of mobile user data, including views, clicks, demographic information, and device-specific data (e.g., the type of mobile device used to access the website).

5.3.1. Introduction to mobile analytics

• Mobile analytics captures data from mobile app, mobile website, and web app visitors. • Mobile analytics is similar to traditional web analytics: it identifies unique visitors and records their behaviors. • There are three major types of mobile analytics: • Advertising/Marketing Analytics • In-App Analytics • Performance Analytics

Advertising/Marketing Analytics • The success of an app often depends on whether marketing campaigns are able to attract the right types of users –


• If the campaign was successful, you would see an increase in installs, engagements and financial metrics of the app

In-App Analytics
• To be successful, an app must satisfy the expectations of its users.
• In-app analytics is essentially "in-session" analytics: what users are actually doing inside the app and how they are interacting with it.
• Feature optimization is the primary focus.
• Examples of common in-app data that can be collected include (see the sketch after this list):
  • Device profile:
    • Type of device (mobile phone, tablet, etc.)
    • Manufacturer
    • Operating system (iOS, Android, Windows, etc.)
  • User demographics:
    • Location
    • Gender
    • New or returning user
    • Approximate age
    • Language
  • In-app behavior:
    • Event tracking (i.e., buttons clicked, ads clicked, purchases made, levels completed, articles read, screens viewed, etc.)
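A minimal sketch of what recording one such in-app event might look like on the client side; the event names, field layout, and the collect_event helper are illustrative assumptions rather than the API of any particular analytics SDK:

# Illustrative in-app event record: device profile + user demographics + behavior.
import json
import time

def collect_event(event_name, device, user, properties=None):
    """Build one analytics event record; a real SDK would queue and upload it."""
    return {
        "event": event_name,             # e.g. button_clicked, purchase_made
        "timestamp": time.time(),
        "device": device,                # device profile
        "user": user,                    # user demographics
        "properties": properties or {},  # event-specific details
    }

device = {"type": "mobile phone", "manufacturer": "ExampleCorp", "os": "Android"}
user = {"location": "Hyderabad", "language": "en", "returning": True}

event = collect_event("level_completed", device, user,
                      {"level": 3, "duration_sec": 42})
print(json.dumps(event, indent=2))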

Performance Analytics
• Users expect apps to work correctly and efficiently, and have little patience for underperformance.
• Performance analytics is generally concerned with two major measures:
  1. App uptime
  2. App responsiveness
• Factors that can impact the performance of an app: app complexity, hardware variation, available operating systems, and carrier/network.
• Examples of common performance analytics data that can be collected include:
  • API latency
  • Carrier/network latency
  • Data transactions
  • Crashes
  • Exceptions
  • Errors
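The two measures above can be made concrete with a small sketch that computes an uptime percentage and API latency percentiles from collected samples; all numbers, and the nearest-rank percentile helper, are invented for illustration:

# Illustrative performance-analytics calculations (all sample numbers made up).
def uptime_percent(up_checks, total_checks):
    """Share of health checks during which the app backend was reachable."""
    return 100.0 * up_checks / total_checks

def percentile(values, p):
    """Simple nearest-rank percentile; avoids external dependencies."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[k]

latencies_ms = [120, 95, 210, 130, 880, 101, 98, 150, 115, 99]  # API response times
print("Uptime:", uptime_percent(1438, 1440), "%")   # 2 failed checks out of 1440
print("Median latency:", percentile(latencies_ms, 50), "ms")
print("95th percentile latency:", percentile(latencies_ms, 95), "ms")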

5.4. Web analytics
• Reporting and analysis are the two core components of web analytics.
1. Introduction to web analytics
2. Web analytics data sources
3. Four steps for web analytics
4. Two categories of web analytics
5. Optimization of websites
6. Web analytics & mobile analytics


5.4.1. Introduction to web analytics

• Web analytics is a process for measuring web traffic: the number of visitors to a website and the number of page views.
• It is used to assess and improve the effectiveness of a website, for example by assessing popularity trends, which is useful for business and market research.
• It helps to estimate how traffic to a website changes after the launch of a new advertising campaign.

Four steps for web analytics
1. Collection of data: usually, these data are counts of things.
2. Processing of data into information: convert the counts into information, specifically ratios and metrics.
3. Developing KPIs: correlate the ratios and metrics with business strategies; these are referred to as key performance indicators (KPIs).
4. Formulating an online strategy: evolve strategies for realizing the online goals, objectives, and standards of the business, whether for making money or increasing market share.
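To make steps 2 and 3 concrete, the sketch below turns raw counts into ratio metrics and checks them against KPI targets; the counts and target values are invented for the example:

# Illustrative web-analytics pipeline: raw counts -> ratios/metrics -> KPI check.
counts = {
    "visits": 5000,            # step 1: collected counts
    "page_views": 14000,
    "single_page_visits": 2100,
    "purchases": 150,
}

metrics = {                    # step 2: turn counts into ratios and metrics
    "pages_per_visit": counts["page_views"] / counts["visits"],
    "bounce_rate": counts["single_page_visits"] / counts["visits"],
    "conversion_rate": counts["purchases"] / counts["visits"],
}

kpi_targets = {"bounce_rate": 0.40, "conversion_rate": 0.02}  # step 3: business KPIs

for name, target in kpi_targets.items():
    below_is_good = (name == "bounce_rate")
    ok = metrics[name] <= target if below_is_good else metrics[name] >= target
    print(f"{name}: {metrics[name]:.2%} (target {target:.0%}) ->",
          "OK" if ok else "NEEDS ATTENTION")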

Two categories of web analytics
• Off-site web analytics refers to web measurement and analysis regardless of whether you own or maintain a website.
• It includes the measurement of a website's potential audience (opportunity), share of voice (visibility), and buzz (comments) happening on the Internet as a whole.
• On-site web analytics measures a visitor's behavior on a website and the performance of the website.
• This includes its drivers and conversions; for example, the degree to which different landing pages are associated with online purchases.
• This data is typically compared against key performance indicators and used to improve a website's or marketing campaign's audience response.
• Google Analytics and Adobe Analytics are the most widely used on-site web analytics services.
• New tools are emerging that provide additional layers of information, including heat maps and session replay.

Web analytics data sources
• The data mainly comes from four sources:
• Direct HTTP request data: comes directly from HTTP request messages.
• Network-level and server-generated data associated with HTTP requests: for example, the IP address of a requester.
• Application-level data sent with HTTP requests: generated and processed by application-level programs (such as JavaScript, PHP, and ASP.NET), including sessions and referrals. These are captured in internal logs.
• External data: can be combined with on-site data to help augment the website behavior data described above and to interpret web usage. For example, IP addresses are usually associated with geographic regions and Internet service providers, or other data types as needed.
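As an illustration of the second source (network-level and server-generated data), the sketch below parses a few lines in the Apache-style common log format and counts successful page views per requesting IP address; the log lines themselves are fabricated:

# Illustrative parsing of server-generated HTTP request data (access log format).
import re
from collections import Counter

LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]+" (\d{3}) \S+')

sample_log = [
    '203.0.113.5 - - [10/Oct/2020:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2020:13:56:01 +0530] "GET /products.html HTTP/1.1" 200 5120',
    '198.51.100.7 - - [10/Oct/2020:14:02:12 +0530] "GET /index.html HTTP/1.1" 200 2326',
]

views_per_ip = Counter()
for line in sample_log:
    match = LOG_PATTERN.match(line)
    if match:
        ip, method, path, status = match.groups()
        if method == "GET" and status == "200":
            views_per_ip[ip] += 1

print(views_per_ip)  # successful page views per requester IP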

Web Analytics Tools
• Google Analytics: a free tool that any website owner can use to track and analyze data about web traffic.


• Other popular tools include Spring Metrics (an analytics tool made simpler), Woopra, Clicky, Mint, Chartbeat, Kissmetrics, and UserTesting.

How to find out if a web page uses Analytics
• The most common ways to check are built into most modern browsers.
• You can either view the source code, which instructs the browser what to load, or use browser-based developer tools to see if the page is sending information to Analytics.
• It is common for the Analytics JavaScript to be included directly on a web page, so you can see it in the source code.
• It is possible for a page to call Analytics from another source. In these cases, you won't see the JavaScript directly on the page.

How to check web analytics • Load a web page in the Chrome browser. Right-click the page, then click View page source. You should see a lot of code. Search the page for gtag.js or analytics.js (for Universal Analytics) or ga.js (for Classic Analytics).
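The same check can be scripted; the sketch below downloads a page's HTML source and searches it for the tag file names listed above. The URL is a placeholder, and, as noted earlier, a page that loads Analytics indirectly from another script will not be detected this way:

# Illustrative check for Google Analytics tag files in a page's HTML source.
import urllib.request

ANALYTICS_MARKERS = ["gtag.js", "analytics.js", "ga.js"]

def page_uses_analytics(url):
    """Fetch the page source and look for known Analytics script names."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return any(marker in html for marker in ANALYTICS_MARKERS)

if __name__ == "__main__":
    url = "https://www.example.com/"  # placeholder URL
    print(url, "uses Analytics:", page_uses_analytics(url))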

5.4.2. Web analytics & mobile analytics
• Mobile analytics is similar to traditional web analytics.
• The majority of modern smartphones are able to browse websites, some with browsing experiences similar to those of desktop computers.
• Data collected as part of mobile analytics typically includes:
  • page views, visits, visitors, and countries;
  • information specific to mobile devices, such as device model, manufacturer, screen resolution, device capabilities, service provider, and preferred user language.
• This data is compared against key performance indicators for performance and return on investment, and is used to improve a website's or mobile marketing campaign's audience response.

Mobile web
• The mobile web, also known as the mobile internet, refers to browser-based Internet services accessed from handheld mobile devices, such as smartphones or feature phones, through a mobile or other wireless network.
• Traditionally, the World Wide Web has been accessed via fixed-line services on laptops and desktop computers. However, the web is now more accessible through portable and wireless devices.
• Faster speeds, smaller, feature-rich devices, and a multitude of applications continue to drive explosive growth in mobile internet traffic.
• The W3C Mobile Web Initiative identifies best practices to help websites support mobile phone browsing.
• Many companies use these guidelines and mobile-specific code such as Wireless Markup Language or HTML5 to optimize websites for viewing on mobile devices.


UNIT WISE QUESTION BANK

UNIT 1 (S.No. / Question / CO / Bloom's Taxonomy Level)
1. List and discuss the four elements of Big Data. [CO 1, Remember]
2. As an HR Manager of a company providing Big Data solutions to clients, what characteristics would you look for while recruiting a potential candidate for the position of a Data Analyst? [CO 1, Analyse]
3. While implementing a marketing strategy for a new product in your company, identify and list some limitations of structured data related to this work. [CO 1, Analyse]
4. a) Why is distributed computing needed for Big Data? [CO 1, Understand]
   b) Compare parallel computing vs. distributed computing for Big Data. [CO 1, Analyse]
5. a) What are the various types of analytics? [CO 1, Understand]
   b) Why is Big Data analytics important? [CO 1, Understand]
6. Explain in detail the CAP theorem used for the Big Data environment. [CO 1, Understand]
7. a) Define the responsibilities of the Data Scientist. [CO 1, Remember]
   b) Write about BASE concepts to provide data consistency. [CO 1, Apply]


UNIT 2 (S.No. / Question / CO / Bloom's Taxonomy Level)
1. a) Discuss how Big Data has helped advanced analytics in creating a great analysis for different organizations. [CO 2, Understand]
   b) What are the roles of the IT and analytics teams in a Big Data analytics project? [CO 2, Understand]
2. a) Explain about reporting. [CO 2, Remember]
   b) Explain about the analytic process. [CO 2, Understand]
3. a) Explain operational analytics. [CO 2, Understand]
   b) State the characteristics of Big Data analytics. [CO 2, Remember]
4. a) Give some examples of ensemble algorithms. [CO 2, Remember]
   b) Define text data analysis. [CO 2, Understand]
5. a) What are analytical point solutions? [CO 2, Understand]
   b) Compare the various analytical tools. [CO 2, Analyse]
6. a) List some important features of IBM SPSS. [CO 2, Remember]
   b) Write about R programming tools with their features and limitations. [CO 2, Apply]


UNIT 3 (S.No. / Question / CO / Bloom's Taxonomy Level)
1. a) Write a short note on the Hadoop ecosystem. [CO 3, Apply]
   b) What is metadata? What information does it provide? [CO 3, Understand]
2. a) Write about the HDFS architecture in detail. [CO 3, Apply]
   b) What is the role of the NameNode in an HDFS cluster? [CO 3, Understand]
3. a) List out the features of HBase. [CO 3, Remember]
   b) Discuss the concept of regions in HBase. [CO 3, Understand]
4. a) List out the main features of the MapReduce framework. [CO 3, Remember]
   b) Describe the working of the MapReduce algorithm. [CO 3, Understand]
5. a) Discuss some techniques to optimize MapReduce jobs. [CO 3, Understand]
   b) Discuss the points you need to consider while designing a file system in MapReduce. [CO 3, Understand]
6. Discuss the role of HBase in Big Data processing. [CO 3, Understand]


UNIT 4 (S.No. / Question / CO / Bloom's Taxonomy Level)
1. a) Explain about the types of NoSQL databases. [CO 4, Understand]
   b) List out the advantages of NoSQL. [CO 4, Remember]
2. a) List out the key advantages of Hadoop. [CO 4, Remember]
   b) Give the differences between Hadoop and SQL. [CO 4, Analyse]
3. a) List out the advantages of NoSQL. [CO 4, Remember]
   b) Write short notes on SQL vs NoSQL. [CO 4, Apply]
4. Briefly explain the HDFS daemons. [CO 4, Understand]
5. a) Write about the anatomy of a file read. [CO 4, Apply]
   b) Write about the anatomy of a file write. [CO 4, Apply]
6. Explain about Hadoop 2 – YARN and its architecture in detail. [CO 4, Understand]


UNIT 5 (S.No. / Question / CO / Bloom's Taxonomy Level)
1. a) Write about the different forms of social media. [CO 5, Apply]
   b) List out the key elements of social media participation. [CO 5, Remember]
2. a) Describe the steps to perform text mining. [CO 5, Understand]
   b) Name some commonly used text mining software. [CO 5, Remember]
3. a) What do you understand by sentiment analysis? [CO 5, Understand]
   b) List some common online tools used to perform sentiment analysis. [CO 5, Remember]
4. a) Define mobile analytics and its primary goal. [CO 5, Remember]
   b) Discuss the various challenges of mobile analytics. [CO 5, Understand]
5. a) Write about mobile web analytics. [CO 5, Apply]
   b) Discuss in detail mobile application analytics. [CO 5, Understand]
6. Discuss in detail the various mobile analytical tools. [CO 5, Understand]
