Dzone-Guide-To-Big-Data.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
THE 2018 DZONE GUIDE TO Big Data STREAM PROCESSING, STATISTICS, & SCALABILITY VOLUME V BROUGHT TO YOU IN PARTNERSHIP WITH THE DZONE GUIDE TO BIG DATA: STREAM PROCESSING, STATISTICS, AND SCALABILITY Dear Reader, Table of Contents I first heard the term “Big Data” almost a decade ago. At that time, it Executive Summary looked like it was nothing new, and our databases would just be up- BY MATT WERNER_______________________________ 3 graded to handle some more data. No big deal. But soon, it became Key Research Findings clear that traditional databases were not designed to handle Big Data. BY G. RYAN SPAIN _______________________________ 4 The term “Big Data” has more dimensions than just “some more data.” It encompasses both structured and unstructured data, fast moving Take Big Data to the Next Level with Blockchain Networks BY ARJUNA CHALA ______________________________ 6 and historical data. Now, with these elements added to the data, some of the other problems such as data contextualization, data validity, Solving Data Integration at Stitch Fix noise, and abnormality in the data became more prominent. Since BY LIZ BENNETT _______________________________ 10 then, Big Data technologies has gone through several phases of devel- Checklist: Ten Tips for Ensuring Your Next Data Analytics opment and transformation, and they are gradually maturing. A term Project is a Success BY WOLF RUZICKA, ______________________________ that was considered as a fad and a technology ecosystem that was 13 considered a luxury are slowly establishing themselves as necessary Infographic: Big Data Realization with Sanitation ______ needs for today’s business activities. Big Data is the new competitive 14 advantage and it matters for our businesses. Why Developers Should Bet Big on Streaming BY JONAS BONÉR _______________________________ 16 The more we progress and the more automation we implement, data is Introduction to Basic Statistics Measurements always going to transform and grow. Blockchain technologies, Cloud, BY SUNIL KAPPAL _______________________________ 20 and IoT are adding new dimensions to the Big Data trend. Hats off the developers who are continually innovating and creating new Big Data Diving Deeper into Big Data _____________________ 23 Storage and Analytics applications to derive value out of this data. The Executive Insights on the State of Big Data fast-paced development has made it easier for us to tame fast-grow- BY TOM SMITH _________________________________ 26 DZONE.COM/GUIDES ing massive data and integrate our existing Enterprise IT infrastructure DZONE.COM/GUIDES with these new data sources. These successes are driven by both En- Big Data Solutions Directory ____________________ 30 terprises and Open Source communities. Without open source proj- Glossary __________________________________ 40 ects like Apache Hadoop, Apache Spark, and Kafka, to name a few, the landscape would have be entirely different. The use of Machine Learn- DZONE IS... BUSINESS EDITORIAL ing and Data visualization methods packaged for analyzing Big Data RICK ROSS, CEO CAITLIN CANDELMO, DIR. OF CONTENT & PRODUCTION MATT SCHMIDT, CHRIS SMITH, DIR. OF COMMUNITY is also making life easier for analysts and management. However, we PRESIDENT PRODUCTION still hear the failure of analytics projects more often than the success- JESSE DAVIS, EVP MATT WERNER, ANDRE POWELL, PUBLICATIONS COORD. SR. PRODUCTION SALES es. There are several reasons why. So, we bring you this guide, where COORDINATOR MATT O’BRIAN, DIR. OF MICHAEL THARRINGTON, BUSINESS DEV. G. RYAN SPAIN, CONTENT & COMMUNITY MANAGER these articles written by DZone contributors are going to provide you PRODUCTION ALEX CRAFTS, DIR. OF COORDINATOR MAJOR ACCOUNTS with more significant insights into these topics. KARA PHELPS, CONTENT & ASHLEY SLATE, DESIGN JIM HOWARD, SR COMMUNITY MANAGER DIR. ACCOUNT EXECUTIVE MIKE GATES, SR. CONTENT JIM DYER, ACCOUNT BILLY DAVIS, PRODUCTION COORD. EXECUTIVE The Big Data guide is an attempt to help readers discover and help un- ASSISSTANT ANDREW BARKER, SARAH DAVIS, CONTENT derstand the current landscape of the Big Data ecosystem, where we MARKETING ACCOUNT EXECUTIVE COORD. KELLET ATKINSON, DIR. OF TOM SMITH, RESEARCH MARKETING BRIAN ANDERSON, stand, and what amazing insights and applications people are discov- ACCOUNT EXECUTIVE ANALYST LAUREN CURATOLA, RYAN McCOOK, ACCOUNT JORDAN BAKER, CONTENT ering in this space. We wish that everyone who reads this guide finds it MARKETING SPECIALIST EXECUTIVE COORD. KRISTEN PAGÀN, CHRIS BRUMFIELD, SALES ANNE MARIE GLEN, worthy and informative. Happy reading! MARKETING SPECIALIST MANAGER CONTENT COORD. NATALIE IANNELLO, TOM MARTIN, ACCOUNT ANDRE LEE-MOYE, MARKETING SPECIALIST MANAGER By Sibanjan Das CONTENT COORD. JULIAN MORRIS, JASON BUDDAY, ACCOUNT BUSINESS ANALYTICS & DATA SCIENCE CONSULTANT & DZONE ZONE LEADER MARKETING SPECIALIST MANAGER THE DZONE GUIDE TO BIG DATA: STREAM PROCESSING, STATISTICS, AND SCALABILITY PAGE 2 OF 41 THE DZONE GUIDE TO BIG DATA: STREAM PROCESSING, STATISTICS, AND SCALABILITY Executive Summary BY MATT WERNER PUBLICATIONS COORDINATOR, DZONE Classically, Big Data has been defined by three V’s: Volume, or how most popular vendor (70%) followed by Google Cloud (57%) and much data you have; Velocity, or how fast data is collected; and Microsoft Azure (39%). Variety, or how heterogeneous the data set is. As movements like the IMPLICATIONS Internet of Things provide constant streams of data from hardware, The percentage of respondents using cloud solutions increased by and AI initiatives require massive data sets to teach machines to 8% in 2018, while on-premise and hybrid storage decreased by 6% think, the way in which Big Data is stored and utilized continues and 4%, respectively. As cloud solutions become more and more to change. To find out how developers are approaching these ubiquitous, the prospect of storing data on external hardware challenges, we asked 540 DZone members to tell us about what becomes more appealing as a way to decrease costs. Another reason tools they’re using to overcome them, and how their organizations for this increase may be in the abilities of some tools to process data, measure successful implementations. such as AWS Kinesis. THE PAINS OF THE THREE V’S RECOMMENDATIONS DATA Not every organization may need a way to store Big Data if they Of several data sources, files give developers the most trouble do not yet have a strong use case for it. When a business strategy when it comes to the volume and variety of data (47% and 56%, around Big Data is created, however, a cloud storage solution respectively), while 42% of respondents had major issues with the requires less up-front investment for smaller enterprises, though if your business deals in sensitive information, a hybrid solution would DZONE.COM/GUIDES speed at which both server logs and sensor data were generated. be a good compromise, or if you need insights as fast as possible you The volume of server logs was also a major issue, with 46% of may need to invest in an on-premise solution to access data quickly. respondents citing it as a pain point. BLINDED WITH DATA SCIENCE IMPLICATIONS DATA As the Internet of Things takes more of a foothold in various industries, The biggest challenges in data science are working with unsanitized the difficulties in handling all three V’s of Big Data will increase. More or unclean data (64%), working within timelines (40%), and limited and more applications and organizations are generating files and training and talent (39%). documents rather than just values and numbers. IMPLICATIONS RECOMMENDATIONS The return on investment of analytics projects is mostly focused on A good way to deal with the constant influx of data from remote the speed of decision-making (47%), the speed of data access (45%), hardware is to use solutions like Apache Kafka, an open source and data integrity (31%). These KPIs all contribute to the difficulty of dealing with unclean data and project timelines. Insights are project built to handle the processing of data in real-time as it is demanded quickly, but data scientists need to take time to ensure collected. Currently, 61% of survey respondents are using Kafka, and data quality is good so those insights are valuable. we recently released our first Refcard on the topic. Using document store databases, such as MongoDB, are recommended to handle files, RECOMMENDATIONS documents, and other semi-structured data. Project timelines need to be able to accommodate the time it takes to prepare data for analysis, which can range from omitting sensitive DATA IN THE CLOUD information to deleting irrelevant values. One easy way to do this is to DATA sanitize user inputs, keeping users from adding too much nonsense to 39% of survey respondents typically store data in the cloud, compared your data. Our infographic on page 14 can give you a fun explanation to 33% who store it on-premise and 23% who take a hybrid approach. of why unclean and unsanitized data can be such a hassle and why it’s Of those using the cloud or hybrid solutions, Amazon was by far the important to gleaning insights. THE DZONE GUIDE TO BIG DATA: STREAM PROCESSING, STATISTICS, AND SCALABILITY PAGE 3 OF 41 THE DZONE GUIDE TO BIG DATA: STREAM PROCESSING, STATISTICS, AND SCALABILITY Key Research Findings BY G. RYAN SPAIN PRODUCTION COORDINATOR, DZONE DEMOGRAPHICS for some time as an open source implementation of S, a language • 540 software professionals completed DZone’s 2018 Big Data sur- specifically designed for statistical analysis. And while R still maintains vey. Respondent demographics are as follows: popularity (in the TIOBE index, it moved from the 16th place ranking in January 2017 to the 8th place ranking in 2018), Python’s use in data • 42% of respondents identify as developers or engineers, and 23% science and data mining projects has been steadily increasing. Last identify as developer team leads. year, respondents to DZone’s Big Data survey revealed that Python had • 54% of respondents have 10 years of experience or more; 28% have overcome R as the predominant language used for data science, though 15 years or more.