Big Data Analytics Using Splunk

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them.

Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
■ Chapter 1: Big Data and Splunk
■ Chapter 2: Getting Data into Splunk
■ Chapter 3: Processing and Analyzing the Data
■ Chapter 4: Visualizing the Results
■ Chapter 5: Defining Alerts
■ Chapter 6: Web Site Monitoring
■ Chapter 7: Using Log Files To Create Advanced Analytics
■ Chapter 8: The Airline On-Time Performance Project
■ Chapter 9: Getting the Flight Data into Splunk
■ Chapter 10: Analyzing Airlines, Airports, Flights, and Delays
■ Chapter 11: Analyzing a Specific Flight Over the Years
■ Chapter 12: Analyzing Tweets
■ Chapter 13: Analyzing Foursquare Check-Ins
■ Chapter 14: Sentiment Analysis
■ Chapter 15: Remote Data Collection
■ Chapter 16: Scaling and High Availability
■ Appendix A: The Performance of Splunk
■ Appendix B: Useful Splunk Apps
Index

CHAPTER 1

Big Data and Splunk

In this introductory chapter we will discuss what big data is and different ways (including Splunk) to process big data.

What Is Big Data?

Big data is, admittedly, an overhyped buzzword used by software and hardware companies to boost their sales. Behind the hype, however, there is a real and extremely important technology trend with impressive business potential. Although big data is often associated with social media, we will show that it is about much more than that. Before we venture into definitions, however, let’s have a look at some facts about big data.
Back in 2001, Doug Laney from Meta Group (an IT research company acquired by Gartner in 2005) wrote a research paper in which he stated that e-commerce had exploded data management along three dimensions: volumes, velocity, and variety. These are called the three Vs of big data and, as you would expect, a number of vendors have added more Vs to their own definitions.

Volume is the first thought that comes with big data: the big part. Some experts consider Petabytes the starting point of big data. As we generate more and more data, we are sure this starting point will keep growing. However, volume in itself is not a perfect criterion of big data, as we feel that the other two Vs have a more direct impact.

Velocity refers to the speed at which the data is being generated or the frequency with which it is delivered. Think of the stream of data coming from the sensors in the highways in the Los Angeles area, or the video cameras in some airports that scan and process faces in a crowd. There is also the click stream data of popular e-commerce web sites.

Variety is about all the different data and file types that are available. Just think about the music files in the iTunes store (about 28 million songs and over 30 billion downloads), or the movies in Netflix (over 75,000), the articles in the New York Times web site (more than 13 million starting in 1851), tweets (over 500 million every day), foursquare check-ins with geolocation data (over five million every day), and then you have all the different log files produced by any system that has a computer embedded.

When you combine these three Vs, you will start to get a more complete picture of what big data is all about.

Another characteristic usually associated with big data is that the data is unstructured. We are of the opinion that there is no such thing as unstructured data.
We think the confusion stems from a common belief that if data cannot conform to a predefined format, model, or schema, then it is considered unstructured. An e-mail message is typically used as an example of unstructured data. Although the body of the e-mail could be considered unstructured, it is part of a well-defined structure that follows the specifications of RFC 2822 and contains a set of fields that include From, To, Subject, and Date. The same holds for Twitter messages: the body of the message, or tweet, can be considered unstructured, yet it is part of a well-defined structure. In general, free text can be considered unstructured, because, as we mentioned earlier, it does not necessarily conform to a predefined model. Depending on what is to be done with the text, there are many techniques to process it, most of which do not require predefined formats.

Relational databases impose the need for predefined data models with clearly defined fields that live in tables, which can have relations between them. We call this Early Structure Binding, in which you have to know in advance what questions are to be asked of the data, so that you can design the schema or structure and then work with the data to answer them.

As big data tends to be associated with social media feeds that are seen as text-heavy, it is easy to understand why people associate the term unstructured with big data. From our perspective, multistructured is probably a more accurate description, as big data can contain a variety of formats (the third V of the three Vs).

It would be unfair to insist that big data is limited to so-called unstructured data. Structured data can also be considered big data, especially the data that languishes in secondary storage hoping to make it some day to the data warehouse to be analyzed and expose all the golden nuggets it contains.
The main reason this kind of data is usually ignored is its sheer volume, which typically exceeds the capacity of data warehouses based on relational databases.

At this point, we can introduce the definition that Gartner, an Information Technology (IT) consultancy, proposed in 2012: “Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and processes optimization.” We like this definition, because it focuses not only on the actual data but also on the way that big data is processed. Later in this book, we will get into more detail on this.

We also like to categorize big data, as we feel that this enhances understanding. From our perspective, big data can be broken down into two broad categories: human-generated digital footprints and machine data. As our interactions on the Internet keep growing, our digital footprint keeps increasing. Even though we interact on a daily basis with digital systems, most people do not realize how much information even trivial clicks or interactions leave behind.

We must confess that before we started to read Internet statistics, the only large numbers we were familiar with were the McDonald’s slogan “Billions and Billions Served” and the occasional exposure to U.S. politicians talking about budgets or deficits on the order of trillions. Just to give you an idea, we present a few Internet statistics that show the size of our digital footprint. We are well aware that they are obsolete as we write them, but here they are anyway:

• By February 2013, Facebook had more than one billion users, of which 618 million were active on a daily basis. They shared 2.5 billion items and “liked” another 2.7 billion every day, generating more than 500 terabytes of new data on a daily basis.
• In March 2013, LinkedIn, which is a business-oriented social networking site, had more than 200 million members, growing at the rate of two new members every second, which generated 5.7 billion professionally oriented searches in 2012.
• Photos are a hot subject, as most people have a mobile phone that includes a camera. The numbers are mind-boggling. Instagram users upload 40 million photos a day, like 8,500 of them every second, and create about 1,000 comments per second. On Facebook, photos are uploaded at the rate of 300
