REAL TIME SEARCHING OF BIG DATA USING HADOOP, LUCENE, AND SOLR

Dibyendu Bhattacharya
Principal Software Engineer
RSA, the Security Division of EMC
[email protected]

Table of Contents
Introduction
Understanding Big Data
Power of Parallelism
Parallel Processing Challenges
Hadoop Overview
HDFS
MapReduce Engine
Hadoop Data Flow
Information Retrieval Overview
Search Engine Fundamental
Lucene Search Engine Library
Solr Server
Big Data Indexing
Text Acquisition
Text Transformation
Solr Index Mapper
Index Creation
Solr Index Driver
Solr Index Reducer
What is ZooKeeper?
Index Merger
Zookeeper Integration
Architecture Scalability
Performance Tuning
Hadoop Tuning
Solr Tuning
Final Execution
Summary

Disclaimer: The views, processes, or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.


Introduction
This Knowledge Sharing article contains an overview of how the distributed computing framework Hadoop can be integrated with the powerful search engine Solr. The enormous explosion of data in today's digital world requires Hadoop's capability to process Big Data along with Solr's indexing capability. Building the index of Big Data in parallel and storing it in Solr Cloud will enable enterprises to easily search Big Data.

Both Hadoop and Solr are very popular and successful open source projects widely used in large-scale distributed systems. This article will provide the reader with a fair amount of knowledge of how the Hadoop framework works, the concept of MapReduce, how to perform large-scale distributed indexing using Solr, and how to use a distributed consensus service such as ZooKeeper to coordinate all these.

Understanding Big Data
We are living in the data age. Big Data is defined as amounts of data so large that they become impossible to manage in traditional data warehouse and database management systems. Big Data creates major challenges for enterprises to store, share, manipulate, search, and perform analytics on this data set. McKinsey Global Institute (MGI) studied Big Data in five domains: healthcare in the United States, the public sector in Europe, retail in the United States, manufacturing, and global personal-location data.1 In addition, an International Data Corporation (IDC) survey projects that data will grow 50 times in the next 10 years.2

Now this huge data set is more than a challenge for an organization; it presents an opportunity to be exploited. Technology advancement in the area of Big Data Analytics gives enterprises opportunities like never before to poke at this data, mine it, uncover trends, perform statistical analysis, and drive business with better decision-making capabilities. EMC has estimated the Big Data market at $70 billion.3

Power of Parallelism
To analyze huge volumes of data, we need to process the data in parallel. Without a robust parallel processing framework, analyzing and crunching Big Data is almost impossible. Let's look at a simple example of why we need parallelism.

1 http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
2 http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
3 http://www.forbes.com/sites/greatspeculations/2011/12/13/emc-riding-big-data-wave-to-42-stock-price/


A commodity computer can read about 50 MB/second from hard disk and store 10 terabytes (TB) of data. Let's assume we need to process a job on big data that is 1 TB. A job is basically some processing on a given data set, anything from a set of log files collected from distributed data centers to huge data produced for indexing. With a commodity computer performing a sequential read of the data from a single drive, the read part itself would take roughly 330 minutes (about 5.5 hours)! Given that a computer with a large amount of memory and CPU processing power can process the data very fast, the I/O operation becomes the bottleneck.

The obvious solution is parallel reading.

Instead of reading the 1 TB of data from a single computer, if we read using 500 computers in parallel, we can read at 25 GB/second (500 x 50 MB/second)!

Parallel Processing Challenges
But parallel processing comes with lots of challenges. Let's try to understand some of the key ones.

If there are 500 computers (nodes) reading data in parallel, what will be the source of the data? Will all nodes point to the same source? How do we then split the data so that a specific node reads only specific splits? Do we need to transfer the data to the nodes over the wire? How much bandwidth is needed to transfer such huge data? Is there any data sharing among nodes? So many questions…

One way we can solve this is by distributing and storing the data locally on the nodes and moving the computation logic to the node where the data is available. For a massively parallel system, sharing of data can be a costly affair, which may lead to serious synchronization problems and deadlocks. A shared-nothing design between nodes simplifies the parallelism. Let us assume that data will be distributed somehow to all those nodes and job execution will work on the local data. We need to ask a few more questions. What will happen if a node crashes? Is there any way we can recover the data that was on that node?

The answer is yes, if we replicate the same data to more than one node. Then, if one node crashes, the replica node has the backup. But replication has its own challenge: consistency. How do we make sure the replicated data is consistent across nodes? There is a famous theorem by Eric Brewer, computer scientist at the University of California, Berkeley, which says that, among three properties of a distributed shared data store (data consistency, system availability, and tolerance to network partitioning), one can achieve only two. This is called the CAP


(Consistency, Availability, Partition tolerance) theorem.4 Any distributed system must tolerate network partitioning; otherwise, we cannot scale the system. This means we have to compromise on either data consistency or system availability.

You can see the challenges of a parallel processing system and the need for a fault-tolerant, distributed parallel computing framework which can address all of them. Google made the first breakthrough in solving this massively parallel computing problem and published papers on the Google File System (2003) and MapReduce (2004). Doug Cutting took the ideas from these papers and started building a similar framework in Java called Hadoop, which became a top-level Apache project (http://hadoop.apache.org/) and open source software.

Hadoop Overview
Let's see what Hadoop is and how it solves the parallel processing challenges.

The Hadoop framework consists of two subsystems:

1. Hadoop Distributed File System (HDFS)
2. MapReduce Engine

Figure 1 gives an overview of the Hadoop HDFS layer and the MapReduce layer. We will discuss JobTracker, TaskTracker, Data Node, and Name Node functionality shortly.

Figure 1

4 http://en.wikipedia.org/wiki/CAP_theorem


HDFS
For a distributed parallel processing system to work across multiple nodes, Hadoop needs a distributed file system. The Hadoop Distributed File System (HDFS) is block-level storage where each file is divided into equal-size blocks (default 64 MB) and distributed across nodes in the cluster. The HDFS block size is kept very large compared to a traditional file system so that the time spent transferring data dominates the time spent seeking it, since seek time is pure overhead for any I/O operation.

In Hadoop, the Data Node is responsible for storing the data. Each node in a Hadoop cluster runs one Data Node. Assume we are trying to load a 1 GB file into HDFS where the block size is 64 MB. In this case the file will be divided into 16 blocks. Each block will be replicated to three (by default) or more Data Nodes. As you can see, a large file is distributed among blocks, and blocks can span nodes. HDFS is a rack-aware file system, which means block replicas can be placed in the same rack or across racks in a data center to prevent data loss due to rack failure.

There is another component in the HDFS layer called the Name Node, which coordinates access to the Data Nodes and stores the metadata about the file system (i.e., which block resides on which Data Node). The Name Node metadata is kept entirely in memory for faster lookup, and the Name Node is the single point of failure for a Hadoop cluster. If the Name Node goes down, the entire HDFS file system goes down.

MapReduce Engine
With data stored across multiple nodes, Hadoop needs a data processing engine to process the data in parallel. The Hadoop MapReduce engine is a functional programming model for doing such a job.

MapReduce consists of two parts. The first part is called Map, which processes the input data as a Key Value pair and produces an intermediate list of Key Value pairs.

Map Function: Map (k1, v1) → list (k2, v2)

The Map function is applied to every item in the input data set in parallel, producing the intermediate key/value pairs.

The MapReduce framework then collects all pairs with the same key and groups them together, creating one group for each generated key, which is given to Reduce.


Reduce Function: Reduce (k2, list (v2)) → list (v3)

The Reduce function works on the output of the Map function. It also works in parallel and on the groups produced by the MapReduce engine and generates the final output value list.

Map and Reduce tasks do not share any data; each works only on its given data set, which is what allows them to run in parallel on multiple nodes.

For solving any Big Data related problem, we need to define the data processing logic such that it can be fitted into the Map Reduce paradigm. If the data can be processed using Map and Reduce task, it can be parallelized in the Hadoop framework.

Figure 2 depicts the Map Reduce process for given data.

Figure 2

As you can see, the data is divided into two data sets. There is one Map task for each data set in the figure above. Each Map task emits three intermediate keys (key1, key2, and key3) and associated values. The MapReduce engine groups the same key along with its values from the different Mappers and gives them to a Reduce task. There are three Reduce tasks; each gets one of the intermediate keys along with its value list. The Reduce tasks work on the value list for the given key and produce a final result.

Let’s look at a simple word count example and how it can be done using Map and Reduce task.

Here, the input file contains one line, “how much wood would a woodchuck chuck if a woodchuck could chuck wood”.

The Map Function emits the (word, 1) for every word it encounters in the line.

The Reduce function takes the intermediate keys (the words themselves) and their values (lists of ONEs) as input; e.g., for the word "chuck", the input to Reduce is (chuck, {1,1}) as the word "chuck" occurs twice. The Reduce function simply sums the list of values (the list of ONEs), and the output is the word count for that word. A minimal sketch of such a Mapper and Reducer in Java follows.
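The sketch below uses the classic org.apache.hadoop.mapred API (the style used elsewhere in this article); class names are illustrative, not code from the original paper.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: for every word in the input line, emit (word, 1).
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);   // intermediate pair (k2, v2)
        }
    }
}

// Reduce: sum the list of ONEs for each word, e.g. (chuck, {1,1}) -> (chuck, 2).
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        output.collect(word, new IntWritable(sum));   // final (k2, v3)
    }
}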

Hadoop Data Flow
Let us now see how the data flows within the Hadoop framework when a client submits a job. A job in Hadoop contains the logic to perform the Map and Reduce tasks on the input data. HDFS holds the data in Data Nodes as described earlier. When the client submits the job, it finds the details of the data available for the given job by querying the Name Node and splits the data; those splits are called input splits. The input split size is normally kept the same as the block size (default 64 MB) of HDFS. E.g., if the data is spread across 1000 blocks, 1000 splits will be created. The number of Map tasks Hadoop produces is the same as the number of splits. The number of Reduce tasks can be configured.


When a job is submitted to the MapReduce engine, the JobTracker pushes tasks to the TaskTrackers. The JobTracker has the details of the available input splits. The TaskTracker runs on the same node where the input splits/data blocks reside. The TaskTracker then spawns task instances and performs the Map and Reduce tasks on every input split. The JobTracker keeps the status of each task, and if any TaskTracker fails, that part of the work is rescheduled.

For example, if the job size is 200 GB and the HDFS block size is 64 MB, the total number of blocks will be 3200. If we keep the input split size the same as the block size, the total number of splits will also be 3200. Thus, the Hadoop MapReduce engine will produce 3200 Map tasks, one for each individual split. As splits are spread across Hadoop cluster nodes (Data Nodes) and a TaskTracker runs on the same node where the data is available, multiple Map tasks can run in parallel, independent of each other. Once all Map tasks are over, the output of the Map tasks is given to the Reduce tasks, which continue processing the data and produce the final output.

We will see the Map and Reduce task for doing parallel indexing shortly. Before that, let’s jump to the concept of search engine and information retrieval which will give us some idea about the indexing process and how it can be applied to Hadoop.

Information Retrieval Overview
By now we have learned the challenges of distributed computing and how Hadoop solves them. We also discussed the various constructs of the Hadoop framework and their functionality. Let's now switch to another very interesting and challenging domain of computer science: information retrieval and search engines.

Search and Information Retrieval (IR) is the most common activity we do every day on the Internet. A huge amount of research has been done in the area of IR, specifically on retrieving information from unstructured data such as documents and text. The sources of this unstructured data are very broad: log files, emails, news feeds, web pages, books, blog posts, and so on. Any artifact which can be searched is referred to as a document, which basically contains some common set of information or fields. For a log file, for example, an IR document may contain date and time, severity level, log message, and log source as the common fields which can be searched.

There are many differences between database records and IR documents. The main difference is that relational data is more structured in nature, so query/compare/retrieve operations can be performed more easily than on unstructured data, such as a log file.

Search Engine Fundamental
A search engine is an application of IR techniques. The major challenges in search engines are:

• Relevance: Are you getting what you searched for?
• Performance and Scalability: Can the search engine scale with large volumes of data?
• Incorporating new data: Is the search engine able to retrieve newly added data?
• Adaptability: Can the search engine be tuned per application needs?

To improve search performance and accuracy, search engines use an index, which is a data structure that stores data for faster retrieval. Designing the index data structure is the key to search engine performance.

Any indexing process consists of the following phases5.

• Text Acquisition: Identify and store documents for indexing.
• Text Transformation: Transform documents into index terms and features.
• Index Creation: Take index terms and create the index data structure.

We will discuss these phases in detail later in this article. The following diagram shows the three phases of the indexing process.

5 http://www.search-engines-book.com/


The three phases of the indexing process

Let us now pause for a moment and recall our original goal. We are trying to build the index on Big Data using Hadoop, Solr, and Lucene. Obviously, we cannot perform the above three steps of Text Acquisition, Text Transformation, and Index Creation in sequence over the huge data set; that would take a long time to give us the final index. We somehow need to bring parallelism into this whole process. We will see how this can be achieved when we dig deep into the implementation. For now, let's jump to the architecture of Lucene and an understanding of Solr, which are, respectively, the most popular and widely used search engine library and search server for information retrieval.

Lucene Search Engine Library
Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.6

Text search is a feature where users need to find the documents that a given piece of text belongs to. This is based on the concept of the inverted index, where words are mapped to the documents that contain them. While retrieving documents for the searched words, the search engine should return them in order of relevance. For calculating the relevance of a document, the search engine uses a ranking algorithm which computes a score for the document.

6 http://lucene.apache.org/java/docs/index.html


This figure describes the simple inverted index concept. Here we have three documents with IDs 1, 2, and 3 containing some text. During the indexing process, the text of each document is broken into terms (words). Each term is then mapped to the documents in which it appears.

A Lucene index contains multiple segments, and each segment contains multiple documents. The document is the key construct of information retrieval; it contains fields. Fields of a document have a name and an associated value. The text value of each field is broken into terms during the indexing process, as seen in the example above. As we add documents to the index, new segments are written, and segments are merged based on certain configuration settings. During the index creation process, Lucene also keeps a certain number of documents in memory and flushes them to a new segment on disk when the number of documents in memory reaches a threshold. The performance of the search engine depends on the number of segments: the fewer segments, the better the search performance. Tuning Lucene to perform optimally with a large data set comes down to how many documents we keep in memory and the threshold at which Lucene merges segments. These concepts will be helpful for understanding the performance of the indexing process.

Solr Server
Lucene is a search library, whereas Solr is the web application built on top of Lucene which simplifies the use of the underlying search features.

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication. It powers the search and navigation features of many of the world's largest Internet sites.7

Big Data Indexing
Thus far, we have learned about Solr/Lucene for searching/indexing and Hadoop for parallel processing. We will now see how to integrate these two technologies for high-performance, scalable indexing of Big Data.

The following technology stack is used for implementing Big Data Indexing on Hadoop cluster. (We are not going into details of installing each of them. Information can be found on the Internet on how to configure the stack in your development environment.)

• Java 1.6
• hadoop-0.21.0: For data-intensive, fault-tolerant distributed computing
• apache-tomcat-7.0.22: Web server and servlet engine for hosting Solr
• apache-solr-1.4.1: Search platform
• zookeeper-3.3.4: Distributed consensus service

We are trying to index Log data using Hadoop and Solr. For simplicity, assume the following is our Log entry structure.

Log ID  | Log Code | Log Partition | Log Body
ABCD123 | 9980     | 10.10.10.10   | Some Message

• The Log ID uniquely identifies one log entry and is searchable.
• The Log Code is some identifier of a log entry which is searchable but may not be unique.
• The Log Partition partitions a log entry based on, for instance, the origin of the log file (e.g., log from Server 1 or Server 2). This is also searchable. This partition will be used by Hadoop to distribute the log document to the respective Reduce task.
• The Log Body contains the log message details. This is also searchable.

Each of these four fields is separated by one tab space.

One Log file may contain hundreds of thousands of such entries separated by a new line.

Recalling the Indexing process we described earlier, let’s try to map each of the phases into the Hadoop system.

7 http://lucene.apache.org/solr/


Text Acquisition
This is the phase which identifies and acquires the documents for the search engine. Assume that you have a large data center with hundreds of servers which produce terabytes of log files that need to be indexed. The Text Acquisition system can be some kind of distributed log collection and merging framework which collects and consolidates log files from distributed servers. Text Acquisition uses a document/data store to store the files. In Hadoop, we use HDFS for storing large data files. We can think of the Text Acquisition component as a log collection and merging framework which collects the logs from various sources, merges them into a large log file, and then pushes it into HDFS for further processing.

Remember, Hadoop is not efficient for small files. Hadoop performs optimally when the file size is at least 64 MB, i.e., the HDFS block size. If we push millions of small log files into HDFS, performance will degrade. Hadoop produces one Map task for every input split. Since a small log file of 5 MB cannot even consume 10% of the default HDFS block size of 64 MB, a huge amount of block space will be wasted, and there will be millions of Map tasks, one per block, each operating on a tiny input split. Effectively, the I/O and task-scheduling overhead will be much larger than the actual processing of data. Also, Hadoop stores the metadata of each block in Name Node memory. If we assume that every log file fits into one block with very little space utilization, and every file and its block together consume on the order of 300 bytes of Name Node memory, 10 million log files will consume about 3 GB of Name Node memory. Scaling much beyond this limit on present hardware configurations is not practical. HDFS loaded with billions of small files will definitely not work.

To solve this small file problem, there are various open source scalable tools available to collect and merge logs. However, this article does not cover log collection and merging techniques.

To simulate the Text Acquisition phase, a program is used to generate large log files with millions of random log entries in the format mentioned earlier; these files are then pushed into HDFS using the Hadoop FileSystem API, as in the sketch below.
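A minimal sketch of such a load step, assuming a locally generated log file; the NameNode URI and paths are hypothetical, and in practice they come from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem hdfs = FileSystem.get(conf);

        // Copy the locally generated (and pre-merged) log file into HDFS.
        Path localLog = new Path("/tmp/generated/biglog.txt");   // hypothetical local path
        Path hdfsDir  = new Path("/data/logs/");                 // hypothetical HDFS input dir
        hdfs.copyFromLocalFile(localLog, hdfsDir);
        hdfs.close();
    }
}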

Text Transformation
Text transformation is the parser phase of the indexing process, which identifies and populates the index documents. In this phase, we tokenize the given input, extract the necessary field values, and construct the document.

This is the phase where index documents are built for the given Big Data. This phase parallelizes very well because building a document from one chunk of data does not depend on building a document from another chunk of data. We can write a Hadoop Map task that parses and extracts the data from input splits and constructs the SolrDocument, the document construct for Solr.

Let us see the Map side code snippet for doing this.

Solr Index Mapper
Below is a very simple Map task. Do not be concerned with the Hadoop API details; just concentrate on the core logic and the signature of the map method. The map method takes the input key/value pair from the calculated input split described earlier. By default, the input key is the file offset within the input file and the value is the line at that offset. In our log file structure, every log entry is contained in a single line. In the Map task, we ignore the input key because we do not care about the offset of the log entry; only the input value (the log entry) is parsed by a parser and converted to a SolrDocument object.
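Since the original code snapshot is not reproduced in this text, here is a minimal sketch of such a Mapper under the assumptions described above (tab-separated log entries, a LogParser helper, and a JSON helper for serialization); class and helper names are illustrative, not the author's exact code.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text logEntry,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The file offset (key) is ignored; only the log line (value) matters.
        SolrInputDocument doc = LogParser.parse(logEntry.toString()); // parser helper, sketched below
        if (doc == null) {
            return; // skip malformed entries
        }
        // The partition field becomes the intermediate key so that all documents
        // of one partition end up in the same Reduce task.
        String partition = (String) doc.getFieldValue("partition");
        String json = JsonUtil.toJson(doc); // hypothetical JSON marshalling helper
        output.collect(new Text(partition), new Text(json));
    }
}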


This is a snippet of the parser code, which generates the document object after extracting the necessary fields from the log entry.
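Again as an illustrative sketch (the exact parser is not reproduced here), splitting the tab-separated log entry and building the Solr input document might look like this; the field names are assumptions that must match the schema shown later.

import org.apache.solr.common.SolrInputDocument;

public class LogParser {
    // Expects: Log ID <tab> Log Code <tab> Log Partition <tab> Log Body
    public static SolrInputDocument parse(String logEntry) {
        String[] fields = logEntry.split("\t");
        if (fields.length < 4) {
            return null; // not a well-formed log entry
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fields[0]);         // field names must match schema.xml
        doc.addField("code", fields[1]);
        doc.addField("partition", fields[2]);
        doc.addField("body", fields[3]);
        return doc;
    }
}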

In the Mapper code, the SolrDocument is converted into JavaScript Object Notation (JSON) and emitted for further processing. JSON is used for marshalling and unmarshalling the SolrDocument object between the Map and Reduce tasks as it is lightweight and faster to process than more verbose data interchange formats such as XML.8

The partition field of the document is used as the intermediate key and the JSON representation of the SolrDocument as the value of the Map output. As explained earlier, the Hadoop MapReduce engine will group the values of the same output key and give them to a Reduce task; this ensures that a given partition will go to a specific Reducer.

Index Creation
After the Map tasks are completed across all nodes in the Hadoop cluster and partitioning and shuffling are done, the Reduce tasks come into action. We can configure the number of Reduce tasks for a MapReduce job.

We will do the final indexing part in the Reducer nodes, which will combine all the documents for a given partition key in parallel and create an index. Lucene cannot write its index to HDFS; it requires a Windows or Linux file system or some NFS/NAS file system. For this example, the Lucene index is created on the Windows file system. In large-scale clusters we can use NAS storage to which all the Reducer nodes have access and where each creates its respective index.

It is important to understand that if we set the number of Reduce tasks to N, there will be N sets of indexes created. How those N sets of indexes are merged at the end is an open question for now (which we will answer later), but before that we need to create the indexes in a separate directory for each of the N Reduce tasks so that the index created by the i-th Reduce task does not get overwritten by the j-th Reduce task. To achieve this, we can use the Reduce task ID, which uniquely identifies a Reduce task.

8 http://www.json.org/xml.html


When a job is submitted to the MapReduce engine, a Job ID is created. We already know that for a given job there will be some Map tasks and some Reduce tasks. Each of the tasks is also given a unique identifier by the MapReduce engine. Now, if we create a directory in the target file system for a job, and under that we create directories for every Reduce task, we solve the problem of one reducer overwriting the indexes of another.

But wait! Let's now try to understand who creates the index files and who writes those files to the index directory. We are not writing any code in the Reducer to create the directory in NFS/NAS storage; it is done by the Solr indexer. But how will the Solr indexer know the location of the index directory and the details of the Reduce task (like the task ID) which created the index? During the indexing process, we just give a set of SolrDocuments to the Solr indexer; we cannot give any information about the Reducer task the set of SolrDocuments came from. Another problem: we cannot run one single Solr server centrally and have each of the Reducer nodes access it to write their respective indexes. This design will not scale. So what is the solution?

Here comes the concept called EmbeddedSolrServer. This is basically a way to "embed" Solr into a Java application. It provides the exact same API as you would use if you were connecting to a remote Solr instance, which makes it easy to convert later if you'd like; and by using it you can be sure you will be using API calls that will be supported in the future.9

Now the question is how would you embed Solr into a Reduce task? To understand this we need to look at the way the Hadoop job is packaged.

A Hadoop job mainly consists of the Mapper class, the Reducer class, and one Driver class. The Mapper and Reducer classes have the logic to perform the Map and Reduce tasks. We have already seen the Mapper class; we will see the Reducer class shortly. The Driver class is basically the job client which submits the job to the MapReduce engine.

To execute a job in Hadoop, the necessary classes are bundled in a JAR file which can be launched from the command line as shown below. Here the jar file name is solrindex.jar and the Driver class name is SolrIndexDriver.
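The original command-line snapshot is not reproduced here; the submission would look roughly like this, where the input and output HDFS paths are hypothetical:

hadoop jar solrindex.jar SolrIndexDriver /data/logs /data/output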

9 http://wiki.apache.org/solr/EmbeddedSolr


This job submission does many steps internally: creating a job ID, identifying the input splits, copying the jar file to HDFS, and telling the JobTracker that the job is now ready to run. When the JobTracker schedules a task (Map or Reduce) to an available TaskTracker node, the TaskTracker copies the jar file from HDFS to the local file system and spawns a new Java Virtual Machine (JVM) in which it executes the task for the given input split.

If a Map or Reduce task needs additional jars as dependencies, they need to be bundled into the job jar file so that when a new JVM executes the task, it can resolve the references to the dependent jars. Similarly, if the Reduce task needs to write the index using EmbeddedSolrServer, the embedded Solr needs to be bundled along with the job jar file. Each Reduce task will have its own copy of the job jar file and hence its own copy of the EmbeddedSolrServer. As the EmbeddedSolrServer runs within the Reduce task, it has the details of the Reduce task ID and creates its own copy of the index files in the dedicated directory for that reducer. But for doing that, EmbeddedSolrServer requires some configuration settings, which we will see shortly.

Here is the snapshot of the MapReduce project structure in Eclipse IDE with Mapper, Reducer, Drivers, and other helper classes such as Parser, dependency jars, and Embedded Solr Server.


In the diagram to the left, HadoopSolrIntegration is the Java project in Eclipse. It has classes such as Mapper, Reducer, Driver, and Parser. The DistributedMergeQueue integrates with ZooKeeper; we will see the ZooKeeper integration later in this document.

If a Map or Reduce task needs to reference any additional jars or the EmbeddedSolrServer, they need to be placed under the lib folder. As you can see here, the lib folder contains a solr folder. This solr folder is nothing but the embedded Solr for the job. Within solr there are conf and data folders. The data folder is meant to hold index files; we will not be using this data folder but will configure the EmbeddedSolrServer to write the index to external storage. The conf folder contains two important files: schema.xml and solrconfig.xml. The schema.xml needs to be edited to include the field details of a SolrDocument which need to be indexed. The solrconfig.xml is used to configure and tune Solr.

The remainder of the jars shown under lib are dependency jars for the MapReduce job. As you can see, there are Lucene jars which are required for embedded Solr, and some jars related to ZooKeeper integration.

Let us now briefly explore the schema file inside our EmbeddedSolrServer. We can customize the search capabilities by editing the schema.xml file. There are two main tags in this XML file: <types> and <fields>. The <types> tag contains the basic data types which we need to define a field. The <fields> section is where you list the individual field declarations you wish to use in your documents. Each field has a name that you will use to reference it when adding documents or executing searches, and an associated type which identifies the name of the field type you wish to use for this field. There are various field options that apply to a field. These can be set in the field type declarations, and can also be overridden at an individual field's declaration.10

Our Solr Schema is very simple, containing four fields we defined earlier.
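As a sketch only (the article's schema snapshot is not reproduced here), the <fields> section for the four log fields might look like the following; the field names match the parser sketch above, and the string and text field types are assumed to be defined in the <types> section of the default Solr 1.4 schema.

<fields>
  <!-- Exact-match fields: string type, no analysis -->
  <field name="id"        type="string" indexed="true" stored="true" required="true"/>
  <field name="code"      type="string" indexed="true" stored="true"/>
  <field name="partition" type="string" indexed="true" stored="true"/>
  <!-- Analyzed field: text type, tokenized and filtered -->
  <field name="body"      type="text"   indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>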

The following is the meaning of “indexed” and “stored” attribute of field tag.

• indexed=true|false
  o True if this field should be "indexed". If (and only if) a field is indexed, then it is searchable and sortable.
• stored=true|false
  o True if the value of the field should be retrievable during a search.

The difference between the text field and string field types is that a text field uses a word delimiter filter to enable splitting and matching of words on case changes, alpha-numeric boundaries, and non-alphanumeric characters, so that a query of "wifi" or "wi fi" can match a document containing "Wi-Fi". A string field, on the other hand, does not use any analyzer.

There are many more advanced features available to customize the indexing; you can find good information in solr wiki to customize the schema for your need.

10 http://wiki.apache.org/solr/SchemaXml


Solr Index Driver
Let's now look at the Hadoop Driver class which submits the Hadoop job to the MapReduce engine.

This is the driver class which constructs the job configuration object with details such as the Mapper class, Reducer class, file input and output formats, and so on. The input path is set to the path where the Big Data files are loaded in HDFS. The output path is where a Reduce task would normally write its output to HDFS; in our scenario, as the Reduce task writes the index directly to the local or network file system, the output path setting is ignored by the Reduce task. There are a couple of interesting pieces of information we set in the JobConf. First is the location of embedded.solr.home. This location is required to connect to an instance of the Solr server. As we are using the EmbeddedSolrServer, which sits under the lib folder of the Hadoop job jar, we set the relative path of the embedded Solr server location, which will be used in the Reduce task. This is the path, relative to the reducer class, where the jar is exploded on the individual TaskTracker node. Second, we set index.dir, which is the location of the index directory for the given job. Each Reduce task will create its own task-level folder under this job-level index directory to make sure parallel Reduce tasks do not overwrite each other's indexes. In the Driver class there are also a few lines of code related to ZooKeeper integration, which will be explained later in this document. A sketch of such a driver follows.

Solr Index Reducer
Let's now look at the Reducer class which creates the index and writes it to the specified directory. For now, we will not go into the Hadoop API details for writing a Reduce task; just concentrate on the core logic for index creation.


The Reducer code is invoked by the MapReduce engine after the Map tasks are over. Each Reduce task is given an input key and a value list. The key for a Reduce task is the partition ID from the log file, and the value list is basically a list of SolrDocuments for that partition. The SolrDocuments are encoded in JSON format, as we saw in the Mapper code. In the Reducer, we decode the JSON back into SolrDocument objects.

There are some portions of the Reducer code which need explanation.

The first part, in the configure method, gets the details of SOLR_HOME and index.dir. We have already set those values from the Driver class. The remaining two config items (mapred.tip.id, which holds the Task ID, and mapred.job.id, which holds the Job ID) are populated by the MapReduce engine.

The next block sets solr.solr.home and solr.data.dir as system properties. The Embedded Solr Server picks up the Solr directory location from the solr.solr.home system property. The Solr data directory is the location where the index files will be created for a Reduce task. As you can see, the solr.data.dir value is constructed by concatenating index.dir + Job ID + Task ID.

Finally, in the Reducer code, the EmbeddedSolrServer object is created and we begin adding all SolrDocument objects into the server.

At the end, we commit to flush all indexes to disk and close the SolrCore. A sketch of such a Reducer follows.
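A minimal sketch under the assumptions above (Solr 1.4 EmbeddedSolrServer created from a CoreContainer, a hypothetical JsonUtil helper, and the core name "core0" as an assumption); this is illustrative, not the author's exact code, and the embedded-server bootstrap API differs in later Solr versions.

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class SolrIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private EmbeddedSolrServer solr;
    private CoreContainer coreContainer;

    @Override
    public void configure(JobConf conf) {
        String solrHome = conf.get("embedded.solr.home");   // set by the Driver
        String indexDir = conf.get("index.dir");             // set by the Driver
        String jobId    = conf.get("mapred.job.id");           // populated by the MapReduce engine
        String taskId   = conf.get("mapred.tip.id");            // populated by the MapReduce engine

        // Point the embedded Solr at its config and at a task-specific data directory.
        System.setProperty("solr.solr.home", solrHome);
        System.setProperty("solr.data.dir",
                indexDir + File.separator + jobId + File.separator + taskId);
        try {
            coreContainer = new CoreContainer.Initializer().initialize();
            solr = new EmbeddedSolrServer(coreContainer, "core0"); // core name is an assumption
        } catch (Exception e) {
            throw new RuntimeException("Could not start embedded Solr", e);
        }
    }

    public void reduce(Text partition, Iterator<Text> jsonDocs,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            while (jsonDocs.hasNext()) {
                // Decode the JSON back into a Solr document and index it.
                SolrInputDocument doc = JsonUtil.fromJson(jsonDocs.next().toString()); // hypothetical helper
                solr.add(doc);
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        try {
            solr.commit();            // flush the index to disk
        } catch (Exception e) {
            throw new IOException(e);
        } finally {
            coreContainer.shutdown(); // closes the SolrCore
        }
    }
}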

Let us now see what the index directory structure looks like after the Reduce tasks. For our MapReduce job, index.dir is set to, for example, /x/y/generated_index. Three Hadoop jobs are executed to create indexes for three sets of log data; thus you can see three job folders (starting with job_). For each job there is a single Reduce task folder (starting with task_). If a given job has multiple Reduce tasks, you will see multiple task folders under the job folder. The index folder holds the actual index files for the given task. Merging of the indexes generated by different Reduce tasks is done outside the Hadoop job. This merging process is coordinated by ZooKeeper.

What is ZooKeeper?
ZooKeeper is a coordination service for distributed applications. It exposes a simple set of APIs that distributed applications can use to build higher-level services for synchronization, configuration maintenance, and so on. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.11

Coordination services are very complex to implement. If not done correctly, they can create issues such as deadlock and race conditions. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch. The concept of ZooKeeper is very simple: the ZooKeeper server maintains a hierarchical namespace similar to a file system, as shown in the figure below. The construct which stores the hierarchical namespace information is called a znode.

In this figure there is a root znode (/) and under that there are two znodes (/zoo1 and /zoo2). A znode has some properties, described below, which control access to it; a minimal example of creating znodes with the ZooKeeper Java API follows the list.

• A client can create a znode, store up to 1 MB of data in it, and associate as many child znodes with it as it wants. Data access (read or write) on a znode is always atomic.
• A znode can be one of two types: ephemeral or persistent.
  o Ephemeral znodes are deleted by ZooKeeper when the creating client's session is closed.
  o Persistent znodes stay as long as they are not deleted explicitly.
• Each znode has an Access Control List (ACL) that restricts who can do what.
• Znodes maintain version numbers for data changes, ACL changes, and timestamps.
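A minimal sketch of creating persistent and ephemeral znodes with the ZooKeeper Java API; the connection string and paths are illustrative only.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server; connection events are ignored for this sketch.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) { /* no-op */ }
        });

        // Persistent znode: stays until deleted explicitly.
        zk.create("/zoo1", "some data".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this client's session closes.
        zk.create("/zoo1/worker", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close();
    }
}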

11 http://zookeeper.apache.org/doc/trunk/zookeeperOver.html

Index Merger
Let us now see how ZooKeeper can be integrated to perform the automatic index merging operation as soon as the Hadoop job is completed. We have learned that a given job can have multiple reducers, and all reducers create their own indexes in dedicated folders. Once the job is completed, we need to merge all those indexes into the target Solr installation where the end user will perform the search. The schema of the target Solr installation should be the same as the schema of the embedded Solr server. This is understandable because we need to perform the search on the indexes produced by the embedded Solr server, and both schemas have to match. The figure below provides an overview of the architecture.

Architecture overview: the Hadoop cluster (Name Node, JobTracker, and TaskTrackers running embedded Solr) writes per-task indexes to an index directory on external storage; after job completion ZooKeeper is notified (1), ZooKeeper triggers the Index Merger (2), and the Index Merger merges the indexes into the target Solr installation running on Tomcat (3).

Here, after the MapReduce job is completed and all Reduce tasks have written their indexes to the dedicated index directory, the job client notifies the ZooKeeper server, which initiates the merging process.

Zookeeper Integration
The ZooKeeper API is used to implement a distributed queue. The queue is nothing but a persistent znode in the ZooKeeper server. This queue znode holds other znodes as its children, which can be treated as the elements of the queue. When the MapReduce job is completed, it publishes the information about the index directory under the queue znode. The IndexMerger is the consumer of the queue, which monitors the queue for new elements. As and when information about an index directory is added to the ZooKeeper queue, the IndexMerger is notified and gets the details of the directory where the index was written. It then merges the indexes into the target Solr deployment. The whole concept is basically the classic producer-consumer solution: until the producer produces an item, the consumer waits; when the producer produces an item, the consumer is notified and consumes it. Let us now see the implementation of the whole logic.


Recalling the MapReduce Driver class, we specified the ZooKeeper server IP and port number and the logic for the distributed queue. Let us take a look at the relevant portion of the code below.

Here, Zookeeper service is running on localhost. The queue implementation connects to the Zookeeper server and creates a queue znode called “indexq”.

The queue API has a method to add/produce an element and associated data into the queue. Internally, it adds the new znode under the “indexq”.
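A simplified sketch of what the produce side of such a queue might do with the raw ZooKeeper API; the article's DistributedMergeQueue helper wraps logic like this, and the method names here are illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class IndexQueueProducer {
    private static final String QUEUE = "/indexq";
    private final ZooKeeper zk;

    public IndexQueueProducer(ZooKeeper zk) throws Exception {
        this.zk = zk;
        // Create the persistent queue znode if it does not exist yet.
        if (zk.exists(QUEUE, false) == null) {
            zk.create(QUEUE, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
    }

    // Add one element: the child znode name carries the job ID, its data the index directory.
    public void produce(String jobId, String indexDir) throws Exception {
        zk.create(QUEUE + "/" + jobId, indexDir.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
}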

As you can see in driver code, the element produced in the queue is the MapReduce Job ID, and value is the index directory location of the job. Driver Class receives the Job ID details from the RunningJob object after the job is submitted to the MapReduce engine. The figure below shows the details of the Zookeeper queue after a couple of MapReduce job runs.


This is the snapshot of a ZooKeeper plug-in in Eclipse. You can see the "indexq" created under the root (/) node of ZooKeeper. Under indexq there are two job elements (znodes). The value of the highlighted job element is the location of the index directory for that job. If you recall, under this job directory there are task directories which actually hold the indexes. The consumer part of the queue, which runs on a different server, will now read this znode data and use the Lucene API to merge all the indexes generated under this job directory.

The consumer is a very simple queue consumer and index merger routine. It connects to indexq and waits for data in the queue. Once the queue has data, it consumes the element and gives it to IndexMergeJob, which uses the Lucene API to merge all index directories available under the job directory with the index of the target Solr server. A sketch of such a consumer follows.
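A simplified sketch only, not the author's code: it polls the queue (a real consumer would use a ZooKeeper watch) and merges with Lucene's IndexWriter. The Lucene calls shown are the 2.9.x style bundled with Solr 1.4.1; later Lucene versions use IndexWriterConfig and addIndexes(Directory...) instead.

import java.io.File;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.zookeeper.ZooKeeper;

public class IndexMergeConsumer {
    private static final String QUEUE = "/indexq";

    // Blocks until a job element appears in the queue, then merges its indexes.
    public void run(ZooKeeper zk, String targetSolrIndexDir) throws Exception {
        while (true) {
            List<String> jobs = zk.getChildren(QUEUE, false);
            if (jobs.isEmpty()) {
                Thread.sleep(5000);   // polling for simplicity; a watch avoids this
                continue;
            }
            String jobId = jobs.get(0);
            String jobIndexDir = new String(zk.getData(QUEUE + "/" + jobId, false, null));
            mergeJobIndexes(jobIndexDir, targetSolrIndexDir);
            zk.delete(QUEUE + "/" + jobId, -1);   // element consumed
        }
    }

    // Merge every task-level index under the job directory into the target Solr index.
    private void mergeJobIndexes(String jobIndexDir, String targetSolrIndexDir) throws Exception {
        File[] taskDirs = new File(jobIndexDir).listFiles();
        Directory[] sources = new Directory[taskDirs.length];
        for (int i = 0; i < taskDirs.length; i++) {
            sources[i] = FSDirectory.open(new File(taskDirs[i], "index"));
        }
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File(targetSolrIndexDir)),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addIndexesNoOptimize(sources);   // addIndexes(Directory...) in newer Lucene
        writer.optimize();
        writer.close();
    }
}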

This article does not go into the details of the distributed queue implementation using the ZooKeeper API. You can find examples of interesting ZooKeeper-based solutions at http://zookeeper.apache.org/.


There are different options available to merge indexes.12 For this example, IndexMergeTool is used, which is part of the lucene-misc.jar file. Here is the usage of IndexMergeTool.

Usage: IndexMergeTool <mergedIndex> <index1> <index2> [index3 ...]

This merges multiple index folders and generates the final index in the <mergedIndex> folder.

Once the indexes are merged, we can use the Solr query features/API to retrieve the data.13 For testing the validity of the generated index, we can also use Solr admin console.

Architecture Scalability
Up to this point, we have defined the architecture for parallel indexing using Hadoop and Solr. Let us now discuss a few scalability aspects of the given solution. Scalability means how well a system can support the growth of data and traffic. A system is considered scalable if there is minimal impact on performance with larger data sets and higher traffic. Also, system complexity and maintenance should not increase as it scales.

Hadoop is already a proven scalable architecture. Hadoop cluster performance increases with the addition of TaskTrackers and Data Nodes. But this has a limit: as you know, the Hadoop Name Node stores information about the Data Nodes and blocks in memory, so it cannot scale unbounded. There is a lot of tuning we can do to make Hadoop perform better. We will look at some of the Hadoop tuning options in the next section.

12 http://wiki.apache.org/solr/MergingSolrIndexes
13 http://wiki.apache.org/solr/CommonQueryParameters


In our architecture, a major scalability bottleneck could be accessing the NAS/External storage device from individual Reducer nodes for writing the indexes. This bottleneck can be overcome with high-speed Ethernet and high-performance NAS storage.

Another bottleneck in our indexing architecture could be Solr itself. If the index grows so big as to become unmanageable by a single Solr instance, we need to think about scaling Solr. There is a concept called Distributed Search in Solr: when an index becomes too large to fit on a single system, or when a single query takes too long to execute, the index can be split into multiple shards, and Solr can query and merge results across those shards.14

In our architecture, distributed search is easier to accomplish as we already have the index partitioned by Reducer task. If you remember, our index creation is divided based on the partition key, i.e., each reducer creates indexes for a specific partition key. We can even implement our own Partitioner logic to override the default hash Partitioner of Hadoop for better control of the shuffling and partitioning of keys after the Map task. We also have to make the IndexMerger module intelligent enough to merge indexes for a given partition only into its dedicated Solr shard.

The diagram below depicts the shard concept.

Here you can see three Reduce tasks configured for three partitions. Each writes its index to a dedicated directory. There are three Solr shards configured, one for each of the partitions. The index merger merges each index into its shard.

Solr Cloud is another concept we can adopt for scaling solr.

14 http://wiki.apache.org/solr/DistributedSearch


Solr Cloud is the set of Solr features that take Solr's distributed search to the next level, enabling and simplifying the creation and use of Solr clusters.15 In Solr Cloud we can:

• Centralize the configuration for the entire cluster
• Perform automatic load balancing and fail-over for queries

Solr Cloud uses Zookeeper to coordinate and store the configuration of the cluster. As we already have ZooKeeper integration in this architecture, we can leverage the same Zookeeper server for Solr Cloud configuration. Refer to the Solr Wiki on how to configure the Solr Cloud.

Performance Tuning
Hadoop Tuning
Hadoop performance tuning is a very advanced topic which requires detailed knowledge of how the Hadoop framework works. Here we will discuss a few options you may need to consider to tune your Hadoop cluster.

In Hadoop there are properties which set the maximum number of simultaneous Map or Reduce tasks that can execute on a given TaskTracker node. These are called TaskTracker slots, and the default is 2. That means at most 2 Map and 2 Reduce tasks can execute on a TaskTracker node simultaneously. We can tune these slot values so that they fully utilize the number of processing (CPU) cores available on the TaskTracker machine. If the number of cores available is 4 (quad-core CPU), we can set the number of Map/Reduce slots to 4. As Map and Reduce tasks run in parallel, the cores will be utilized properly if the number of slots is equal to or greater than the number of CPU cores. A sketch of the relevant configuration follows.
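A minimal sketch of the slot settings, placed inside the <configuration> element of mapred-site.xml; the property names apply to Hadoop 0.20/0.21/1.x, and the value of 4 assumes a quad-core TaskTracker node.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>   <!-- default is 2 -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>   <!-- default is 2 -->
</property>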

We should keep the input file size as large as possible to utilize the HDFS block size. The default block size is 64 MB, and 128 MB is also a good block size. The size of the files loaded into HDFS should be significantly higher than this value.

If you recall, a Hadoop job consists of the following steps: distribute data to HDFS, perform the Map tasks, write the output of the Map tasks, shuffle the output keys, sort the keys, perform the Reduce tasks, and write to HDFS. Each of these steps can become a bottleneck for a large job. The figure below shows the flow of data in a Hadoop cluster. Let's look at what Hadoop does during each step and what tuning options are available.16

15 http://wiki.apache.org/solr/SolrCloud
16 Hadoop: The Definitive Guide (book)


When a Map task produces output, it buffers the data in memory before writing it to disk. Each Map task has a circular memory buffer of 100 MB (default). When the content of this buffer reaches a certain threshold, it spills the content to disk. Map output continues to be written into the buffer while the spill happens. Spills are written to the local node file system (not HDFS). Another interesting thing happens before the data is written to disk: the data is divided into partitions corresponding to the Reduce tasks it will be sent to, and within each partition an in-memory sort happens on the key before it is written to disk. This sorting and spilling continues whenever the buffer threshold is reached. During every spill, a new spill file is created. Each spill file is partitioned, and keys are sorted within each partition. When the Map task is completed, the spill files are merged into one single partitioned and sorted file.

The tuning options available during this Map process control this buffering and spilling. There are properties to set the in-memory buffer size and the spill percentage (i.e., how full the buffer must be for spilling to kick off), and properties to tune the number of spill files merged at once to produce the final sorted file; a sample follows.
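A sketch of the map-side properties, again inside mapred-site.xml; these are the Hadoop 0.20/1.x names (newer releases rename them under the mapreduce.* prefix), and the values shown are illustrative.

<property>
  <name>io.sort.mb</name>             <!-- in-memory sort buffer size, default 100 MB -->
  <value>200</value>
</property>
<property>
  <name>io.sort.spill.percent</name>  <!-- buffer fill level that triggers a spill, default 0.80 -->
  <value>0.80</value>
</property>
<property>
  <name>io.sort.factor</name>         <!-- number of spill files merged at once, default 10 -->
  <value>20</value>
</property>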

Let us now see what happens on the Reduce side. After the Map process completes, the Map output files sit on the TaskTracker node that produced them. Now the "copy" phase starts, during which a sorted partition from a Map output file gets copied into a Reducer TaskTracker's memory. As you can imagine, different sorted partitions for the same key can come from different Mappers. Buffering in memory and spilling to disk also happen here: multiple partitions from different Mappers are copied into the Reducer's memory, and when memory reaches a threshold, partitions are merged and spilled to disk. This spilling can happen multiple times during the copy phase. A background process merges these spills to save time during the final merge.

Multiple spill files are available on disk after the copy phase completes. Now the final merging starts. This merging happens in rounds, based on the merge configuration.

For the Reduce phase, the possible tuning options are to control the number of threads that copy Map output to the Reducer node in parallel, to configure the input buffer used to hold the copied files in memory, and, as on the Map side, to control the spill threshold. Finally, we can control how many files are merged together to produce the final set of files for the Reduce task. A sample of these properties follows.
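A sketch of the corresponding reduce-side properties in mapred-site.xml; again these are the Hadoop 0.20/1.x names, and the values are only examples.

<property>
  <name>mapred.reduce.parallel.copies</name>           <!-- copier threads per reducer, default 5 -->
  <value>10</value>
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name> <!-- share of reducer heap that buffers map outputs, default 0.70 -->
  <value>0.70</value>
</property>
<property>
  <name>mapred.job.shuffle.merge.percent</name>        <!-- buffer fill level that triggers a merge/spill, default 0.66 -->
  <value>0.66</value>
</property>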

Solr Tuning
Thus far, we have seen some of the internal workings of Hadoop and how to tune Hadoop for better performance. Let us now quickly look at a few tuning options available for Solr.

Previously, we discussed how Lucene buffers documents in memory before flushing them to disk and merges segments on disk when the number of segments reaches a certain threshold. This is controlled by two properties: ramBufferSizeMB and mergeFactor. The RAM buffer size controls how many documents can be accommodated in memory before being written into a single segment file; the higher the buffer size, the better the indexing performance. The mergeFactor roughly determines the number of segments: it tells Lucene how many segments of equal size to build before merging them into a single segment.17 The tradeoffs are listed below, followed by a sample configuration.

mergeFactor Tradeoffs

• High merge factor (e.g., 50):
  o Pro: Generally improves indexing speed.
  o Con: Less frequent merges, resulting in a collection with more index files, which may slow searching.
• Low merge factor (e.g., 5):
  o Pro: Smaller number of index files, which speeds up searching.
  o Con: More segment merges slow down indexing.
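A sketch of where these knobs live in a Solr 1.4 solrconfig.xml (the same settings can also appear in the <mainIndex> section); the values shown match the ones used in the POC run below.

<indexDefaults>
  <ramBufferSizeMB>500</ramBufferSizeMB>  <!-- buffer this much document data in memory before flushing a segment -->
  <mergeFactor>20</mergeFactor>           <!-- merge once 20 segments of roughly equal size accumulate -->
</indexDefaults>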

17 http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor


Final Execution
In this final section we will see a couple of executions of this indexing process on a single-node Hadoop cluster. Running Hadoop on a single-node cluster does not give any benefit of parallelism, but a demo run on a single node at least validates the correctness of the design and can be used as a Proof of Concept (POC). To test the scalability of the design, you need a multi-node cluster setup.

Here the indexing design is tested on a Dell 32-bit laptop with an Intel i5 M 560 CPU @ 2.67 GHz and 3.24 GB RAM, where Hadoop runs on Cygwin along with ZooKeeper and Tomcat on the same machine.

Run 1: Hadoop HDFS is loaded with 1 million log entries created programmatically. Here is the snapshot of a few lines taken from the randomly generated Log entries.

ID | Code | Partition | Body

• 1st column: Log ID
• 2nd column: Log Code
• 3rd column: Log Partition ID
• 4th column: Log Body

The basic tuning used:

• JVM heap size for Map/Reduce tasks set to 1024 MB.
• Total buffer size for in-memory sorting set to 200 MB.
• mergeFactor on the Solr side set to 20.
• Solr RAM buffer size set to 500 MB.

It took around 7.5 minutes to create the indexes for 1 million log entries by Hadoop MapReduce and around 35 seconds by IndexMerger module to merge with the target Solr installation.


Index statistics taken from Solr validate that a total of 1 million documents finally went into Solr.

Run 2: The second run is executed with another 1 million log entries, but this time the Solr RAM buffer size is set to 1024 MB.

While this time it took around 6 minutes to create the indexes, it took 50 seconds to merge the index with target Solr as the target Solr already had 1 million documents from the previous run.

See the index statistics which shows the 2 million documents.

Searching these 2 million documents by ID, Code, Partition, or Body will return the exact document within milliseconds. Let us see a query example.

Query: http://localhost:8080/solr/core0/select/?q=ABC&version=2.2&start=0&rows=10&indent=on

This query says, search for “ABC” in all fields of all Documents. Below are matches...


• The query took 234 ms to return.
• There are 14 matches among the 2 million documents.
• Some of the matched fields are highlighted in the result snapshot.

Summary
As we have seen in the POC, with a single machine having only ~3 GB of memory and minimal tuning, we can index 1 million records in roughly 6 to 7.5 minutes and search across 2 million documents within about 250 ms. As we know, Hadoop scales horizontally: more machines can be added to the cluster, and with a good amount of memory and processing power this indexing can be done much faster. If we have an efficient Text Acquisition component which collects and merges log files as they arrive and pushes them to HDFS for index building, the data will be available within minutes in the target Solr server for search. Obviously, the time to build the index depends on many factors including the cluster configuration, network bandwidth, complexity of the index requirements, proper tuning of the cluster, and so on.

In this Knowledge Sharing article we have shown how to perform parallel indexing of Big Data. We discussed Hadoop and its subsystems and how Hadoop MapReduce works, and we looked at some tuning options for Hadoop. The article covered the concepts of Information Retrieval and the popular open source search technologies Solr and Lucene. We presented a design to integrate Hadoop and Solr and demonstrated the functionality of the distributed consensus service ZooKeeper.

This article does not cover the details of Hadoop, Solr, Lucene, and ZooKeeper programming or the more complex features of the respective technology stacks. Performance numbers for large-scale indexing on multi-node clusters are also not evaluated in this article.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
