REAL TIME SEARCHING OF BIG DATA USING HADOOP, LUCENE, AND SOLR

Dibyendu Bhattacharya
Principal Software Engineer
RSA, the Security Division of EMC
[email protected]

Table of Contents
Introduction
Understanding Big Data
Power of Parallelism
Parallel Processing Challenges
Hadoop Overview
HDFS
MapReduce Engine
Hadoop Data Flow
Information Retrieval Overview
Search Engine Fundamental
Lucene Search Engine Library
Solr Server
Big Data Indexing
Text Acquisition
Text Transformation
Solr Index Mapper
Index Creation
Solr Index Driver
Solr Index Reducer
What is ZooKeeper?
Index Merger
Zookeeper Integration
Architecture Scalability
Performance Tuning
Hadoop Tuning
Solr Tuning
Final Execution
Summary

Disclaimer: The views, processes, or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.


Introduction
This Knowledge Sharing article contains an overview of how the distributed computing framework Hadoop can be integrated with the powerful search engine Solr. The enormous explosion of data in today's digital world requires Hadoop's capability to process Big Data along with Solr's indexing capability. Building the index of Big Data in parallel and storing it in Solr Cloud will enable enterprises to easily search Big Data.

Both Hadoop and Solr are very popular and successful open source projects widely used in large-scale distributed systems. This article will provide the reader with a fair amount of knowledge of how the Hadoop framework works, the concept of MapReduce, how to perform large-scale distributed indexing using Solr, and how to use a distributed consensus service such as ZooKeeper to coordinate all these.

Understanding Big Data
We are living in the data age. Big Data is defined as amounts of data so large that they become impossible to manage in traditional data warehouse and database management systems. Big Data creates major challenges for enterprises to store, share, manipulate, search, and perform analytics on this data set. McKinsey Global Institute (MGI) studied Big Data in five domains: healthcare in the United States, the public sector in Europe, retail in the United States, manufacturing, and global personal-location data.1 In addition, an International Data Corporation (IDC) survey projects that data will grow 50 times in the next 10 years.2

Now this huge data set is more than a challenge for an organization; it presents an opportunity to be exploited. Technology advancement in the area of Big Data Analytics gives enterprises opportunities like never before to poke at this data, mine it, uncover trends, perform statistical analysis, and drive business with better decision-making capabilities. EMC has estimated the Big Data market at $70 billion.3

Power of Parallelism
To analyze huge volumes of data, we need to process the data in parallel. Without a robust parallel processing framework, analyzing and crunching Big Data is almost impossible. Let's look at a simple example of why we need parallelism.

1 http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
2 http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
3 http://www.forbes.com/sites/greatspeculations/2011/12/13/emc-riding-big-data-wave-to-42-stock-price/


A commodity computer can read about 50 MB/second from hard disk and store 10 terabytes (TB) of data. Let's assume we need to process a job on big data that is 1 TB. A job is basically some processing on a given data set, anything from a set of log files collected from distributed data centers to huge data produced for indexing. With a commodity computer performing a sequential read of the data from a single drive, the read part itself would take roughly 330 minutes (about 5.5 hours)! Given that a computer with a large amount of memory and CPU processing power can process the data very fast, the I/O operation becomes the bottleneck.

The obvious solution is parallel reading.

Instead of reading the 1 TB of data from a single computer, if we read using 500 computers in parallel, we can read at 25 GB/second (500 x 50 MB/second)!

Parallel Processing Challenges
But parallel processing comes with lots of challenges. Let's try to understand some of the key ones.

If there are 500 computers (nodes) reading data in parallel, what will be the source of the data? Will all nodes point to the same source? How do we then split the data so that a specific node reads only specific splits? Do we need to transfer the data to the nodes over the wire? How much bandwidth is needed to transfer such huge data? Is there any data sharing among nodes? So many questions…

One way we can solve this is by distributing and storing the data locally on the nodes and moving the computation logic to the node where the data is available. For a massively parallel system, sharing of data can be a costly affair, which may lead to serious synchronization problems and deadlocks. A shared-nothing design between nodes simplifies the parallelism. Let us assume that data will be distributed somehow to all those nodes and job execution will work on the local data. We need to ask a few more questions. What will happen if a node crashes? Is there any way we can recover the data that was on that node?

The answer is yes, if we replicate the same data to more than one node. Then, if one node crashes, the replica node has the backup. But replication has its own challenge: consistency. How do we make sure the replicated data is consistent across nodes? There is a famous theorem by Eric Brewer, computer scientist at the University of California, Berkeley, which says that, among three properties of a distributed shared data store (data consistency, system availability, and tolerance to network partitioning), one can achieve only two. This is called the CAP


(Consistency, Availability, Partition tolerance) theorem.4 Any distributed system must tolerate network partitioning; otherwise, we cannot scale the system. This means we have to compromise on either data consistency or system availability.

You can see the challenges of a parallel processing system and the need for a fault-tolerant, distributed parallel computing framework which can address all of them. Google made the first breakthrough in solving this massively parallel computing problem and published papers on the Google File System (2003) and MapReduce (2004). Doug Cutting took the ideas from these papers and started building a similar framework in Java called Hadoop, which became a top-level Apache project (http://hadoop.apache.org/) and open source software.

Hadoop Overview
Let's see what Hadoop is and how it solves the parallel processing challenges.

The Hadoop framework consists of two subsystems:

1. Hadoop Distributed File System (HDFS)
2. MapReduce Engine

Figure 1 gives an overview of the Hadoop HDFS layer and the MapReduce layer. We will discuss JobTracker, TaskTracker, Data Node, and Name Node functionality shortly.

Figure 1

4 http://en.wikipedia.org/wiki/CAP_theorem


HDFS
For a distributed parallel processing system to work across multiple nodes, Hadoop needs a distributed file system. The Hadoop Distributed File System (HDFS) is block-level storage where each file is divided into equal-size blocks (default 64 MB) and distributed across nodes in the cluster. The HDFS block size is kept very large compared to a traditional file system so that the time spent transferring data dominates the time spent seeking it, since seek time is pure overhead for any I/O operation.

In Hadoop, the Data Node is responsible for storing the data. Each node in a Hadoop cluster runs one Data Node. Assume we are trying to load a 1 GB file into HDFS where the block size is 64 MB. In this case the file will be divided into 16 blocks. Each block will be replicated to three (by default) or more Data Nodes. As you can see, a large file is distributed among blocks, and blocks can span nodes. HDFS is a rack-aware file system, which means block replicas can be placed in the same rack or across racks in a data center to prevent data loss due to rack failure.

There is another component in the HDFS layer called the Name Node, which coordinates access to the Data Nodes and stores the metadata about the file system (i.e., which block resides on which Data Node). The Name Node metadata is kept entirely in memory for faster lookup, and the Name Node is the single point of failure for a Hadoop cluster. If the Name Node goes down, the entire HDFS file system goes down.

MapReduce Engine
With data stored across multiple nodes, Hadoop needs a data processing engine to process the data in parallel. The Hadoop MapReduce engine is a functional programming model for doing such a job.

MapReduce consists of two parts. The first part is called Map, which processes the input data as a Key Value pair and produces an intermediate list of Key Value pairs.

Map Function: Map (k1, v1) → list (k2, v2)

The Map function is applied to every item in the input data set in parallel, producing the intermediate key/value pairs.

The MapReduce framework then collects all pairs with the same key and groups them together, creating one group for each generated key, which is given to Reduce.


Reduce Function: Reduce (k2, list (v2)) → list (v3)

The Reduce function works on the output of the Map function. It also works in parallel and on the groups produced by the MapReduce engine and generates the final output value list.

Map and Reduce tasks do not share any data; each works only on its given data set, which is what allows them to run in parallel on multiple nodes.

For solving any Big Data related problem, we need to define the data processing logic such that it can be fitted into the Map Reduce paradigm. If the data can be processed using Map and Reduce task, it can be parallelized in the Hadoop framework.

Figure 2 depicts the Map Reduce process for given data.

Figure 2

As you can see, the data is divided into two data sets. There is one Map task for each data set in the figure above. Each Map task emits three intermediate keys (key1, key2, and key3) and associated values. The MapReduce engine groups the same key along with its values from the different Mappers and gives them to a Reduce task. There are three Reduce tasks; each gets one of the intermediate keys along with its value list. The Reduce tasks work on the value list for the given key and produce a final result.

Let’s look at a simple word count example and how it can be done using Map and Reduce task.

Here, the input file contains one line, “how much wood would a woodchuck chuck if a woodchuck could chuck wood”.

The Map Function emits the (word, 1) for every word it encounters in the line.

The Reduce function takes the intermediate keys (the words themselves) and their values (lists of ONEs) as input; e.g., for the word "chuck", the input to Reduce is (chuck, {1,1}) as the word "chuck" occurs twice. The Reduce function simply sums the list of values (the list of ONEs), and the output is the word count for that word. A minimal sketch of such a Mapper and Reducer in Java follows.
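The sketch below uses the classic org.apache.hadoop.mapred API (the style used elsewhere in this article); class names are illustrative, not code from the original paper.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: for every word in the input line, emit (word, 1).
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);   // intermediate pair (k2, v2)
        }
    }
}

// Reduce: sum the list of ONEs for each word, e.g. (chuck, {1,1}) -> (chuck, 2).
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        output.collect(word, new IntWritable(sum));   // final (k2, v3)
    }
}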

Hadoop Data Flow
Let us now see how the data flows within the Hadoop framework when a client submits a job. A job in Hadoop contains the logic to perform the Map and Reduce tasks on the input data. HDFS holds the data in Data Nodes as described earlier. When the client submits the job, it finds the details of the data available for the given job by querying the Name Node and splits the data; those splits are called input splits. The input split size is normally kept the same as the block size (default 64 MB) of HDFS. E.g., if the data is spread across 1000 blocks, 1000 splits will be created. The number of Map tasks Hadoop produces is the same as the number of splits. The number of Reduce tasks can be configured.


When a job is submitted to the MapReduce engine, the JobTracker pushes tasks to the TaskTrackers. The JobTracker has the details of the available input splits. The TaskTracker runs on the same node where the input splits/data blocks reside. The TaskTracker then spawns task instances and performs the Map and Reduce tasks on every input split. The JobTracker keeps the status of each task, and if any TaskTracker fails, that part of the work is rescheduled.

For example, if the job size is 200 GB and the HDFS block size is 64 MB, the total number of blocks will be 3200. If we keep the input split size the same as the block size, the total number of splits will also be 3200. Thus, the Hadoop MapReduce engine will produce 3200 Map tasks, one for each individual split. As splits are spread across Hadoop cluster nodes (Data Nodes) and a TaskTracker runs on the same node where the data is available, multiple Map tasks can run in parallel, independent of each other. Once all Map tasks are over, the output of the Map tasks is given to the Reduce tasks, which continue processing the data and produce the final output.

We will see the Map and Reduce task for doing parallel indexing shortly. Before that, let’s jump to the concept of search engine and information retrieval which will give us some idea about the indexing process and how it can be applied to Hadoop.

Information Retrieval Overview
By now we have learned the challenges of distributed computing and how Hadoop solves them. We also discussed the various constructs of the Hadoop framework and their functionality. Let's now switch to another very interesting and challenging domain of computer science: information retrieval and search engines.

Search and Information Retrieval (IR) is the most common activity we do every day on the Internet. A huge amount of research has been done in the area of IR, specifically on retrieving information from unstructured data such as documents and text. The sources of this unstructured data are very broad: log files, emails, news feeds, web pages, books, blog posts, and so on. Any artifact which can be searched is referred to as a document, which basically contains some common set of information or fields. For a log file, for example, an IR document may contain date and time, severity level, log message, and log source as the common fields which can be searched.

There are many differences between database records and IR documents. The main difference is that relational data is more structured in nature, so query/compare/retrieve operations can be performed more easily than on unstructured data, such as a log file.

Search Engine Fundamental
A search engine is an application of IR techniques. The major challenges in search engines are:

• Relevance: Are you getting what you searched for?
• Performance and Scalability: Can the search engine scale with large volumes of data?
• Incorporating new data: Is the search engine able to retrieve newly added data?
• Adaptability: Can the search engine be tuned per application needs?

To improve search performance and accuracy, search engines use an index, which is a data structure that stores data for faster retrieval. Designing the index data structure is the key to search engine performance.

Any indexing process consists of the following phases5.

• Text Acquisition: Identify and store documents for indexing.
• Text Transformation: Transform documents into index terms and features.
• Index Creation: Take index terms and create the index data structure.

We will discuss these phases in detail later in this article. The following diagram shows the three phases of the indexing process.

5 http://www.search-engines-book.com/


The three phases of the indexing process

Let us now pause for a moment and recall our original goal. We are trying to build the index on Big Data using Hadoop, Solr, and Lucene. Obviously, we cannot perform the above three steps of Text Acquisition, Text Transformation, and Index Creation in sequence over the huge data set; that would take a long time to give us the final index. We somehow need to bring parallelism into this whole process. We will see how this can be achieved when we dig deep into the implementation. For now, let's jump to the architecture of Lucene and an understanding of Solr, which are, respectively, the most popular and widely used search engine library and search server for information retrieval.

Lucene Search Engine Library
Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.6

Text search is a feature where users need to find the documents that a given piece of text belongs to. This is based on the concept of the inverted index, where words are mapped to the documents that contain them. While retrieving documents for the searched words, the search engine should return them in order of relevance. For calculating the relevance of a document, the search engine uses a ranking algorithm which computes a score for the document.

6 http://lucene.apache.org/java/docs/index.html


This figure describes the simple inverted index concept. Here we have three documents with IDs 1, 2, and 3 containing some text. During the indexing process, the text of each document is broken into terms (words). Each term is then mapped to the documents in which it appears.

A Lucene index contains multiple segments, and each segment contains multiple documents. The document is the key construct of information retrieval; it contains fields. Fields of a document have a name and an associated value. The text value of each field is broken into terms during the indexing process, as seen in the example above. As we add documents to the index, new segments are written, and segments are merged based on certain configuration settings. During the index creation process, Lucene also keeps a certain number of documents in memory and flushes them to a new segment on disk when the number of documents in memory reaches a threshold. The performance of the search engine depends on the number of segments: the fewer segments, the better the search performance. Tuning Lucene to perform optimally with a large data set comes down to how many documents we keep in memory and the threshold at which Lucene merges segments. These concepts will be helpful for understanding the performance of the indexing process.

Solr Server
Lucene is a search library, whereas Solr is the web application built on top of Lucene which simplifies the use of the underlying search features.

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication. It powers the search and navigation features of many of the world's largest Internet sites.7

Big Data Indexing
Thus far, we have learned about Solr/Lucene for searching/indexing and Hadoop for parallel processing. We will now see how to integrate these two technologies for high-performance, scalable indexing of Big Data.

The following technology stack is used for implementing Big Data Indexing on Hadoop cluster. (We are not going into details of installing each of them. Information can be found on the Internet on how to configure the stack in your development environment.)

• Java 1.6
• hadoop-0.21.0: For data-intensive, fault-tolerant distributed computing
• apache-tomcat-7.0.22: Web server and servlet engine for hosting Solr
• apache-solr-1.4.1: Search platform
• zookeeper-3.3.4: Distributed consensus service

We are trying to index Log data using Hadoop and Solr. For simplicity, assume the following is our Log entry structure.

Log ID  | Log Code | Log Partition | Log Body
ABCD123 | 9980     | 10.10.10.10   | Some Message

• The Log ID uniquely identifies one log entry and is searchable.
• The Log Code is some identifier of a log entry which is searchable but may not be unique.
• The Log Partition partitions a log entry based on, for instance, the origin of the log file (e.g., log from Server 1 or Server 2). This is also searchable. This partition will be used by Hadoop to distribute the log document to the respective Reduce task.
• The Log Body contains the log message details. This is also searchable.

Each of these four fields is separated by one tab space.

One Log file may contain hundreds of thousands of such entries separated by a new line.

Recalling the Indexing process we described earlier, let’s try to map each of the phases into the Hadoop system.

7 http://lucene.apache.org/solr/


Text Acquisition
This is the phase which identifies and acquires the documents for the search engine. Assume that you have a large data center with hundreds of servers which produce terabytes of log files that need to be indexed. The Text Acquisition system can be some kind of distributed log collection and merging framework which collects and consolidates log files from distributed servers. Text Acquisition uses a document/data store to store the files. In Hadoop, we use HDFS for storing large data files. We can think of the Text Acquisition component as a log collection and merging framework which collects the logs from various sources, merges them into a large log file, and then pushes it into HDFS for further processing.

Remember, Hadoop is not efficient for small files. Hadoop performs optimally when the file size is at least 64 MB, i.e., the HDFS block size. If we push millions of small log files into HDFS, performance will degrade. Hadoop produces one Map task for every input split. Since a small log file of 5 MB cannot even consume 10% of the default HDFS block size of 64 MB, a huge amount of block space will be wasted, and there will be millions of Map tasks, one per block, each operating on a tiny input split. Effectively, the I/O and task-scheduling overhead will be much larger than the actual processing of data. Also, Hadoop stores the metadata of each block in Name Node memory. If we assume that every log file fits into one block with very little space utilization, and every file and its block together consume on the order of 300 bytes of Name Node memory, 10 million log files will consume about 3 GB of Name Node memory. Scaling much beyond this limit on present hardware configurations is not practical. HDFS loaded with billions of small files will definitely not work.

To solve this small file problem, there are various open source scalable tools available to collect and merge logs. However, this article does not cover log collection and merging techniques.

To simulate the Text Acquisition phase, a program is used to generate large log files with millions of random log entries in the format mentioned earlier; these files are then pushed into HDFS using the Hadoop FileSystem API, as in the sketch below.
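A minimal sketch of such a load step, assuming a locally generated log file; the NameNode URI and paths are hypothetical, and in practice they come from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem hdfs = FileSystem.get(conf);

        // Copy the locally generated (and pre-merged) log file into HDFS.
        Path localLog = new Path("/tmp/generated/biglog.txt");   // hypothetical local path
        Path hdfsDir  = new Path("/data/logs/");                 // hypothetical HDFS input dir
        hdfs.copyFromLocalFile(localLog, hdfsDir);
        hdfs.close();
    }
}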

Text Transformation
Text transformation is the parser phase of the indexing process, which identifies and populates the index documents. In this phase, we tokenize the given input, extract the necessary field values, and construct the document.

This is the phase where index documents are built for the given Big Data. This phase parallelizes very well because building a document from one chunk of data does not depend on building a document from another chunk of data. We can write a Hadoop Map task that parses and extracts the data from input splits and constructs the SolrDocument, the document construct for Solr.

Let us see the Map side code snippet for doing this.

Solr Index Mapper
Below is a very simple Map task. Do not be concerned with the Hadoop API details; just concentrate on the core logic and the signature of the map method. The map method takes the input key/value pair from the calculated input split described earlier. By default, the input key is the file offset within the input file and the value is the line at that offset. In our log file structure, every log entry is contained in a single line. In the Map task, we ignore the input key because we do not care about the offset of the log entry; only the input value (the log entry) is parsed by a parser and converted to a SolrDocument object.
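Since the original code snapshot is not reproduced in this text, here is a minimal sketch of such a Mapper under the assumptions described above (tab-separated log entries, a LogParser helper, and a JSON helper for serialization); class and helper names are illustrative, not the author's exact code.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text logEntry,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The file offset (key) is ignored; only the log line (value) matters.
        SolrInputDocument doc = LogParser.parse(logEntry.toString()); // parser helper, sketched below
        if (doc == null) {
            return; // skip malformed entries
        }
        // The partition field becomes the intermediate key so that all documents
        // of one partition end up in the same Reduce task.
        String partition = (String) doc.getFieldValue("partition");
        String json = JsonUtil.toJson(doc); // hypothetical JSON marshalling helper
        output.collect(new Text(partition), new Text(json));
    }
}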


This is a snippet of the parser code, which generates the document object after extracting the necessary fields from the log entry.
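Again as an illustrative sketch (the exact parser is not reproduced here), splitting the tab-separated log entry and building the Solr input document might look like this; the field names are assumptions that must match the schema shown later.

import org.apache.solr.common.SolrInputDocument;

public class LogParser {
    // Expects: Log ID <tab> Log Code <tab> Log Partition <tab> Log Body
    public static SolrInputDocument parse(String logEntry) {
        String[] fields = logEntry.split("\t");
        if (fields.length < 4) {
            return null; // not a well-formed log entry
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fields[0]);         // field names must match schema.xml
        doc.addField("code", fields[1]);
        doc.addField("partition", fields[2]);
        doc.addField("body", fields[3]);
        return doc;
    }
}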

In the Mapper code, the SolrDocument is converted into JavaScript Object Notation (JSON) and emitted for further processing. JSON is used for marshalling and unmarshalling the SolrDocument object between the Map and Reduce tasks as it is lightweight and faster to process than more verbose data interchange formats such as XML.8

The partition field of the document is used as the intermediate key and the JSON representation of the SolrDocument as the value of the Map output. As explained earlier, the Hadoop MapReduce engine will group the values of the same output key and give them to a Reduce task; this ensures that a given partition will go to a specific Reducer.

Index Creation
After the Map tasks are completed across all nodes in the Hadoop cluster and partitioning and shuffling are done, the Reduce tasks come into action. We can configure the number of Reduce tasks for a MapReduce job.

We will do the final indexing part in the Reducer nodes, which will combine all the documents for a given partition key in parallel and create an index. Lucene cannot write its index to HDFS; it requires a Windows or Linux file system or some NFS/NAS file system. For this example, the Lucene index is created on the Windows file system. In large-scale clusters we can use NAS storage to which all the Reducer nodes have access and where each creates its respective index.

It is important to understand that if we set the number of Reduce tasks to N, there will be N sets of indexes created. How those N sets of indexes are merged at the end is an open question for now (which we will answer later), but before that we need to create the indexes in a separate directory for each of the N Reduce tasks so that the index created by the i-th Reduce task does not get overwritten by the j-th Reduce task. To achieve this, we can use the Reduce task ID, which uniquely identifies a Reduce task.

8 http://www.json.org/xml.html


When a job is submitted to the MapReduce engine, a Job ID is created. We already know that for a given job there will be some Map tasks and some Reduce tasks. Each of the tasks is also given a unique identifier by the MapReduce engine. Now, if we create a directory in the target file system for a job, and under that we create directories for every Reduce task, we solve the problem of one reducer overwriting the indexes of another.

But wait! Let's now try to understand who creates the index files and who writes those files to the index directory. We are not writing any code in the Reducer to create the directory in NFS/NAS storage; it is done by the Solr indexer. But how will the Solr indexer know the location of the index directory and the details of the Reduce task (like the task ID) which created the index? During the indexing process, we just give a set of SolrDocuments to the Solr indexer; we cannot give any information about the Reducer task the set of SolrDocuments came from. Another problem: we cannot run one single Solr server centrally and have each of the Reducer nodes access it to write their respective indexes. This design will not scale. So what is the solution?

Here comes the concept called EmbeddedSolrServer. This is basically a way to "embed" Solr into a Java application. It provides the exact same API as you would use if you were connecting to a remote Solr instance, which makes it easy to convert later if you'd like; and by using it you can be sure you will be using API calls that will be supported in the future.9

Now the question is how would you embed Solr into a Reduce task? To understand this we need to look at the way the Hadoop job is packaged.

A Hadoop job mainly consists of the Mapper class, the Reducer class, and one Driver class. The Mapper and Reducer classes have the logic to perform the Map and Reduce tasks. We have already seen the Mapper class; we will see the Reducer class shortly. The Driver class is basically the job client which submits the job to the MapReduce engine.

To execute a job in Hadoop, the necessary classes are bundled in a JAR file which can be launched from the command line as shown below. Here the jar file name is solrindex.jar and the Driver class name is SolrIndexDriver.
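The original command-line snapshot is not reproduced here; the submission would look roughly like this, where the input and output HDFS paths are hypothetical:

hadoop jar solrindex.jar SolrIndexDriver /data/logs /data/output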

9 http://wiki.apache.org/solr/EmbeddedSolr


This job submission does many steps internally: creating a job ID, identifying the input splits, copying the jar file to HDFS, and telling the JobTracker that the job is now ready to run. When the JobTracker schedules a task (Map or Reduce) to an available TaskTracker node, the TaskTracker copies the jar file from HDFS to the local file system and spawns a new Java Virtual Machine (JVM) in which it executes the task for the given input split.

If a Map or Reduce task needs additional jars as dependencies, they need to be bundled into the job jar file so that when a new JVM executes the task, it can resolve the references to the dependent jars. Similarly, if the Reduce task needs to write the index using EmbeddedSolrServer, the embedded Solr needs to be bundled along with the job jar file. Each Reduce task will have its own copy of the job jar file and hence its own copy of the EmbeddedSolrServer. As the EmbeddedSolrServer runs within the Reduce task, it has the details of the Reduce task ID and creates its own copy of the index files in the dedicated directory for that reducer. But for doing that, EmbeddedSolrServer requires some configuration settings, which we will see shortly.

Here is the snapshot of the MapReduce project structure in Eclipse IDE with Mapper, Reducer, Drivers, and other helper classes such as Parser, dependency jars, and Embedded Solr Server.


In the diagram to the left, HadoopSolrIntegration is the Java project in Eclipse. It has classes such as Mapper, Reducer, Driver, and Parser. The DistributedMergeQueue integrates with ZooKeeper; we will see the ZooKeeper integration later in this document.

If a Map or Reduce task needs to reference any additional jars or the EmbeddedSolrServer, they need to be placed under the lib folder. As you can see here, the lib folder contains a solr folder. This solr folder is nothing but the embedded Solr for the job. Within solr there are conf and data folders. The data folder is meant to hold index files; we will not be using this data folder but will configure the EmbeddedSolrServer to write the index to external storage. The conf folder contains two important files: schema.xml and solrconfig.xml. The schema.xml needs to be edited to include the field details of a SolrDocument which need to be indexed. The solrconfig.xml is used to configure and tune Solr.

The remainder of the jars shown under lib are dependency jars for the MapReduce job. As you can see, there are Lucene jars which are required for embedded Solr, and some jars related to ZooKeeper integration.

Let us now briefly explore the schema file inside our EmbeddedSolrServer. We can customize the search capabilities by editing the schema.xml file. There are two main tags in this XML file: <types> and <fields>. The <types> tag contains the basic data types which we need to define a field. The <fields> section is where you list the individual field declarations you wish to use in your documents. Each field has a name that you will use to reference it when adding documents or executing searches, and an associated type which identifies the name of the field type you wish to use for this field. There are various field options that apply to a field. These can be set in the field type declarations, and can also be overridden at an individual field's declaration.10

Our Solr Schema is very simple, containing four fields we defined earlier.
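As a sketch only (the article's schema snapshot is not reproduced here), the <fields> section for the four log fields might look like the following; the field names match the parser sketch above, and the string and text field types are assumed to be defined in the <types> section of the default Solr 1.4 schema.

<fields>
  <!-- Exact-match fields: string type, no analysis -->
  <field name="id"        type="string" indexed="true" stored="true" required="true"/>
  <field name="code"      type="string" indexed="true" stored="true"/>
  <field name="partition" type="string" indexed="true" stored="true"/>
  <!-- Analyzed field: text type, tokenized and filtered -->
  <field name="body"      type="text"   indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>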

The following is the meaning of “indexed” and “stored” attribute of field tag.

• indexed=true|false
  o True if this field should be "indexed". If (and only if) a field is indexed, then it is searchable and sortable.
• stored=true|false
  o True if the value of the field should be retrievable during a search.

The difference between the text field and string field types is that a text field uses a word delimiter filter to enable splitting and matching of words on case changes, alpha-numeric boundaries, and non-alphanumeric characters, so that a query of "wifi" or "wi fi" can match a document containing "Wi-Fi". A string field, on the other hand, does not use any analyzer.

There are many more advanced features available to customize the indexing; you can find good information in solr wiki to customize the schema for your need.

10 http://wiki.apache.org/solr/SchemaXml


Solr Index Driver
Let's now look at the Hadoop Driver class which submits the Hadoop job to the MapReduce engine.

This is the driver class which constructs the job configuration object with details such as the Mapper class, Reducer class, file input and output formats, and so on. The input path is set to the path where the Big Data files are loaded in HDFS. The output path is where a Reduce task would normally write its output to HDFS; in our scenario, as the Reduce task writes the index directly to the local or network file system, the output path setting is ignored by the Reduce task. There are a couple of interesting pieces of information we set in the JobConf. First is the location of embedded.solr.home. This location is required to connect to an instance of the Solr server. As we are using the EmbeddedSolrServer, which sits under the lib folder of the Hadoop job jar, we set the relative path of the embedded Solr server location, which will be used in the Reduce task. This is the path, relative to the reducer class, where the jar is exploded on the individual TaskTracker node. Second, we set index.dir, which is the location of the index directory for the given job. Each Reduce task will create its own task-level folder under this job-level index directory to make sure parallel Reduce tasks do not overwrite each other's indexes. In the Driver class there are also a few lines of code related to ZooKeeper integration, which will be explained later in this document. A sketch of such a driver follows.

Solr Index Reducer
Let's now look at the Reducer class which creates the index and writes it to the specified directory. For now, we will not go into the Hadoop API details for writing a Reduce task; just concentrate on the core logic for index creation.


The Reducer code is invoked by the MapReduce engine after the Map tasks are over. Each Reduce task is given an input key and a value list. The key for a Reduce task is the partition ID from the log file, and the value list is basically a list of SolrDocuments for that partition. The SolrDocuments are encoded in JSON format, as we saw in the Mapper code. In the Reducer, we decode the JSON back into SolrDocument objects.

There are some portions of the Reducer code which need explanation.

The first part, in the configure method, gets the details of SOLR_HOME and index.dir. We have already set those values from the Driver class. The remaining two config items (mapred.tip.id, which holds the Task ID, and mapred.job.id, which holds the Job ID) are populated by the MapReduce engine.

The next block sets solr.solr.home and solr.data.dir as system properties. The Embedded Solr Server picks up the Solr directory location from the solr.solr.home system property. The Solr data directory is the location where the index files will be created for a Reduce task. As you can see, the solr.data.dir value is constructed by concatenating index.dir + Job ID + Task ID.

Finally, in the Reducer code, the EmbeddedSolrServer object is created and we begin adding all SolrDocument objects into the server.

At the end, we commit to flush all indexes to disk and close the SolrCore. A sketch of such a Reducer follows.
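A minimal sketch under the assumptions above (Solr 1.4 EmbeddedSolrServer created from a CoreContainer, a hypothetical JsonUtil helper, and the core name "core0" as an assumption); this is illustrative, not the author's exact code, and the embedded-server bootstrap API differs in later Solr versions.

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class SolrIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private EmbeddedSolrServer solr;
    private CoreContainer coreContainer;

    @Override
    public void configure(JobConf conf) {
        String solrHome = conf.get("embedded.solr.home");   // set by the Driver
        String indexDir = conf.get("index.dir");             // set by the Driver
        String jobId    = conf.get("mapred.job.id");           // populated by the MapReduce engine
        String taskId   = conf.get("mapred.tip.id");            // populated by the MapReduce engine

        // Point the embedded Solr at its config and at a task-specific data directory.
        System.setProperty("solr.solr.home", solrHome);
        System.setProperty("solr.data.dir",
                indexDir + File.separator + jobId + File.separator + taskId);
        try {
            coreContainer = new CoreContainer.Initializer().initialize();
            solr = new EmbeddedSolrServer(coreContainer, "core0"); // core name is an assumption
        } catch (Exception e) {
            throw new RuntimeException("Could not start embedded Solr", e);
        }
    }

    public void reduce(Text partition, Iterator<Text> jsonDocs,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            while (jsonDocs.hasNext()) {
                // Decode the JSON back into a Solr document and index it.
                SolrInputDocument doc = JsonUtil.fromJson(jsonDocs.next().toString()); // hypothetical helper
                solr.add(doc);
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        try {
            solr.commit();            // flush the index to disk
        } catch (Exception e) {
            throw new IOException(e);
        } finally {
            coreContainer.shutdown(); // closes the SolrCore
        }
    }
}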

Let us now see what the index directory structure looks like after the Reduce tasks. For our MapReduce job, index.dir is set to, for example, /x/y/generated_index. Three Hadoop jobs are executed to create indexes for three sets of log data; thus you can see three job folders (starting with job_). For each job there is a single Reduce task folder (starting with task_). If a given job has multiple Reduce tasks, you will see multiple task folders under the job folder. The index folder holds the actual index files for the given task. Merging of the indexes generated by different Reduce tasks is done outside the Hadoop job. This merging process is coordinated by ZooKeeper.

What is ZooKeeper?
ZooKeeper is a coordination service for distributed applications. It exposes a simple set of APIs that distributed applications can use to build higher-level services for synchronization, configuration maintenance, and so on. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.11

Coordination services are very complex to implement. If not done correctly, they can create issues such as deadlock and race conditions. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch. The concept of ZooKeeper is very simple: the ZooKeeper server maintains a hierarchical namespace similar to a file system, as shown in the figure below. The construct which stores the hierarchical namespace information is called a znode.

In this figure there is a root znode (/) and under that there are two znodes (/zoo1 and /zoo2). A znode has some properties, described below, which control access to it; a minimal example of creating znodes with the ZooKeeper Java API follows the list.

• A client can create a znode, store up to 1 MB of data in it, and associate as many child znodes with it as it wants. Data access (read or write) on a znode is always atomic.
• A znode can be one of two types: ephemeral or persistent.
  o Ephemeral znodes are deleted by ZooKeeper when the creating client's session is closed.
  o Persistent znodes stay as long as they are not deleted explicitly.
• Each znode has an Access Control List (ACL) that restricts who can do what.
• Znodes maintain version numbers for data changes, ACL changes, and timestamps.
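A minimal sketch of creating persistent and ephemeral znodes with the ZooKeeper Java API; the connection string and paths are illustrative only.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server; connection events are ignored for this sketch.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) { /* no-op */ }
        });

        // Persistent znode: stays until deleted explicitly.
        zk.create("/zoo1", "some data".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this client's session closes.
        zk.create("/zoo1/worker", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close();
    }
}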

11 http://zookeeper.apache.org/doc/trunk/zookeeperOver.html

Index Merger
Let us now see how ZooKeeper can be integrated to perform the automatic index merging operation as soon as the Hadoop job is completed. We have learned that a given job can have multiple reducers, and all reducers create their own indexes in dedicated folders. Once the job is completed, we need to merge all those indexes into the target Solr installation where the end user will perform the search. The schema of the target Solr installation should be the same as the schema of the embedded Solr server. This is understandable because we need to perform the search on the indexes produced by the embedded Solr server, and both schemas have to match. The figure below provides an overview of the architecture.

Architecture overview: the Hadoop cluster (Name Node, JobTracker, and TaskTrackers running embedded Solr) writes per-task indexes to an index directory on external storage; after job completion ZooKeeper is notified (1), ZooKeeper triggers the Index Merger (2), and the Index Merger merges the indexes into the target Solr installation running on Tomcat (3).

Here, after the MapReduce job is completed and all Reduce tasks have written their indexes to the dedicated index directory, the job client notifies the ZooKeeper server, which initiates the merging process.

Zookeeper Integration
The ZooKeeper API is used to implement a distributed queue. The queue is nothing but a persistent znode in the ZooKeeper server. This queue znode holds other znodes as its children, which can be treated as the elements of the queue. When the MapReduce job is completed, it publishes the information about the index directory under the queue znode. The IndexMerger is the consumer of the queue, which monitors the queue for new elements. As and when information about an index directory is added to the ZooKeeper queue, the IndexMerger is notified and gets the details of the directory where the index was written. It then merges the indexes into the target Solr deployment. The whole concept is basically the classic producer-consumer solution: until the producer produces an item, the consumer waits; when the producer produces an item, the consumer is notified and consumes it. Let us now see the implementation of the whole logic.


Recalling the MapReduce Driver class, we specified the ZooKeeper server IP and port number and the logic for the distributed queue. Let us take a look at the relevant portion of the code below.

Here, Zookeeper service is running on localhost. The queue implementation connects to the Zookeeper server and creates a queue znode called “indexq”.

The queue API has a method to add/produce an element and associated data into the queue. Internally, it adds the new znode under the “indexq”.
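A simplified sketch of what the produce side of such a queue might do with the raw ZooKeeper API; the article's DistributedMergeQueue helper wraps logic like this, and the method names here are illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class IndexQueueProducer {
    private static final String QUEUE = "/indexq";
    private final ZooKeeper zk;

    public IndexQueueProducer(ZooKeeper zk) throws Exception {
        this.zk = zk;
        // Create the persistent queue znode if it does not exist yet.
        if (zk.exists(QUEUE, false) == null) {
            zk.create(QUEUE, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
    }

    // Add one element: the child znode name carries the job ID, its data the index directory.
    public void produce(String jobId, String indexDir) throws Exception {
        zk.create(QUEUE + "/" + jobId, indexDir.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
}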

As you can see in driver code, the element produced in the queue is the MapReduce Job ID, and value is the index directory location of the job. Driver Class receives the Job ID details from the RunningJob object after the job is submitted to the MapReduce engine. The figure below shows the details of the Zookeeper queue after a couple of MapReduce job runs.


This is the snapshot of a ZooKeeper plug-in in Eclipse. You can see the "indexq" created under the root (/) node of ZooKeeper. Under indexq there are two job elements (znodes). The value of the highlighted job element is the location of the index directory for that job. If you recall, under this job directory there are task directories which actually hold the indexes. The consumer part of the queue, which runs on a different server, will now read this znode data and use the Lucene API to merge all the indexes generated under this job directory.

The consumer is a very simple queue consumer and index merger routine. It connects to indexq and waits for data in the queue. Once the queue has data, it consumes the element and gives it to IndexMergeJob, which uses the Lucene API to merge all index directories available under the job directory with the index of the target Solr server. A sketch of such a consumer follows.
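A simplified sketch only, not the author's code: it polls the queue (a real consumer would use a ZooKeeper watch) and merges with Lucene's IndexWriter. The Lucene calls shown are the 2.9.x style bundled with Solr 1.4.1; later Lucene versions use IndexWriterConfig and addIndexes(Directory...) instead.

import java.io.File;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.zookeeper.ZooKeeper;

public class IndexMergeConsumer {
    private static final String QUEUE = "/indexq";

    // Blocks until a job element appears in the queue, then merges its indexes.
    public void run(ZooKeeper zk, String targetSolrIndexDir) throws Exception {
        while (true) {
            List<String> jobs = zk.getChildren(QUEUE, false);
            if (jobs.isEmpty()) {
                Thread.sleep(5000);   // polling for simplicity; a watch avoids this
                continue;
            }
            String jobId = jobs.get(0);
            String jobIndexDir = new String(zk.getData(QUEUE + "/" + jobId, false, null));
            mergeJobIndexes(jobIndexDir, targetSolrIndexDir);
            zk.delete(QUEUE + "/" + jobId, -1);   // element consumed
        }
    }

    // Merge every task-level index under the job directory into the target Solr index.
    private void mergeJobIndexes(String jobIndexDir, String targetSolrIndexDir) throws Exception {
        File[] taskDirs = new File(jobIndexDir).listFiles();
        Directory[] sources = new Directory[taskDirs.length];
        for (int i = 0; i < taskDirs.length; i++) {
            sources[i] = FSDirectory.open(new File(taskDirs[i], "index"));
        }
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File(targetSolrIndexDir)),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addIndexesNoOptimize(sources);   // addIndexes(Directory...) in newer Lucene
        writer.optimize();
        writer.close();
    }
}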

This article does not go into the details of the distributed queue implementation using the ZooKeeper API. You can find examples of interesting ZooKeeper-based solutions at http://zookeeper.apache.org/.


There are different options available to merge indexes.12 For this example, IndexMergeTool is used, which is part of the lucene-misc.jar file. Here is the usage of IndexMergeTool.

Usage: IndexMergeTool <mergedIndex> <index1> <index2> [index3 ...]

This merges multiple index folders and generates the final index in the <mergedIndex> folder.

Once the indexes are merged, we can use the Solr query features/API to retrieve the data.13 For testing the validity of the generated index, we can also use Solr admin console.

Architecture Scalability
Up to this point, we have defined the architecture for parallel indexing using Hadoop and Solr. Let us now discuss a few scalability aspects of the given solution. Scalability means how well a system can support the growth of data and traffic. A system is considered scalable if there is minimal impact on performance with larger data sets and higher traffic. Also, system complexity and maintenance should not increase as it scales.

Hadoop is already a proven scalable architecture. Hadoop cluster performance increases with the addition of TaskTrackers and Data Nodes. But this has a limit: as you know, the Hadoop Name Node stores information about the Data Nodes and blocks in memory, so it cannot scale unbounded. There is a lot of tuning we can do to make Hadoop perform better. We will look at some of the Hadoop tuning options in the next section.

12 http://wiki.apache.org/solr/MergingSolrIndexes
13 http://wiki.apache.org/solr/CommonQueryParameters


In our architecture, a major scalability bottleneck could be accessing the NAS/External storage device from individual Reducer nodes for writing the indexes. This bottleneck can be overcome with high-speed Ethernet and high-performance NAS storage.

Another bottleneck in our indexing architecture could be Solr itself. If the index grows so big as to become unmanageable by a single Solr instance, we need to think about scaling Solr. There is a concept called Distributed Search in Solr: when an index becomes too large to fit on a single system, or when a single query takes too long to execute, the index can be split into multiple shards, and Solr can query and merge results across those shards.14

In our architecture, distributed search is easier to accomplish as we already have the index partitioned by Reducer task. If you remember, our index creation is divided based on the partition key, i.e., each reducer creates indexes for a specific partition key. We can even implement our own Partitioner logic to override the default hash Partitioner of Hadoop for better control of the shuffling and partitioning of keys after the Map task. We also have to make the IndexMerger module intelligent enough to merge indexes for a given partition only into its dedicated Solr shard.

The diagram below depicts the shard concept.

Here you can see three Reduce tasks configured for three partitions. Each writes its index to a dedicated directory. There are three Solr shards configured, one for each of the partitions. The index merger merges each index into its shard.

Solr Cloud is another concept we can adopt for scaling solr.

14 http://wiki.apache.org/solr/DistributedSearch


Solr Cloud is the set of Solr features that take Solr's distributed search to the next level, enabling and simplifying the creation and use of Solr clusters.15 In Solr Cloud we can:

• Centralize the configuration for the entire cluster
• Perform automatic load balancing and fail-over for queries

Solr Cloud uses Zookeeper to coordinate and store the configuration of the cluster. As we already have ZooKeeper integration in this architecture, we can leverage the same Zookeeper server for Solr Cloud configuration. Refer to the Solr Wiki on how to configure the Solr Cloud.

Performance Tuning
Hadoop Tuning
Hadoop performance tuning is a very advanced topic which requires detailed knowledge of how the Hadoop framework works. Here we will discuss a few options you may need to consider to tune your Hadoop cluster.

In Hadoop there are properties which set the maximum number of simultaneous Map or Reduce tasks that can execute on a given TaskTracker node. These are called TaskTracker slots, and the default is 2. That means at most 2 Map and 2 Reduce tasks can execute on a TaskTracker node simultaneously. We can tune these slot values so that they fully utilize the number of processing (CPU) cores available on the TaskTracker machine. If the number of cores available is 4 (quad-core CPU), we can set the number of Map/Reduce slots to 4. As Map and Reduce tasks run in parallel, the cores will be utilized properly if the number of slots is equal to or greater than the number of CPU cores. A sketch of the relevant configuration follows.
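A minimal sketch of the slot settings, placed inside the <configuration> element of mapred-site.xml; the property names apply to Hadoop 0.20/0.21/1.x, and the value of 4 assumes a quad-core TaskTracker node.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>   <!-- default is 2 -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>   <!-- default is 2 -->
</property>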

We should keep the input file size as large as possible to utilize the HDFS block size. The default block size is 64 MB, and 128 MB is also a good block size. The size of the files loaded into HDFS should be significantly higher than this value.

If you recall, a Hadoop job consists of the following steps: distribute data to HDFS, perform the Map tasks, write the output of the Map tasks, shuffle the output keys, sort the keys, perform the Reduce tasks, and write to HDFS. Each of these steps can become a bottleneck for a large job. The figure below shows the flow of data in a Hadoop cluster. Let's look at what Hadoop does during each step and what tuning options are available.16

15 http://wiki.apache.org/solr/SolrCloud
16 Hadoop: The Definitive Guide (book)


When a Map task produces output, it buffers the data in memory before writing it to disk. Each Map task has a circular memory buffer of 100 MB (default). When the content of this buffer reaches a certain threshold, it spills the content to disk. Map output continues to be written into the buffer while the spill happens. Spills are written to the local node file system (not HDFS). Another interesting thing happens before the data is written to disk: the data is divided into partitions corresponding to the Reduce tasks it will be sent to, and within each partition an in-memory sort happens on the key before it is written to disk. This sorting and spilling continues whenever the buffer threshold is reached. During every spill, a new spill file is created. Each spill file is partitioned, and keys are sorted within each partition. When the Map task is completed, the spill files are merged into one single partitioned and sorted file.

The tuning options available during this Map process control this buffering and spilling. There are properties to set the in-memory buffer size and the spill percentage (i.e., how full the buffer must be for spilling to kick off), and properties to tune the number of spill files merged at once to produce the final sorted file; a sample follows.
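A sketch of the map-side properties, again inside mapred-site.xml; these are the Hadoop 0.20/1.x names (newer releases rename them under the mapreduce.* prefix), and the values shown are illustrative.

<property>
  <name>io.sort.mb</name>             <!-- in-memory sort buffer size, default 100 MB -->
  <value>200</value>
</property>
<property>
  <name>io.sort.spill.percent</name>  <!-- buffer fill level that triggers a spill, default 0.80 -->
  <value>0.80</value>
</property>
<property>
  <name>io.sort.factor</name>         <!-- number of spill files merged at once, default 10 -->
  <value>20</value>
</property>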

Let us now see what happens on the Reduce side. After the Map process completes, the Map output files sit on the TaskTracker node that produced them. Now the "copy" phase starts, during which a sorted partition from a Map output file gets copied into a Reducer TaskTracker's memory. As you can imagine, different sorted partitions for the same key can come from different Mappers. Buffering in memory and spilling to disk also happen here: multiple partitions from different Mappers are copied into the Reducer's memory, and when memory reaches a threshold, partitions are merged and spilled to disk. This spilling can happen multiple times during the copy phase. A background process merges these spills to save time during the final merge.

Multiple spill files are available on disk after the copy phase completes. Now the final merging starts. This merging happens in rounds, based on the merge configuration.

For the Reduce phase, the possible tuning options are to control the number of threads that copy Map output to the Reducer node in parallel, to configure the input buffer used to hold the copied files in memory, and, as on the Map side, to control the spill threshold. Finally, we can control how many files are merged together to produce the final set of files for the Reduce task. A sample of these properties follows.
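A sketch of the corresponding reduce-side properties in mapred-site.xml; again these are the Hadoop 0.20/1.x names, and the values are only examples.

<property>
  <name>mapred.reduce.parallel.copies</name>           <!-- copier threads per reducer, default 5 -->
  <value>10</value>
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name> <!-- share of reducer heap that buffers map outputs, default 0.70 -->
  <value>0.70</value>
</property>
<property>
  <name>mapred.job.shuffle.merge.percent</name>        <!-- buffer fill level that triggers a merge/spill, default 0.66 -->
  <value>0.66</value>
</property>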

Solr Tuning
Thus far, we have seen some of the internal workings of Hadoop and how to tune Hadoop for better performance. Let us now quickly look at a few tuning options available for Solr.

Previously, we discussed how Lucene buffers documents in memory before flushing them to disk and merges segments on disk when the number of segments reaches a certain threshold. This is controlled by two properties: ramBufferSizeMB and mergeFactor. The RAM buffer size controls how many documents can be accommodated in memory before being written into a single segment file; the higher the buffer size, the better the indexing performance. The mergeFactor roughly determines the number of segments: it tells Lucene how many segments of equal size to build before merging them into a single segment.17 The tradeoffs are listed below, followed by a sample configuration.

mergeFactor Tradeoffs

• High merge factor (e.g., 50):
  o Pro: Generally improves indexing speed.
  o Con: Less frequent merges, resulting in a collection with more index files, which may slow searching.
• Low merge factor (e.g., 5):
  o Pro: Smaller number of index files, which speeds up searching.
  o Con: More segment merges slow down indexing.
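A sketch of where these knobs live in a Solr 1.4 solrconfig.xml (the same settings can also appear in the <mainIndex> section); the values shown match the ones used in the POC run below.

<indexDefaults>
  <ramBufferSizeMB>500</ramBufferSizeMB>  <!-- buffer this much document data in memory before flushing a segment -->
  <mergeFactor>20</mergeFactor>           <!-- merge once 20 segments of roughly equal size accumulate -->
</indexDefaults>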

17 http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor


Final Execution
In this final section we will see a couple of executions of this indexing process on a single-node Hadoop cluster. Running Hadoop on a single-node cluster does not give any benefit of parallelism, but a demo run on a single node at least validates the correctness of the design and can be used as a Proof of Concept (POC). To test the scalability of the design, you need a multi-node cluster setup.

Here the indexing design is tested on a Dell 32-bit laptop with an Intel i5 M 560 CPU @ 2.67 GHz and 3.24 GB RAM, where Hadoop runs on Cygwin along with ZooKeeper and Tomcat on the same machine.

Run 1: Hadoop HDFS is loaded with 1 million log entries created programmatically. Here is the snapshot of a few lines taken from the randomly generated Log entries.

ID | Code | Partition | Body

• 1st column: Log ID
• 2nd column: Log Code
• 3rd column: Log Partition ID
• 4th column: Log Body

The basic tuning used:

• JVM heap size for Map/Reduce tasks set to 1024 MB.
• Total buffer size for in-memory sorting set to 200 MB.
• mergeFactor on the Solr side set to 20.
• Solr RAM buffer size set to 500 MB.

It took around 7.5 minutes to create the indexes for 1 million log entries by Hadoop MapReduce and around 35 seconds by IndexMerger module to merge with the target Solr installation.


Index statistics taken from Solr validate that a total of 1 million documents finally went into Solr.

Run 2: The second run is executed with another 1 million log entries, but this time the Solr RAM buffer size is set to 1024 MB.

While this time it took around 6 minutes to create the indexes, it took 50 seconds to merge the index with target Solr as the target Solr already had 1 million documents from the previous run.

See the index statistics which shows the 2 million documents.

Searching these 2 million documents by ID, Code, Partition, or Body will return the exact document within milliseconds. Let us see a query example.

Query: http://localhost:8080/solr/core0/select/?q=ABC&version=2.2&start=0&rows=10&indent=on

This query says, search for “ABC” in all fields of all Documents. Below are matches...


• The query took 234 ms to return.
• There are 14 matches among the 2 million documents.
• Some of the matched fields are highlighted in the result snapshot.

Summary
As we have seen in the POC, with a single machine having only ~3 GB of memory and minimal tuning, we can index 1 million records in roughly 6 to 7.5 minutes and search across 2 million documents within about 250 ms. As we know, Hadoop scales horizontally: more machines can be added to the cluster, and with a good amount of memory and processing power this indexing can be done much faster. If we have an efficient Text Acquisition component which collects and merges log files as they arrive and pushes them to HDFS for index building, the data will be available within minutes in the target Solr server for search. Obviously, the time to build the index depends on many factors including the cluster configuration, network bandwidth, complexity of the index requirements, proper tuning of the cluster, and so on.

In this Knowledge Sharing article we have shown how to perform parallel indexing of Big Data. We discussed Hadoop and its subsystems and how Hadoop MapReduce works, and we looked at some tuning options for Hadoop. The article covered the concepts of Information Retrieval and the popular open source search technologies Solr and Lucene. We presented a design to integrate Hadoop and Solr and demonstrated the functionality of the distributed consensus service ZooKeeper.

This article does not cover the details of Hadoop, Solr, Lucene, and ZooKeeper programming or the more complex features of the respective technology stacks. Performance numbers for large-scale indexing on multi-node clusters are also not evaluated in this article.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
