Getting Started With Apache Hive

By Reeshu Patel


Introduction – What is Hive

Apache Hive is a data warehouse system built on top of Hadoop to query Big Data. Hive originated at Facebook and was open sourced in August 2008. The challenge Facebook had to address is one faced by many companies since then. Eventually, data growth in a company challenges the capabilities of deployed RDBMS or NoSQL systems. Reports and analytics start to take minutes, then hours, and eventually overlap with other queries until the whole system grinds to a halt. In another common scenario, a company starts processing big data with Hadoop and then discovers the value of making the data accessible beyond the development team capable of writing complex map-reduce jobs.

What is Big Data

The term Big Data is used freely in this context. A colleague jokingly defined Big Data as anything beyond one million rows – Microsoft Excel's row limit. The underlying point is that Big Data is a point of view and can be generalised as the point where simple solutions and deployed technology fail. The subsequent question is whether scaling that technology, and investing heavily in it, is the most economical solution. Commercial large-scale data warehouse solutions are very expensive. Furthermore, some of the data collected today, e.g. poorly structured or highly denormalized data, can be impractical to manage with these systems. The Hadoop ecosystem is regularly used to scale data processing in a feasible manner. Hadoop becomes either a replacement for, or a batch-processing addition to, the existing infrastructure for data analysis, extraction, loading, transformation, reporting, and machine learning.

Accessing Big Data


The downside, which Facebook encountered, was that data stored in Hadoop is inaccessible to business users. There are higher-level languages like Pig, Cascading, Crunch or Scalding, but they are geared towards software developers who want to avoid the verbosity of pure Java map-reduce development. Even Pig, a popular data-flow language, still requires users to learn a completely new skill. Facebook realised that most of their users already had a common skill – they knew SQL. Hive was developed to give access to data stored in Hadoop by translating SQL-like statements into complex map-reduce jobs that read and process data at large distributed scale.

Democratising Big Data

Hive is a success story. Today, Facebook's largest Hadoop cluster consists of thousands of computers providing a combined storage of 150 Petabytes – roughly 150,000,000 Gigabytes. Hive gives literally thousands of employees access to this data, and together they run thousands of map-reduce jobs every day through Hive. The additional training needed for employees who already know SQL is minimal: most statements can be expressed equivalently, and full SQL support is coming to Hive soon.

What Hive is not

Hive does not use sophisticated indexes like many RDBMS, which are able to answer queries in seconds. Hive queries usually take minutes or hours. However, Hive scales very well and can execute queries across data on the order of Petabytes. The Hadoop ecosystem and Hive are under very active development, and speedups are achieved with every new iteration. In 2013 many new developments are due to be introduced, including a new Hadoop framework that goes beyond map-reduce and manages resources better, and Hive improvements that will make queries on smaller data faster and interactive.


Getting Started – Setup Hive

In normal use, Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore.

When starting out with Hive, it is convenient to run the metastore on your local machine. In this configuration, which is the default, the Hive table definitions that you create will be local to your machine, so you can't share them with other users. We'll see later how to configure a shared remote metastore, which is the norm in production environments.

Installation of Hive is straightforward. Java 6 is a prerequisite; on Windows, you will need Cygwin, too. You also need to have the same version of Hadoop installed locally that your cluster is running. Of course, you may choose to run Hadoop locally, either in standalone or pseudo-distributed mode, while getting started with Hive.

Which Versions of Hadoop Does Hive Work With

Any given release of Hive is designed to work with multiple versions of Hadoop. Generally, Hive works with the latest release of Hadoop, as well as supporting a number of older versions. For example, Hive 0.5.0 is compatible with versions of Hadoop between 0.17.x and 0.20.x (inclusive). You don’t need to do anything special to tell Hive which version of Hadoop you are using, beyond making sure that the hadoop executable is on the path or setting the HADOOP_HOME environment variable.


Create, Load, Select, Drop – The basics

Hive uses a metastore – a database – to store information about the tables it knows about. To Hive, a table often is not much more than the information about where its data is stored and how it is formatted. We do not have to know much about the metastore itself to use Hive, though.

Managed Table

A managed table's data is controlled by Hive. Hive will create a directory for the data on HDFS, and when we drop the table it will delete the data. Later in this chapter we will see that we can manage the data ourselves with an EXTERNAL table.

Creating an empty table

Let us create a table which will be a simple list of the names of all countries. Go to the Beeswax query editor and execute the following query:

CREATE TABLE country_list (name STRING);

You can now find the table in the Beeswax table list. Once you have selected the table, you can view the file location on HDFS which Hive automatically chose and created for you. The table and directory are empty. Note: We can define the location of a new table ourselves, as we will learn later.
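
As a preview, a minimal sketch of such an externally managed table is shown below; the table name and HDFS path are hypothetical and the statement is not needed for this exercise:

CREATE EXTERNAL TABLE country_list_ext (name STRING)
LOCATION '/user/hduser/country_data';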

Describing a table

We can get all the information we need about a table through a query too. Go back to the query editor and execute:


DESCRIBE EXTENDED country_list;

The result describes the schema of the table and detailed table information. This includes the location of the table and the information that it uses TextInputFormat, which is the default setting in the Hive setup.

Loading data into a table

We want to load data into the table, so we need to upload it to HDFS. We know the table has a single column for the country name, uses a simple text format, and that Hive always uses newline characters as row delimiters. All we need to do to add data is upload one or several files into the directory linked with the table. The files have to be formatted to have one country name on each line. Go back to the table list and select the country_list table. Select View File Location and you will be viewing the HDFS directory where the table stores its data. Upload the coutry_example.tsv file into the directory using the file upload function on the top right of the interface.
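
As an aside, the upload can also be replaced by a LOAD DATA statement in the query editor; this sketch assumes the file has already been copied to a hypothetical HDFS path:

LOAD DATA INPATH '/user/hduser/uploads/coutry_example.tsv' INTO TABLE country_list;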

Query a table

Hive is very flexible and checks the table's data location for every query. Adding or removing data on the file system is reflected in the table. After adding the file, its content is now automatically retrieved as table data. Try it out. Go back to the Beeswax table list and select country_list again. Click on browse data on the left. You should see four countries, which are read from the file. You can also query the table to get the same result. Go to the query editor and execute:


SELECT * FROM country_list;

Drop a table

You can drop the table in the table view of Beeswax through the graphic interface. Go to the query editor and execute the equivalent query to drop the table:

DROP TABLE country_list;

The table and the HDFS directory and file(s) are deleted.

Configuring Hive with Hadoop

Note: We assume Hadoop is already installed.

Installation and Configuration of Hive

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases). Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z. You can download the tarball here: http://mirrors.ucr.ac.cr/apache//hive/stable/hive-0.8.1-bin.tar.gz

Now unpack the tar file under /usr/local/ and then create a convenient symlink:

#ln -s /usr/local/hive-0.8.1-bin/ /usr/local/hive


#chown -R hduser:hadoop hive

Now set the path in ~/.bashrc:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Now add a classpath entry in the hadoop_home/conf/hadoop-env.sh file:

export HADOOP_CLASSPATH=/usr/local/hadoop:/usr/local/hive/lib/*:/usr/local/hbase/hbase-0.90.4.jar:/usr/local/hbase/hbase-0.90.4-tests.jar:/usr/local/zookeeper/zookeeper-3.4.5.jar:/usr/local/hbase/lib/zookeeper-3.3.2.jar:/usr/local/derby

Create Hive directories within HDFS and set permissions for table creation

If you plan to have more than one Hive user sharing a Hadoop cluster, then you need to make the directories that Hive uses writable by all users. The following commands will create the directories and set their permissions appropriately:

$ hadoop fs -mkdir /user/hive/warehouse


$ hadoop fs -mkdir /tmp

$ hadoop fs -chmod g+w /user/hive/warehouse

$ hadoop fs -chmod g+w /tmp

Now we will edit /usr/local/hive/conf/hive-env.sh.

First, rename the file hive-env.sh.template to hive-env.sh, then add:

# Set HADOOP_HOME to point to a specific hadoop install directory

HADOOP_HOME=/usr/local/hadoop

Configuring Hive

Hive is configured using an XML configuration file like Hadoop's. The file is called hive-site.xml and is located in Hive's conf directory. This file is where you can set properties that you want applied every time you run Hive. The same directory contains hive-default.xml, which documents the properties that Hive exposes and their default values. You can override the configuration directory in which Hive looks for hive-site.xml by passing the --config option to the hive command.
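
Individual properties can also be inspected or overridden for a single session from the Hive shell with SET; a small sketch (the property shown is Hive's warehouse directory setting and the value is just the usual default):

hive> SET hive.metastore.warehouse.dir;
hive> SET hive.metastore.warehouse.dir=/user/hive/warehouse;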


Now go to /usr/local/hive/conf/ and rename the remaining template files:

$ mv conf/hive-default.xml.template conf/hive-site.xml
$ mv conf/hive-log4j.properties.template conf/hive-log4j.properties
$ mv conf/hive-env.sh.template conf/hive-env.sh

Before starting Hive, copy the hive-common and hive-exec jars from #hive/lib to #hadoop/lib, and then set the permissions:

$ chmod -R 777 ~/hadoop/lib

Now start HDFS with start-all.sh and then start Hive:

$ /usr/local/hive/bin/hive

Launch Hive and create sample tables

$ hive

hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> CREATE TABLE kjv (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> exit;

Download sample data from Cloudera

$ wget -O shakespeare.tar.gz https://github.com/cloudera/cloudera-training/blob/master/data/shakespeare.tar.gz?raw=true

$ wget -O bible.tar.gz https://github.com/cloudera/cloudera-training/blob/master/data/bible.tar.gz?raw=true

$ tar -zvxf bible.tar.gz

$ tar -zvxf shakespeare.tar.gz

Put the Shakespeare sample data into hdfs

$ hadoop fs -mkdir shakespeare-input

$ hadoop fs -put /usr/local/shakespeare/all-shakespeare /user/hduser/shakespeare-input

$ hadoop fs -ls shakespeare-input


Run the “grep” sample against the hdfs directory “shakespeare-input” and place results in “shakespeare_freq”

$ hadoop jar /usr/local/hadoop/hadoop-examples-1.0.3.jar grep shakespeare-input shakespeare_freq '\w+'

You will then see the job output in the console.

You can check the output with this command:

$ hadoop fs -ls shakespeare_freq


Follow the same steps to perform another example. Put the Bible sample data into HDFS:

$ hadoop fs -mkdir bible-input

$ hadoop fs -put ~/bible/all-bible /user/hduser/bible-input

$ hadoop fs -ls bible-input

Run the “grep” sample against the hdfs directory “bible-input” and place results in “bible_freq”

$ hadoop jar ~/hadoop-1.0.3/hadoop-examples-1.0.3.jar grep bible-input bible_freq '\w+'

$ hadoop fs -ls bible_freq

Cleanup the logs

$ hadoop fs -rmr bible_freq/_logs

$ hadoop fs -rmr shakespeare_freq/_logs

MapReduce Scripts


Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses make it possible to invoke an external script or program from Hive. Suppose we want to use a script to filter out rows that don't meet some condition, for example rows of the shakespeare table whose frequency falls below some threshold.
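
A minimal sketch of such a query is shown below; the script name filter_rows.py and its location are hypothetical and stand for any executable that reads tab-separated rows on stdin and writes the rows it wants to keep to stdout:

hive> ADD FILE /home/hduser/filter_rows.py;
hive> SELECT TRANSFORM (freq, word)
    > USING 'python filter_rows.py'
    > AS (freq, word)
    > FROM shakespeare;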

Open Hive


$ hive
hive> load data inpath "shakespeare_freq" into table shakespeare;
hive> select * from shakespeare limit 10;
hive> select * from shakespeare where freq > 20 sort by freq asc limit 10;
hive> select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;
hive> explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;
hive> load data inpath "bible_freq" into table kjv;
hive> create table merged (word string, shake_f int, kjv_f int);
hive> insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word) where s.freq >= 1 and k.freq >= 1;
hive> select * from merged limit 20;


hive> select word, shake_f, kjv_f, (shake_f + kjv_f) as ss from merged sort by ss desc limit 20;

Now you know; and knowing is half the battle.

Total MapReduce CPU Time Spent: 6 seconds 140 msec

OK
the 25848 62394 88242
and 19671 38985 58656
of 16700 34654 51354
I 23031 8854 31885
to 18038 13526 31564
in 10797 12445 23242
a 14170 8057 22227
that 8869 12603 21472
And 7800 12846 20646
is 8882 6884 15766
my 11297 4135 15432
you 12702 2720 15422
he 5720 9672 15392
his 6817 8385 15202
not 8409 6591 15000
be 6773 6913 13686
for 6309 7270 13579
with 7284 6057 13341
it 7178 5917 13095
shall 3293 9764 13057

Time taken: 67.711 seconds

The Hive Web Interface (HWI)

As an alternative to the shell, you might want to try Hive's simple web interface. Start it using the following commands:

% export ANT_LIB=/path/to/ant/lib

% hive --service hwi

(You only need to set the ANT_LIB environment variable if Ant's library is not found in /opt/ant/lib on your system.) Then navigate to http://localhost:9999/hwi in your browser. From there, you can browse Hive database schemas and create sessions for issuing commands and queries. It's possible to run the web interface as a shared service to give users within an organization access to Hive without having to install any client software. There are more details on the Hive Web Interface on the Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface.

Thrift Client

The Hive Thrift Client makes it easy to run Hive commands from a wide range of programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby. They can be found in the src/service/src subdirectory in the Hive distribution.

JDBC Driver

Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port. (The driver makes calls to an interface implemented by the Hive Thrift Client using the Java Thrift bindings.)

You may alternatively choose to connect to Hive via JDBC in embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same JVM as the application invoking it, so there is no need to launch it as a standalone server since it does not use the Thrift service or the Hive Thrift Client.

ODBC Driver


The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.) The ODBC driver is still in development, so you should refer to the latest instructions on the Hive wiki for how to build and run it. There are more details on using these clients on the Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient

The Metastore

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration. Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can only have one Hive session open at a time that shares the same metastore. Trying to start a second session gives the error Failed to start database 'metastore_db' when it attempts to open a connection to the metastore.

Buckets

There are two reasons why you might want to organize your tables (or partitions) into buckets. The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns— which include the join columns—can be efficiently implemented as a map-side join.


The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them. We shall see how to do efficient sampling at the end of this section.

The data within a bucket may additionally be sorted by one or more columns. This allows even more efficient map-side joins, since the join of each bucket becomes an efficient merge-sort. The syntax for declaring that a table has sorted buckets is:

CREATE TABLE bucketed_users (id INT, name STRING)

CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

How can we make sure the data in our table is bucketed? While it’s possible to load data generated outside Hive into a bucketed table, it’s often easier to get Hive to do the bucketing, usually from an existing table.

Take an unbucketed users table:

hive> SELECT * FROM users;

0 Nat

2 Joe

3 Kay

4 Ann



To populate the bucketed table, we need to set the hive.enforce.bucketing property to true, so that Hive knows to create the number of buckets declared in the table definition. Then it is a matter of just using the INSERT command:

INSERT OVERWRITE TABLE bucketed_users

SELECT * FROM users;
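
The property mentioned above would be enabled first; a one-line sketch of that setting:

SET hive.enforce.bucketing = true;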

Physically, each bucket is just a file in the table (or partition) directory. The file name is not important, but bucket n is the nth file when arranged in lexicographic order. In fact, buckets correspond to MapReduce output file partitions: a job will produce as many buckets (output files) as reduce tasks. We can see this by looking at the layout of the bucketed_users table we just created. Running this command:

hive> dfs -ls /user/hive/warehouse/bucketed_users;

shows that four files were created, with the following names (the name is generated by Hive and incorporates a timestamp, so it will change from run to run):

attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0

The first bucket contains the users with IDs 0 and 4, since for an INT the hash is the integer itself, and the value is reduced modulo the number of buckets (4 in this case): 0 mod 4 = 0 and 4 mod 4 = 0, so Nat and Ann land in the first file, while Joe (2 mod 4 = 2) and Kay (3 mod 4 = 3) end up in the third and fourth files.


hive> dfs -cat /user/hive/warehouse/bucketed_users/*0_0;

0Nat

4Ann

We can see the same thing by sampling the table using the TABLESAMPLE clause, which restricts the query to a fraction of the buckets in the table rather than the whole table:

hive> SELECT * FROM bucketed_users

> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

0 Nat

4 Ann

Bucket numbering is 1-based, so this query retrieves all the users from the first of four buckets. For a large, evenly distributed dataset, approximately one quarter of the table's rows would be returned. It's possible to sample a number of buckets by specifying a different proportion (which need not be an exact multiple of the number of buckets, since sampling is not intended to be a precise operation). For example, this query returns half of the buckets:

hive> SELECT * FROM bucketed_users

> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);

0 Nat


4 Ann

2 Joe

Sampling a bucketed table is very efficient, since the query only has to read the buckets that match the TABLESAMPLE clause. Contrast this with sampling a non-bucketed table using the rand() function, where the whole input dataset is scanned, even if a very small sample is needed:

hive> SELECT * FROM users

> TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());

2 Joe

Storage Formats

There are two dimensions that govern table storage in Hive: the row format and the file format. The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word for a Serializer-Deserializer.

When acting as a deserializer, which is the case when querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When used as a serializer, which is the case when performing an INSERT or CTAS, the table's SerDe will serialize Hive's internal representation of a row of data into the bytes that are written to the output file.

The file format dictates the container format for fields in a row. The simplest format is a plain text file, but there are row-oriented and column-oriented binary formats available, too.
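
To make both dimensions concrete, here is a sketch of a table declaration that spells out the row format (a delimited text SerDe) and the file format (plain text); the table and column names are purely illustrative:

CREATE TABLE log_lines (ts STRING, level STRING, message STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;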

HiveQL

Hive's SQL dialect, called HiveQL, does not support the full SQL-92 specification. There are a number of reasons for this. Being a fairly young project, it has not had time to provide the full repertoire of SQL-92 language constructs. More fundamentally, SQL-92 compliance has never been an explicit project goal; rather, as an open source project, features were added by developers to meet their users' needs. Furthermore, Hive has some extensions that are not in SQL-92, which have been inspired by syntax from other database systems, notably MySQL. In fact, to a first-order approximation, HiveQL most closely resembles MySQL's SQL dialect.

Some of Hive's extensions to SQL-92 were inspired by MapReduce, such as multitable inserts and the TRANSFORM, MAP, and REDUCE clauses.

It turns out that some SQL-92 constructs that are missing from HiveQL are easy to work around using other language features, so there has not been much pressure to implement them. For example, SELECT statements do not (at the time of writing) support a HAVING clause in HiveQL, but the same result can be achieved by adding a subquery in the FROM clause.
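
As an illustration of the workaround, a grouped query against the shakespeare table from the earlier examples that would use HAVING in SQL-92 can be sketched with a nested query in the FROM clause (the threshold of 10 is arbitrary):

SELECT freq, f2
FROM (
  SELECT freq, count(1) AS f2
  FROM shakespeare
  GROUP BY freq
) freq_counts
WHERE f2 > 10;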


Table 12-2. A high-level comparison of SQL and HiveQL

Feature                | SQL                                                                       | HiveQL
Updates                | UPDATE, INSERT, DELETE                                                    | INSERT OVERWRITE TABLE (populates a whole table or partition)
Transactions           | Supported                                                                 | Not supported
Indexes                | Supported                                                                 | Not supported
Latency                | Sub-second                                                                | Minutes
Data types             | Integral, floating point, fixed point, text and binary strings, temporal | Integral, floating point, boolean, string, array, map, struct
Functions              | Hundreds of built-in functions                                            | Dozens of built-in functions
Multitable inserts     | Not supported                                                             | Supported
Create table as select | Not valid SQL-92, but found in some databases                             | Supported
Select                 | SQL-92                                                                    | Single table or view in the FROM clause; SORT BY for partial ordering; LIMIT to limit the number of rows returned

Data Types

Hive supports both primitive and complex data types. Primitives include numeric, boolean, string, and timestamp types. The complex data types include arrays, maps, and structs. Hive’s data types are listed in Table 12-3. Note that the literals shown are those used from within HiveQL; they are not the serialized form used in the table’s storage format

Table 12-3. Hive data types

Category  | Type     | Description                                                                                   | Literal examples
Primitive | TINYINT  | 1-byte (8-bit) signed integer, from -128 to 127                                               | 1
Primitive | SMALLINT | 2-byte (16-bit) signed integer, from -32,768 to 32,767                                        | 1
Primitive | INT      | 4-byte (32-bit) signed integer, from -2,147,483,648 to 2,147,483,647                          | 1
Primitive | BIGINT   | 8-byte (64-bit) signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807  | 1
Primitive | DOUBLE   | 8-byte (64-bit) double-precision floating-point number                                        | 1.0
Primitive | BOOLEAN  | true/false value                                                                              | TRUE
Complex   | ARRAY    | An ordered collection of fields. The fields must all be of the same type                      | array(1, 2)
Complex   | MAP      | An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must be the same type, and the values must be the same type | map('a', 1, 'b', 2)
Complex   | STRUCT   | A collection of named fields. The fields may be of different types                            | struct('a', 1, 1.0)
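
As an illustration of how the complex types are declared and accessed, a hypothetical table using all three could look like this:

CREATE TABLE complex_types_demo (
  c1 ARRAY<INT>,
  c2 MAP<STRING, INT>,
  c3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);

SELECT c1[0], c2['b'], c3.c FROM complex_types_demo;

Array indexing is 0-based, map values are looked up by key, and struct fields are accessed with dot notation.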

Operators and Functions

The usual set of SQL operators is provided by Hive: relational operators (such as x = 'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern matching), arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR y for logical OR). The operators match those in MySQL, which deviates from SQL-92 since || is logical OR, not string concatenation. Use the concat function for the latter in both MySQL and Hive.
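
For instance, to concatenate a literal with a column of the merged table created earlier, one would write something along these lines:

SELECT concat('word: ', word) FROM merged LIMIT 5;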

Hive comes with a large number of built-in functions (too many to list here), divided into categories including mathematical and statistical functions, string functions, date functions (for operating on string representations of dates), conditional functions, aggregate functions, and functions for working with XML (using the xpath function) and JSON. You can retrieve a list of functions from the Hive shell by typing SHOW FUNCTIONS. To get brief usage instructions for a particular function, use the DESCRIBE command:

hive> DESCRIBE FUNCTION length;
length(str) - Returns the length of str

In the case when there is no built-in function that does what you want, you can write your own user-defined function (UDF).

Joins

One of the nice things about using Hive, rather than raw MapReduce, is that it makes performing commonly used operations very simple. Join operations are a case in point, given how involved they are to implement in MapReduce.

Inner joins

The simplest kind of join is the inner join, where each match in the input tables results in a row in the output. Consider two small demonstration tables: sales, which lists the names of people and the ID of the item they bought; and things, which lists the item ID and its name:

hive> SELECT * FROM sales;

Joe 2

Hank 4

Ali 0

Eve 3

Hank 2

hive> SELECT * FROM things;

2 Tie


4 Coat

3 Hat

1 Scarf

We can perform an inner join on the two tables as follows:

hive> SELECT sales.*, things.*

> FROM sales JOIN things ON (sales.id = things.id);

Joe 2 2 Tie

Hank 2 2 Tie

Eve 3 3 Hat

Hank 4 4 Coat

The table in the FROM clause (sales) is joined with the table in the JOIN clause (things), using the predicate in the ON clause. Hive only supports equijoins, which means that only equality can be used in the join predicate, which here matches on the id column in both tables.

Some databases, such as MySQL and Oracle, allow you to list the join tables in the FROM clause and specify the join condition in the WHERE clause of a SELECT statement. However, this syntax is not supported in Hive, so


the following fails with a parse error:

SELECT sales.*, things.*

FROM sales, things

WHERE sales.id = things.id;

Hive only allows a single table in the FROM clause, and joins must follow the SQL-92 JOIN clause syntax.

In Hive, you can join on multiple columns in the join predicate by specifying a series of expressions, separated by AND keywords. You can also join more than two tables by supplying additional JOIN...ON... clauses in the query. Hive is intelligent about trying to minimize the number of MapReduce jobs to perform the joins.
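
A sketch of both variations; the orders, shipments, and colors tables and their columns are hypothetical and only illustrate the syntax:

SELECT a.*, b.*
FROM orders a JOIN shipments b
  ON (a.customer_id = b.customer_id AND a.order_id = b.order_id);

SELECT s.*, t.*, c.*
FROM sales s
  JOIN things t ON (s.id = t.id)
  JOIN colors c ON (t.id = c.id);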

Note: You can see an example of this in the tasks above.

Outer joins

Outer joins allow you to find nonmatches in the tables being joined. In the current example, when we performed an inner join, the row for Ali did not appear in the output, since the ID of the item she purchased was not present in the things table.


If we change the join type to LEFT OUTER JOIN, then the query will return a row for every row in the left table (sales), even if there is no corresponding row in the table it is being joined to (things):

hive> SELECT sales.*, things.*

> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);

Ali 0 NULL NULL

Joe 2 2 Tie

Hank 2 2 Tie

Eve 3 3 Hat

Hank 4 4 Coat

Notice that the row for Ali is now returned, and the columns from the things table are NULL, since there is no match. Hive supports right outer joins, which reverse the roles of the tables relative to the left join. In this case, all items from the things table are included, even those that weren't purchased by anyone (a scarf):

hive> SELECT sales.*, things.*

> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);

NULL NULL 1 Scarf

Joe 2 2 Tie

Hank 2 2 Tie

Eve 3 3 Hat

Hank 4 4 Coat

Finally, there is a full outer join, where the output has a row for each row from both tables in the join:

hive> SELECT sales.*, things.*

> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);

Ali 0 NULL NULL

NULL NULL 1 Scarf

Joe 2 2 Tie

Hank 2 2 Tie

Eve 3 3 Hat

Hank 4 4 Coat

Semi joins

Hive doesn’t support IN subqueries (at the time of writing), but you can use a LEFT SEMI JOIN to do the same thing. Consider this IN subquery, which finds all the items in the things table that are in the sales table:

SELECT * FROM things


WHERE things.id IN (SELECT id from sales);

We can rewrite it as follows:

hive> SELECT *

> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);

2 Tie

3 Hat

4 Coat

There is a restriction that we must observe for LEFT SEMI JOIN queries: the right table (sales) may only appear in the ON clause. It cannot be referenced in a SELECT expression, for example.

Map joins

If one table is small enough to fit in memory, then Hive can load the smaller table into memory to perform the join in each of the mappers. The syntax for specifying a map join is a hint embedded in an SQL C-style comment:

SELECT /*+ MAPJOIN(things) */ sales.*, things.*

FROM sales JOIN things ON (sales.id = things.id);


The job to execute this query has no reducers, so this query would not work for a RIGHT or FULL OUTER JOIN, since absence of matching can only be detected in an aggregating (reduce) step across all the inputs.

Map joins can take advantage of bucketed tables, since a mapper working on a bucket of the left table only needs to load the corresponding buckets of the right table to perform the join. The syntax for the join is the same as for the in-memory case above; however, you also need to enable the optimization with:

SET hive.optimize.bucketmapjoin=true;
