Apache Hive

Developer(s): Contributors
Stable release: 2.0.0[1] / February 15, 2016
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Data warehouse
License: Apache License 2.0
Website: hive.apache.org

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.[2] Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java API without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive supports easy portability of SQL-based applications to Hadoop.[3] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).[4][5] Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.[6]

Features

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL[7] with schema on read and transparently converts queries to MapReduce, Apache Tez,[8] or Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes.[9] Other features of Hive include the following; a short HiveQL sketch after the list illustrates engine selection and storage formats:

- By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.[10]
- Four file formats are supported in Hive: TEXTFILE,[11] SEQUENCEFILE, ORC,[12] and RCFILE.[13][14][15] Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.[16][17] Additional Hive plugins support querying of the Bitcoin blockchain.[18]
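A minimal HiveQL sketch of these two features. The table name and columns are hypothetical; SET hive.execution.engine and STORED AS are standard Hive constructs:

-- Choose the execution engine for this session (mr, tez, or spark).
SET hive.execution.engine=tez;

-- Create a table stored in the columnar ORC format; TEXTFILE,
-- SEQUENCEFILE, RCFILE, or PARQUET could be named instead.
CREATE TABLE page_views (   -- hypothetical table
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC;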

Architecture

Major components of the Hive architecture are:

- Metastore: stores metadata for each of the tables, such as their schema and location, as well as the partition metadata that helps the driver track the progress of various data sets distributed over the cluster.[20]
- Driver: acts like a controller which receives the HiveQL statements, starts their execution, monitors the life cycle and progress of the execution, and acts as a collection point of the data or query results obtained after the reduce operation.[20]
- Compiler: compiles a HiveQL query, converting it first to an abstract syntax tree[21] and then to an execution plan.
- Optimizer: performs various transformations on the execution plan, such as splitting and pipelining tasks, to improve performance and scalability.[22]
- Executor: executes the plan's tasks in order of their dependencies by interacting with Hadoop's job tracker.
- CLI, UI, and Thrift Server: the command-line interface and UI let users submit queries and monitor their status, while the Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.[23]

Figure: Hive architecture[19]

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but offers only basic support for indexes. HiveQL also lacks support for transactions and materialized views, and offers only limited subquery support.[24][25] Support for insert, update, and delete with full ACID functionality was made available with release 0.14.[26]
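A brief sketch of the multi-table insert extension, reusing the hypothetical page_views table from above; the target tables are likewise hypothetical:

-- Hive extension: a single scan of page_views feeds two tables.
FROM page_views pv
INSERT OVERWRITE TABLE views_by_url    -- hypothetical target table
  SELECT pv.url, count(1)
  GROUP BY pv.url
INSERT OVERWRITE TABLE views_by_user   -- hypothetical target table
  SELECT pv.user_id, count(1)
  GROUP BY pv.user_id;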

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.[27]
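The plan a query compiles to can be inspected without running the query by prefixing it with EXPLAIN, as in this illustrative statement:

-- Prints the stage graph (the DAG of jobs) the query compiles to,
-- without executing it.
EXPLAIN SELECT url, count(1) FROM page_views GROUP BY url;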

Example

"Word count" program

The word count program counts the number of times each word occurs in the input. It can be written in HiveQL as follows:[3]

-- Recreate the docs table and load the raw text file into it.
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

-- Split each line on whitespace, explode the words into rows, then
-- count and sort them. ('\\s' passes the regex \s to split; a single
-- backslash would be consumed by the string literal.)
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
 (SELECT explode(split(line, '\\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

A brief explanation of each of the statements is as follows:

DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);

Checks if table docs exists and drops it if it does. Creates a new table called docs with a single column of type STRING called line.

LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (in this case “input_file”) into the table. OVERWRITE specifies that the target table is overwritten; otherwise the data would be appended.

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
 (SELECT explode(split(line, '\\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count. It draws its input from the inner query (SELECT explode(split(line, '\\s')) AS word FROM docs) temp, which splits the input lines into words and places them in separate rows of a temporary result aliased as temp. The GROUP BY word groups the results by key, so the count column holds the number of occurrences of each word in the word column. The ORDER BY word sorts the words alphabetically.

Comparison with traditional databases

The storage and querying operations of Hive closely resemble those of traditional databases. While Hive works on an SQL dialect, there are many differences in the structure and working of Hive compared to relational databases, mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.

In traditional databases, a table's schema is enforced at the time the data is loaded, which lets the database make sure the data follows the representation of the table as specified by the user. This design is called schema on write. Hive, by contrast, does not verify data against the table schema at load time; instead it performs run-time checks when the data is read. This model is called schema on read.[24] The two approaches have their own advantages and drawbacks. Checking data against the table schema at load time adds overhead, which is why traditional databases take longer to load data; but a quality check performed at load time ensures that the data is not corrupt, and early detection of corrupt data enables early exception handling. Because the tables have the schema ready after the data load, they also show better query-time performance. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load at the cost of comparatively slower performance at query time. Hive does have an advantage when the schema is not available at load time and is instead generated later dynamically.[24]
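A minimal sketch of schema on read, assuming tab-delimited files already sit in an HDFS directory; the table name, columns, and path are hypothetical:

-- No data is moved or validated here; the schema is simply laid
-- over files already in HDFS and applied when queries read them.
CREATE EXTERNAL TABLE raw_logs (   -- hypothetical table
  ts    STRING,
  level STRING,
  msg   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';             -- hypothetical HDFS directory

-- Malformed fields surface as NULLs at query time, not at load time.
SELECT level, count(1) FROM raw_logs GROUP BY level;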

Transactions are key operations in traditional databases. A typical RDBMS supports all four properties of transactions (ACID): atomicity, consistency, isolation, and durability. Transactions in Hive were introduced in Hive 0.13 but were limited to the partition level.[28] Only in the recent version Hive 0.14 were these functions fully added to support complete ACID properties. The delay is because Hadoop does not support row-level updates over specific partitions: partitioned data is immutable, and a new table with the updated values has to be created. Hive 0.14 and later support row-level transactions such as INSERT, UPDATE, and DELETE.[29] Enabling INSERT, UPDATE, and DELETE transactions requires setting appropriate values for configuration properties such as hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode, as sketched below.[30]
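A minimal sketch of enabling ACID operations, assuming Hive 0.14 or later. The table definition is hypothetical, and the hive.txn.manager setting is an assumption beyond the properties named above, since ACID tables conventionally require a transaction-capable lock manager:

-- Session-level settings named in the text.
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Assumption: a lock manager that supports transactions is also needed.
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must be bucketed, stored as ORC, and flagged transactional.
CREATE TABLE accounts (            -- hypothetical table
  id      INT,
  balance DECIMAL(10,2)
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
DELETE FROM accounts WHERE balance = 0;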

Security

Hive v0.7.0 added integration with Hadoop security. Hadoop began using Kerberos support to provide security; Kerberos allows for mutual authentication between client and server, and in this system the client’s request for a ticket is passed along with the request. Previous versions of Hadoop had several issues, such as users being able to spoof their username by setting the hadoop.job.ugi property, and MapReduce operations being run under the same user (hadoop or mapred). With Hive v0.7.0’s integration with Hadoop security, these issues have largely been fixed: TaskTracker jobs are run by the user who launched them, and the username can no longer be spoofed by setting the hadoop.job.ugi property. Permissions for newly created files in Hive are dictated by HDFS. The HDFS (Hadoop Distributed File System) authorization model resembles that of the Unix file system, with three entities (user, group, and others) and three permissions (read, write, and execute). The default permissions for newly created files can be set by changing the umask value of the Hive configuration variable hive.files.umask.value, as in the sketch below.[3]
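A one-line sketch of the umask setting described above; the value 0002 is an assumption chosen for illustration:

-- With umask 0002, new files default to rw-rw-r-- (664): full access
-- for owner and group, read-only for others.
SET hive.files.umask.value=0002;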

References

  1. "Apache Hive Download News".
  2. Venner, Jason (2009). Pro Hadoop. Apress. ISBN 978-1-4302-1942-2.
  3. Programming Hive [Book].
  4. Use Case Study of Hive/Hadoop
  5. OSCON Data 2011, Adrian Cockcroft, "Data Flow at Netflix" on YouTube
  6. Amazon Elastic MapReduce Developer Guide
  7. HiveQL Language Manual
  8. Apache Tez
  9. Working with Students to Improve Indexing in Apache Hive
  10. Lam, Chuck (2010). Hadoop in Action. Manning Publications. ISBN 1-935182-19-6.
  11. Optimising Hadoop and Big Data with Text and Hive
  12. LanguageManual ORC
  13. Faster Big Data on Hadoop with Hive and RCFile
  14. Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop
  15. Yongqiang He; Rubao Lee; Yin Huai; Zheng Shao; Namit Jain; Xiaodong Zhang; Zhiwei Xu. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (PDF).
  16. "Parquet". 18 Dec 2014. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
  17. Massie, Matt (21 August 2013). "A Powerful Big Data Trio: Spark, Parquet and Avro". zenfractal.com. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
  18. Franke, Jörn. "Hive & Bitcoin: Analytics on Blockchain data with SQL".
  19. Thusoo, Ashish; Sarma, Joydeep Sen; Jain, Namit; Shao, Zheng; Chakka, Prasad; Anthony, Suresh; Liu, Hao; Wyckoff, Pete; Murthy, Raghotham (2009-08-01). "Hive: A Warehousing Solution over a Map-reduce Framework". Proc. VLDB Endow. 2 (2): 1626–1629. doi:10.14778/1687553.1687609. ISSN 2150-8097.
  20. 1 2 "Design - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
  21. "Abstract Syntax Tree". c2.com. Retrieved 2016-09-12.
  22. Dokeroglu, Tansel; Ozal, Serkan; Bayir, MuratAli; Cinar, MuhammetSerkan; Cosar, Ahmet (2014-07-29). "Improving the performance of Hadoop Hive by sharing scan and computation tasks". Journal of Cloud Computing. 3 (1): 1–11. doi:10.1186/s13677-014-0012-6.
  23. "HiveServer - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
  24. White, Tom (2010). Hadoop: The Definitive Guide. O'Reilly Media. ISBN 978-1-4493-8973-4.
  25. Hive Language Manual
  26. ACID and Transactions in Hive
  27. Hive A Warehousing Solution Over a MapReduce Framework
  28. "Introduction to Hive transactions". datametica.com. Retrieved 2016-09-12.
  29. "Hive Transactions - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
  30. "Configuration Properties - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
